VMRFANet: View-Specific Multi-Receptive Field Attention Network for Person Re-identification
Honglong Cai, Yuedong Fang, Zhiguan Wang, Tingchun Yeh, Jinxing Cheng
Suning Commerce R&D Center USA
{honglong.cai, yuedong.fang, doris.wang, tingchun.yeh, jim.cheng}@ussuning.com

Keywords: Person Re-identification, Attention, View Specific, Data Augmentation

Abstract: Person re-identification (re-ID) aims to retrieve the same person across different cameras. In practice, it still remains a challenging task due to background clutter, variations in body poses and view conditions, inaccurate bounding box detection, etc. To tackle these issues, in this paper we propose a novel multi-receptive field attention (MRFA) module that utilizes filters of various sizes to help the network focus on informative pixels. Besides, we present a view-specific mechanism that guides the attention module to handle the variation of view conditions. Moreover, we introduce a Gaussian horizontal random cropping/padding method which further improves the robustness of our proposed network. Comprehensive experiments demonstrate the effectiveness of each component. Our method achieves 95.5% / 88.1% in rank-1 / mAP on Market-1501, 88.9% / 80.0% on DukeMTMC-reID, 81.1% / 78.8% on the CUHK03 labeled dataset, and 78.9% / 75.3% on the CUHK03 detected dataset, outperforming current state-of-the-art methods.
Image-based person re-identification (re-ID) aims to search for people among a large number of bounding boxes that have been detected across different cameras. Although extensive efforts and much progress have been made in the past few years, person re-ID remains a challenging task in computer vision. The obstacles mainly come from the low resolution of images, background clutter, variations of person poses, etc.

Nowadays, deep features of pedestrian bounding boxes extracted through a convolutional neural network (CNN) have been demonstrated to be more discriminative and robust. However, most existing methods only learn global features from whole human-body images, so some local discriminative information of specific parts may be ignored. To address this issue, some recent works (Sun et al., 2018; Wang et al., 2018b; Zhang et al., 2017) achieved state-of-the-art performance by dividing the extracted human image feature map into horizontal stripes and aggregating local representations from these fixed parts. Nevertheless, the drawbacks of these part-based models are still obvious: 1) Feature units within each local feature map are treated equally by applying global average/maximum pooling to get the refined feature representation, so the resulting models cannot focus more on discriminative local regions. 2) Pre-defined feature map partition strategies are likely to suffer from misalignment issues. For example, the performance of methods adopting equal partition strategies (e.g., (Sun et al., 2018)) heavily depends on the quality and robustness of pedestrian bounding box detection, which is itself a challenging task; other strategies, such as partitions based on human pose (e.g., (Yang et al., 2019)), often introduce side models trained on different datasets, in which case domain bias may come into play.

Moreover, to the best of our knowledge, none of these methods have made efforts to manage view-specific bias. The variation of view conditions across different cameras can be dramatic, so the extracted features are likely to be biased in a way that intra-class features of images from different views are pushed apart, while inter-class ones from the same view are pulled closer. To better handle these problems, adopting an attention mechanism is an intuitive and effective choice. As human vision only focuses on selective parts instead of processing the whole field of view at once, an attention mechanism aims to detect informative pixels within an image. It can help to extract features that better represent the regions of interest while suppressing non-target regions, and it can be trained along with the feature extractor in an end-to-end manner.

In this work, we explore the application of attention mechanisms to the person re-identification problem. In particular, the contributions of this paper can be summarized as follows:

• We investigate the idea of combining spatial- and channel-wise attention in a single module with receptive filters of various sizes, and mount the module to a popular strip-based re-ID baseline (Sun et al., 2018) in a parallel way. We believe this is a more general form of attention module compared to the ones in many existing structures that try to learn spatial- and channel-wise attention separately.

• We explore the potential of using the attention module to inject prior information into the feature extractor.
To be specific, we utilize the camera ID tag to guide our attention module in learning a view-specific feature mask that further improves the re-ID performance.

• We propose a novel horizontal data augmentation technique against the misalignment risk, a well-known shortcoming of strip-based models.
Strip-based models:
Recently, strip-based models have been proven to be effective in person re-ID. The Part-based Convolutional Baseline (PCB) (Sun et al., 2018) equally slices the final feature map into horizontal strips. After refined part pooling, the extracted local features are jointly trained with classification losses and concatenated as the final feature. Lately, (Wang et al., 2018b) proposed a multi-branch network to combine global and partial features at different granularities. With the combination of classification and triplet losses, it pushed re-ID performance to a new level compared with previous state-of-the-art methods. Due to its effectiveness and simplicity, we adopt a modified version of the PCB structure as the baseline in this work.
Attention mechanism in Re-ID:
Another challenge in person re-ID is imperfect bounding-box detection. To address this issue, the attention mechanism is a natural choice for aiding the network to learn where to "look". There are a few attempts in the literature that apply attention mechanisms to the re-ID task (Cai et al., 2019; Yang et al., 2019; Li et al., 2018; Chang et al., 2018). For example, (Cai et al., 2019) utilized body masks to guide the training of the attention module. (Yang et al., 2019) proposed an end-to-end trainable framework composed of local and fusion attention modules that can incorporate image partition using human key-point estimation. Our proposed MRFA module is designed to address the imperfect detection issue mentioned above. Meanwhile, unlike (Li et al., 2018) and a few other existing attention-based methods, MRFA tries to preserve the cross-correlation between spatial- and channel-wise attention.
Metric learning:
Metric learning projects images to a vector space with fixed dimensions and defines a metric to compute distances between embedded features. One direction is to study the distance function explicitly. A representative and illuminating example is (Yu et al., 2018): to tackle the unsupervised re-ID problem, they proposed a deep framework consisting of a CNN feature extractor and an asymmetric metric layer, such that the feature from the extractor is transformed specifically according to its view to form the final feature in Euclidean space. Like many other re-ID methods, we also incorporate the triplet loss in this work to enhance the feature representability. Besides, we investigate using the attention module, acting like the asymmetric metric layer, to learn a view-specific attention map.
In this section, we propose a novel attention module as well as a framework to train view-specific feature enhancement/attenuation using the attention mechanism. A data augmentation method to improve the robustness of strip-based models is also presented.
The overall architecture of our proposed model is shown in Figure 1.
Baseline network:
In this paper, we employ ResNet50 (He et al., 2015) as the backbone network with some modifications following (Sun et al., 2018): the last average pooling and fully connected layers are removed, as is the down-sampling operation at the first layer of stage 5. We denote the dimensions of the final feature map as C × H × W, where C is the encoded channel dimension and H, W are the height and width, respectively. A feature extractor is applied to the final feature map to get a 512-dimensional global feature vector. Just like PCB, we further divide the final feature map into 6 horizontal strips such that each strip is of dimension C × (H/6) × W. Then each strip is fed to a feature extractor, so we end up getting 6 local feature vectors, each of dimension 256.
Figure 1: The structure of the proposed network (VMRFANet). Two attention modules are mounted to the third and fourth stages of the ResNet50 backbone. Six local features are extracted from the last feature map together with a global feature. All seven features are concatenated and normalized to form the final descriptor of a pedestrian bounding box.

Afterward, each feature is input to a fully-connected (FC) layer and a following Softmax function to classify the identities of the input images. Finally, all 7 feature vectors (6 local and 1 global) are concatenated to form a 2048-dimensional feature vector for inference.
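For concreteness, a minimal PyTorch sketch of this strip-based baseline head is given below. The 1 × 1-convolution feature extractors, the use of average pooling, and all class/attribute names are our illustrative choices; the paper does not prescribe these implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class StripBaseline(nn.Module):
    """PCB-style baseline: one 512-d global feature + six 256-d strip features."""
    def __init__(self, num_classes, num_strips=6):
        super().__init__()
        backbone = resnet50(pretrained=True)
        # Remove the down-sampling at the first block of the last stage.
        backbone.layer4[0].conv2.stride = (1, 1)
        backbone.layer4[0].downsample[0].stride = (1, 1)
        # Keep everything up to the final C x H x W feature map.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.num_strips = num_strips
        self.local_extractors = nn.ModuleList(
            nn.Sequential(nn.Conv2d(2048, 256, 1), nn.BatchNorm2d(256), nn.ReLU())
            for _ in range(num_strips))
        self.global_extractor = nn.Sequential(
            nn.Conv2d(2048, 512, 1), nn.BatchNorm2d(512), nn.ReLU())
        self.local_classifiers = nn.ModuleList(
            nn.Linear(256, num_classes) for _ in range(num_strips))
        self.global_classifier = nn.Linear(512, num_classes)

    def forward(self, x):                                # x: (B, 3, 384, 128)
        fmap = self.backbone(x)                          # (B, 2048, 24, 8)
        g = self.global_extractor(fmap.mean(dim=(2, 3), keepdim=True)).flatten(1)
        strips = fmap.chunk(self.num_strips, dim=2)      # 6 x (B, 2048, 4, 8)
        locs = [ext(s.mean(dim=(2, 3), keepdim=True)).flatten(1)
                for ext, s in zip(self.local_extractors, strips)]
        logits = [clf(f) for clf, f in zip(self.local_classifiers, locs)]
        logits.append(self.global_classifier(g))         # 7 softmax heads
        # Final 2048-d descriptor: 6 x 256 local + 512 global, L2-normalized.
        descriptor = F.normalize(torch.cat(locs + [g], dim=1), dim=1)
        return logits, descriptor
```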
Other components:
Two Multi-Receptive Field Attention (MRFA) modules, which will be described later in detail in Section 3.2, are added to the baseline network. The first attention module takes the feature map after the stage 2 block as its input; its output mask m ∈ (0, 1)^{C×H×W} is then applied to the feature map after the stage 3 block by an element-wise multiplication. The second attention module is mounted to the stage 4 block similarly. Additionally, a feature extractor is connected to each attention module to extract a 512-dimensional feature for camera view classification, which will be explained in detail in Section 3.3.
Figure 2: The detailed structure of a Multi-Receptive Field Attention (MRFA) module.
To design the attention module, we use an Inception-like (Szegedy et al., 2016) architecture. That is, we design a shallow network with only up to four convolutional layers, adopting various filter sizes (1 × 1, 3 × 3, 5 × 5, 7 × 7). Following (Szegedy et al., 2016), we further reduce the number of parameters by factorizing the convolutions with large filters: the 5 × 5 filter into two stacked 3 × 3 convolutions, and the 7 × 7 filter into 1 × 7 and 7 × 1 convolutions, respectively. The structure of MRFA is shown in Figure 2. Our proposed attention structure can combine information from different receptive fields and learn different levels of knowledge to decide which regions we should pay more attention to. Figure 3 shows that our attention mechanism can focus on the person's body and filter out background noise.

Figure 3: Attention maps of our MRFA module. (a) (c) (e) show the original images and (b) (d) (f) illustrate the corresponding attention maps. The attention maps show that our attention mechanism can focus on the person and filter out the background noise.

The input feature of channel dimension C is first convolved by four parallel 1 × 1 layers, one per branch, to reduce the channel dimension; the concatenated branch outputs are passed through a final 1 × 1 convolution to restore the channel size to C, matching the feature from the backbone network, and a scaled tanh activation, (tanh(·) + 1)/2 ∈ (0, 1), produces the attention mask. Note that, due to the spatial down-sampling at the beginning of the stage 3 block, we need to apply average pooling after each 1 × 1 convolution so that the mask matches the spatial size of the feature map it multiplies.
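Under our reading of Figure 2, an MRFA module could be sketched as follows. The four-branch layout and the factorized filters follow the text above; the channel-reduction factor, the BN/ReLU placement, the stride-2 average pooling, and the example shapes are assumptions on our part.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, kernel, padding):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel, padding=padding),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class MRFA(nn.Module):
    def __init__(self, c_in, c_out, reduce=4, downsample=True):
        super().__init__()
        c_mid = c_in // reduce
        self.branch1 = conv_bn_relu(c_in, c_mid, 1, 0)              # 1x1 branch
        self.branch3 = nn.Sequential(conv_bn_relu(c_in, c_mid, 1, 0),
                                     conv_bn_relu(c_mid, c_mid, 3, 1))
        # 5x5 factorized into two stacked 3x3 convolutions.
        self.branch5 = nn.Sequential(conv_bn_relu(c_in, c_mid, 1, 0),
                                     conv_bn_relu(c_mid, c_mid, 3, 1),
                                     conv_bn_relu(c_mid, c_mid, 3, 1))
        # 7x7 factorized into a 1x7 followed by a 7x1 convolution.
        self.branch7 = nn.Sequential(conv_bn_relu(c_in, c_mid, 1, 0),
                                     conv_bn_relu(c_mid, c_mid, (1, 7), (0, 3)),
                                     conv_bn_relu(c_mid, c_mid, (7, 1), (3, 0)))
        # Fuse the concatenated branches and match the backbone channel size.
        self.fuse = nn.Conv2d(4 * c_mid, c_out, 1)
        # Match the spatial down-sampling of the next backbone stage.
        self.pool = nn.AvgPool2d(2) if downsample else nn.Identity()

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch3(x),
                       self.branch5(x), self.branch7(x)], dim=1)
        mask = (torch.tanh(self.fuse(y)) + 1.0) / 2.0    # values in (0, 1)
        return self.pool(mask)

# Usage: attention computed on the stage-2 feature masks the stage-3 feature.
attn = MRFA(c_in=256, c_out=512)
stage2 = torch.randn(2, 256, 96, 32)   # illustrative shapes for a 384x128 input
stage3 = torch.randn(2, 512, 48, 16)
masked = stage3 * attn(stage2)         # element-wise multiplication
```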
Our goal is to match people across different camera views distributed at different locations. The variation of cross-view person appearances can be dramatic due to various viewpoints, illumination conditions, and occlusion. As shown in Figure 4, the same person can look different under different cameras, while different persons can look similar under the same camera.

Figure 4: Example images from DukeMTMC-reID. (a) shows bounding boxes of the same person captured by three different cameras; the included backgrounds and the view conditions vary dramatically. (b) corresponds to three different identities captured by a single camera such that they appear to be visually similar. (c) indicates the case of within-view inconsistency, i.e., the same person was captured by the same camera with different occlusions.

To tackle this issue, we find it effective to utilize a view-specific transformation. To make our network aware of different camera views, we force the model to "know" which view each input bounding box belongs to; this task is thus converted into a camera ID (view) classification problem. However, in the person re-ID task the goal is to learn a camera-invariant feature, which contradicts camera ID (view) classification. To utilize the camera-specific information without affecting the learning of a camera-invariant final feature, we find it natural to incorporate the view-specific transformation into our attention mechanism instead of adding it to the backbone network. By adding camera ID (view) classification to the attention mechanism, we make it aware of the view-specific information so that it can focus on the right place without affecting the camera-invariant features extracted from the backbone network. The corresponding view-specific distance can be written as:

$$d_l(\{x_i, v_i\}, \{x_j, v_j\}) = \lVert U_{v_i}^T x_i - U_{v_j}^T x_j \rVert \qquad (1)$$

where x_i is the extracted feature of the i-th bounding box, v_i denotes the corresponding index of the camera view, and U_{v_i} is the view-specific transformation.
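As a sketch, the asymmetric distance of Eq. (1) from (Yu et al., 2018) could look as follows. The dimensions and the number of views are illustrative; note that in our model the role of the explicit transformations U_v is instead played by the view-specific attention masks.

```python
import torch

n_views, d, k = 6, 2048, 512          # illustrative sizes
U = torch.randn(n_views, d, k)        # one transformation per camera view

def view_specific_distance(x_i, v_i, x_j, v_j):
    # Each feature is projected by its own view's transformation before
    # the Euclidean comparison, as in Eq. (1).
    return torch.norm(U[v_i].T @ x_i - U[v_j].T @ x_j)

x_i, x_j = torch.randn(d), torch.randn(d)
print(view_specific_distance(x_i, 0, x_j, 3))
```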
By connecting a simple feature extractor to each attention module, we denote the extracted attention feature k (k = 1, 2) as a_k. We further add a fully connected layer to each feature extractor; the softmax loss is formulated as:

$$L_{camera}^{softmax} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{2} \log \frac{\exp(W_{v_i}^T a_{ik})}{\sum_{j=1}^{N_v} \exp(W_j^T a_{ik})} \qquad (2)$$

where W_j corresponds to the weight vector for camera ID j, N is the size of the mini-batch, and N_v is the number of cameras in the dataset.

There remains one issue that needs to be dealt with carefully: the within-view inconsistency (see row (c) in Figure 4), which arises when bounding boxes are detected at different locations within frames captured by the same camera. In that case, the view conditions can be distinct, since different parts of the background will be included. To address this issue, we adopt a label smoothing (Szegedy et al., 2016) strategy on the softmax loss in Equation 2: for a training example with ground-truth label v_i, we modify the label distribution q(j) as:

$$q'(j) = (1 - \varepsilon)\,\delta_{j,v_i} + \frac{\varepsilon}{N_v} \qquad (3)$$

Here δ_{j,v_i} is the Kronecker delta function and (1 − ε) controls the level of confidence of the view classification. Thus the final loss function for view-specific learning can be written as:

$$L_{camera} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{2} \sum_{j=1}^{N_v} q'(j) \log p(j) \qquad (4)$$

where p(j) is the predicted probability, calculated by applying the softmax function to the output vector of the fully connected layer.
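A minimal sketch of the per-module camera loss of Eqs. (2)-(4), implemented as label-smoothed cross-entropy over camera IDs; summing its value over the two attention modules (k = 1, 2) gives L_camera. The function name and defaults are ours.

```python
import torch
import torch.nn.functional as F

def camera_loss_single_module(logits, cam_ids, eps=0.1):
    """logits: (N, N_v) from the FC layer on attention feature a_k;
    cam_ids: (N,) ground-truth camera indices."""
    n_views = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)            # log p(j)
    q = torch.full_like(log_p, eps / n_views)       # eps / N_v everywhere
    # Smoothed target for the true camera ID: (1 - eps) + eps / N_v.
    q.scatter_(1, cam_ids.unsqueeze(1), 1.0 - eps + eps / n_views)
    return -(q * log_p).sum(dim=1).mean()           # mean over the mini-batch
```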
Person re-identification is essentially a zero-shot learning task, in that identities in the training set do not overlap with those in the test set. But in order to let the network learn discriminative features, we can still formulate it as a multi-class classification problem by applying a softmax cross-entropy loss:

$$L_{ID} = -\frac{1}{N} \sum_{k=1}^{7} \sum_{i=1}^{N} \log \frac{\exp(W_{y_i,k}^T x_{ik})}{\sum_{j=1}^{C} \exp(W_{j,k}^T x_{ik})} \qquad (5)$$

where k is the index of features: k ∈ [1, ..., 6] corresponds to the 6 local features and k = 7 to the global feature; W_{j,k} is the weight vector for identity j, x_{ik} is the extracted feature from each component, and C is the number of identities in the training set.

To further improve the performance and speed up the convergence, we apply the batch-hard triplet loss (Hermans et al., 2017). Each mini-batch, consisting of N images, is selected with P identities and K images from each identity:

$$L_{triplet} = \frac{1}{PK} \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1..K} \lVert x_a^{(i)} - x_p^{(i)} \rVert - \min_{n=1..K,\; j=1..P,\; j \neq i} \lVert x_a^{(i)} - x_n^{(j)} \rVert \Big]_{+} \qquad (6)$$

where x_a^{(i)}, x_p^{(i)}, and x_n^{(j)} are the concatenated and normalized final feature vectors extracted from the anchor, positive, and negative samples respectively, and m is the margin that restricts the difference between intra- and inter-class distances.

To further ensure the cross-view consistency, we also calculate a triplet loss L_{triplet}^{attn} on the 512-dimensional feature vector extracted from the feature map after applying the first attention mask.

By combining all the above losses, our final objective for end-to-end training is to minimize:

$$L_{combined} = L_{ID} + \lambda_1 L_{triplet} + \lambda_2 L_{triplet}^{attn} + \lambda_3 L_{camera} \qquad (7)$$

where λ1, λ2, and λ3 are used to balance the classification loss, the triplet losses, and the camera loss.

A major issue that strip-based models cannot circumvent is misalignment. The PCB baseline equally slices the last feature map into local strips. Although focused, the receptive field of each strip actually covers a large fraction of an input image. That is, each local strip can still "see" at least an intact part of the body. Thus, even without explicitly varying feature scales, such as fusing pyramid features or assembling multiple branches with different granularities, the potential of our baseline network to handle misalignment is still theoretically guaranteed.

Figure 5: An example of imperfect bounding box detection in the Market-1501 dataset. (a) is well detected; (b) the bottom part of the body has been cropped out; (c) too much background has been included at the bottom; (d) the top part is missing; (e) too much background has been included at the top. Imperfect bounding box detection causes the misalignment problem, which is particularly noxious to strip-based re-ID models.

So the remaining question is how to generate new data mimicking the imperfections of bounding box detection; some problematic detections found in the Market-1501 dataset are shown in Figure 5. Since the feature map is cut along the vertical direction and global pooling is applied on each strip, the baseline model is more sensitive to vertical misalignment than to its horizontal counterpart, so the commonly used random cropping/padding data augmentation is sub-optimal in this case. Instead, we propose a horizontal data augmentation strategy. To be specific, we only randomly crop/pad the top or the bottom of the input bounding boxes, by a fraction equal to the absolute value of a float drawn from a Gaussian distribution with mean 0 and standard deviation σ. That is, we assume the level of inaccurate detection follows a form of Gaussian distribution. In all our experiments, the standard deviation σ is set to 0.05, and the fraction is further clipped at 0.15 to prevent generating outliers. Cropping is adopted when the random number is negative; otherwise, padding is applied. The input images are augmented in this way only with a probability of 0.4.
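The augmentation above might be implemented as follows with PIL, applied before image re-sizing. The paper does not specify how the side (top vs. bottom) is chosen or what fill value is used for padding; picking the side uniformly at random and zero-padding are our assumptions.

```python
import random
from PIL import Image, ImageOps

def gaussian_horizontal_crop_pad(img, sigma=0.05, clip=0.15, prob=0.4):
    """Randomly crop or pad the top or bottom of a PIL image."""
    if random.random() > prob:
        return img
    r = random.gauss(0.0, sigma)               # signed Gaussian draw
    frac = min(abs(r), clip)                   # clip the fraction at 0.15
    delta = int(round(frac * img.height))
    if delta == 0:
        return img
    top = random.random() < 0.5                # assumed: side picked uniformly
    if r < 0:                                  # negative draw -> crop
        box = ((0, delta, img.width, img.height) if top
               else (0, 0, img.width, img.height - delta))
        return img.crop(box)
    border = (0, delta, 0, 0) if top else (0, 0, 0, delta)
    return ImageOps.expand(img, border=border, fill=0)   # assumed zero fill
```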
We conduct extensive tests to validate our proposed method on three publicly available person re-ID datasets.

Market-1501:
This dataset (Zheng et al., 2015) consists of 32,668 images of 1,501 labeled persons captured by 6 cameras. The dataset is split into a training set which contains 12,936 images of 751 identities, and a test set with 3,368 query images and 19,732 gallery images of 750 identities.
DukeMTMC-reID:
This dataset is a subset of DukeMTMC (Ristani et al., 2016) and contains 36,411 images of 1,812 persons captured by 8 cameras. 16,522 images of 702 identities were selected as training samples, and the remaining 702 identities form the testing set, which consists of 2,228 query images and 17,661 gallery images.
CUHK03:
CUHK03 (Li et al., 2014) consists of 14,096 images of 1,467 identities. The whole dataset is captured by six cameras, and each identity is observed by at least two disjoint cameras. In this paper, we follow the new protocol (Zhong et al., 2017a), which divides the CUHK03 dataset into training/testing sets in a way similar to Market-1501.
Evaluation Metrics:
To evaluate each component of our proposed model and to compare its performance with existing state-of-the-art methods, we adopt the Cumulative Matching Characteristic (CMC) (Gray et al., 2007) at rank-1 and the mean Average Precision (mAP) in all our experiments. Note that all experiments are conducted in the single-query setting without applying re-ranking (Zhong et al., 2017a).
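For reference, a minimal sketch of computing rank-1 and mAP from a query-gallery distance matrix is shown below; for brevity it omits the standard junk/same-camera filtering used in the official evaluation protocols.

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distances; q_ids, g_ids: identity labels."""
    order = np.argsort(dist, axis=1)             # closest gallery items first
    matches = g_ids[order] == q_ids[:, None]     # relevance at each rank
    rank1 = matches[:, 0].mean()
    aps = []
    for row in matches:                          # average precision per query
        hits = np.flatnonzero(row)
        if hits.size == 0:
            continue
        precision_at_hits = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return float(rank1), float(np.mean(aps))
```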
Data Pre-processing:
During training, the input images are re-sized to a resolution of 384 × 128 to better capture detailed information. We deploy random horizontal flipping and random erasing (Zhong et al., 2017b) for data augmentation. Note that our complete framework also contains the horizontal data augmentation, which is deployed before image re-sizing.
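Putting the pieces together, the training-time augmentation pipeline could look like this; gaussian_horizontal_crop_pad is the sketch from the data augmentation section above, and the RandomErasing parameters are torchvision defaults rather than values from the paper.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Lambda(gaussian_horizontal_crop_pad),   # applied before re-sizing
    T.Resize((384, 128)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.RandomErasing(),                        # (Zhong et al., 2017b)
])
```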
Loss Hyper-parameters:
In all our experiments, we set the parameter of the label-smoothing softmax loss to ε = 0.1. Because our classification loss is the sum of the global and the local classification losses, we weight the triplet loss accordingly; the parameters for the combined loss are set to λ1 = λ2 = λ3 = 1. For the triplet loss sampling, we set P = 24 identities per mini-batch, with K images drawn from each identity.

Optimization:
We use SGD with momentum 0.9 to optimize our model, with the weight decay factor set to 0.0005. To let the components that have not been pre-trained catch up, we set the initial learning rate of the attention modules, feature extractors, and classifiers to 0.1, while the initial learning rate of the backbone network is set to 0.01. The learning rate is halved at epochs 150, 180, 210, 240, 270, 300, 330, and 360, and training runs for 450 epochs in total.
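A sketch of this two-speed SGD setup is given below; it assumes the model exposes its pre-trained backbone and the newly added modules as separate attributes, which are illustrative names.

```python
import torch

def build_optimizer(model):
    optimizer = torch.optim.SGD(
        [{'params': model.backbone.parameters(), 'lr': 0.01},
         {'params': model.new_modules.parameters(), 'lr': 0.1}],
        momentum=0.9, weight_decay=0.0005)
    # Halve every group's learning rate at the listed epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 180, 210, 240, 270, 300, 330, 360],
        gamma=0.5)
    return optimizer, scheduler
```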
We further perform comprehensive ablation studies on each component of our proposed model on the Market-1501 dataset.
Table 1: Evaluating each component in our proposed method.
Dataset: Market-1501 (metrics in %)

Method                          rank 1    mAP
Baseline                        93.2      82.2
Base+MRFA                       93.8      83.2
- features before ⊗ +CAM        93.3      82.8
- features after ⊗ +CAM         93.3      83.1
Base+MRFA+CAM                   94.3      83.9
Base+MRFA+CAM+TL                95.2      87.5
Base+MRFA+CAM+TL+HDA            95.5      88.1

Effectiveness of MRFA:
We first evaluate the effect of our proposed multi-receptive field attention (MRFA) module by comparing it with the baseline network. The results are shown in Table 1: we observe an improvement of 0.6% / 1.0% in rank 1/mAP on Market-1501. Notice that MRFA is only added to the last two stages of the ResNet50 baseline; we observed little improvement when adding MRFA to the earlier stages of the backbone network, so, considering the cost of a more complicated network, we decided to only add MRFA to the last two stages.

Effectiveness of View-specific Learning:
We compare the performance of our proposed model with and without adding the camera ID classification loss to the MRFA modules (see Table 1). We see a 0.5% / 0.7% gain in rank 1/mAP on Market-1501 with view-specific learning on the attention mechanism.

To further show the necessity of adding the camera loss on the attention mechanism, and that the primary cause of the performance gain is not simply the introduction of a harder objective, we conduct experiments moving the two camera losses from the attention mechanism to the features of the corresponding stages (stage 3 and stage 4) of the backbone network. We experiment with two settings: one adds the camera loss before the ⊗ operation with the attention mask, and the other adds it after the ⊗ operation. In both settings (the "features before ⊗" and "features after ⊗" rows in Table 1), we see degradation in rank 1 and mAP. This demonstrates that adding the camera loss directly on the backbone network is not helpful; it likely disturbs the camera-invariant features extracted by the backbone network.

Benefit of Combined Objective Training with Triplet and Softmax Loss:
Our network is trained by minimizing both the triplet loss and the softmax loss jointly.

Table 2: Comparison with the state-of-the-art on the Market-1501 and DukeMTMC-reID datasets. The best results are in bold, while underlined numbers denote the second best.
Model                                     Market-1501         DukeMTMC-reID
                                          rank 1    mAP       rank 1    mAP
SVDNet (Sun et al., 2017)                 82.3      62.1      76.7      56.8
PAN (Zheng et al., 2018)                  82.8      63.4      71.6      51.5
MultiScale (Chen et al., 2017)            88.9      73.1      79.2      60.6
MLFN (Chang et al., 2018)                 90.0      74.3      81.0      62.8
HA-CNN (Li et al., 2018)                  91.2      75.7      80.5      63.8
Mancs (Wang et al., 2018a)                93.1      82.3      84.9      71.8
Attention-Driven (Yang et al., 2019)      94.9      86.4      86.0      74.5
PCB+RPP (Sun et al., 2018)                93.8      81.6      83.3      69.2
HPM (Fu et al., 2018)                     94.2      82.7      86.6      74.3
MGN (Wang et al., 2018b)                  95.7      86.9      88.7      78.4
VMRFANet (Ours)                           95.5      88.1      88.9      80.0
Table 3: Comparison of results on CUHK03-labeled (CUHK03-L) and CUHK03-detected (CUHK03-D) with the new protocol (Zhong et al., 2017a). The best results are in bold, while underlined numbers denote the second best.
Model                            CUHK03-L            CUHK03-D
                                 rank 1    mAP       rank 1    mAP
SVDNet (Sun et al., 2017)        40.9      37.8      41.5      37.3
MLFN (Chang et al., 2018)        54.7      49.2      52.8      47.8
HA-CNN (Li et al., 2018)         44.4      41.0      41.7      38.6
PCB+RPP (Sun et al., 2018)       –         –         63.7      57.5
MGN (Wang et al., 2018b)         68.0      67.4      68.0      66.0
MRFANet (Ours)                   81.1      78.8      78.9      75.3
We evaluated its performance against our Baseline+MRFA+CAM setting. We found that the combination of losses not only brings significant improvements (+0.9% / +3.6% in rank 1/mAP on Market-1501) but also speeds up the convergence. Notably, the triplet loss is essential, since it serves as the cross-view consistency regularization term in the view-specific learning mechanism.

Impact of Horizontal Data Augmentation on Strip-based Re-ID Model:
Finally, we add horizontal data augmentation to the Baseline+MRFA+CAM+TL network and obtain our final view-specific multi-receptive field attention network (VMRFANet: Baseline+MRFA+CAM+TL+HDA). Comparing the models with and without horizontal data augmentation, the performance gain (+0.3% / +0.6% in rank 1/mAP on the Market-1501 dataset) proves the effectiveness of the data augmentation strategy against misalignment.

We evaluate our proposed model against current state-of-the-art methods on three large benchmarks. The comparisons on Market-1501 and DukeMTMC-reID are summarized in Table 2, while the results on CUHK03 are shown in Table 3.
Results on Market-1501:
Our method achieves the best result on the mAP metric and the second best on rank 1, outperforming all other approaches on rank 1 except the strip-based method MGN (Wang et al., 2018b). However, MGN incorporates three independent branches after stage 3 of the ResNet50 backbone to extract features with multiple granularities; moreover, the difference is only marginal, and our method achieves this competitive result with a much smaller network. Remarkably, on this dataset, whose bounding boxes are automatically detected, the Gaussian horizontal data augmentation strategy greatly improves the robustness of the model.
Results on DukeMTMC-reID:
Our method achieves the best results on this dataset on both metrics. Notably, PCB (Sun et al., 2018) is the strip-based model that serves as the starting point of our approach; we surpass it by +10.8% on mAP and +5.6% on rank 1. MGN gets the second best results among all compared methods on this dataset. On the other hand, our model outperforms the listed attention-based models by a large margin.
Results on CUHK03:
To evaluate our proposed method on CUHK03, we follow the new protocol (Zhong et al., 2017a). However, since only a relative label (with binary values 1 and 2) is used for identifying which camera an image comes from, we found it hard to extract the exact camera IDs from CUHK03. Thus we test our model without enabling the view-specific learning on this dataset. Table 3 shows the results of our proposed method on CUHK03. Remarkably, although the MRFA module is not guided by camera IDs, our model still outperforms all other methods by a large margin.
In this work, we introduce a novel multi-receptive field attention module which brings a considerable performance boost to a strip-based person re-ID network. Besides, we propose a horizontal data augmentation strategy which is shown to be particularly helpful against misalignment issues. Combined with the idea of injecting view information through the attention module, our proposed model achieves superior performance compared with the current state-of-the-art on three widely used person re-identification benchmark datasets.
REFERENCES
Cai, H., Wang, Z., and Cheng, J. (2019). Multi-scale body-part mask guided attention for person re-identification.

Chang, X., Hospedales, T. M., and Xiang, T. (2018). Multi-level factorisation net for person re-identification.

Chen, Y., Zhu, X., and Gong, S. (2017). Person re-identification by deep learning multi-scale representations. Pages 2590–2600.

Fu, Y., Wei, Y., Zhou, Y., Shi, H., Huang, G., Wang, X., Yao, Z., and Huang, T. (2018). Horizontal pyramid matching for person re-identification. arXiv preprint arXiv:1804.05275.

Gray, D., Brennan, S., and Tao, H. (2007). Evaluating appearance models for recognition, reacquisition, and tracking.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Hermans, A., Beyer, L., and Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

Li, W., Zhao, R., Xiao, T., and Wang, X. (2014). DeepReID: Deep filter pairing neural network for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Li, W., Zhu, X., and Gong, S. (2018). Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294.

Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking.

Sun, Y., Zheng, L., Deng, W., and Wang, S. (2017). SVDNet for pedestrian retrieval.

Sun, Y., Zheng, L., Yang, Y., Tian, Q., and Wang, S. (2018). Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Wang, C., Zhang, Q., Huang, C., Liu, W., and Wang, X. (2018a). Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In The European Conference on Computer Vision (ECCV).

Wang, G., Yuan, Y., Chen, X., Li, J., and Zhou, X. (2018b). Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, MM '18, pages 274–282, New York, NY, USA. ACM.

Yang, F., Yan, K., Lu, S., Jia, H., Xie, X., and Gao, W. (2019). Attention driven person re-identification. Pattern Recognition, 86:143–155.

Yu, H., Wu, A., and Zheng, W. (2018). Unsupervised person re-identification by deep asymmetric metric embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1.

Zhang, X., Luo, H., Fan, X., Xiang, W., Sun, Y., Xiao, Q., Jiang, W., Zhang, C., and Sun, J. (2017). AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184.

Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015). Scalable person re-identification: A benchmark. In The IEEE International Conference on Computer Vision (ICCV).

Zheng, Z., Zheng, L., and Yang, Y. (2018). Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology.

Zhong, Z., Zheng, L., Cao, D., and Li, S. (2017a). Re-ranking person re-identification with k-reciprocal encoding.

Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2017b). Random erasing data augmentation. arXiv preprint arXiv:1708.04896.