TransReID: Transformer-based Object Re-Identification
Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
Alibaba Group, Zhejiang University
{shuting_he, jiangwei_zju}@zju.edu.cn  {michuan.lh, pichao.wang, fan.w, lihao.lh}@alibaba-inc.com

Abstract
In this paper, we explore the Vision Transformer (ViT), a pure transformer-based model, for the object re-identification (ReID) task. With several adaptations, a strong baseline ViT-BoT is constructed with ViT as the backbone, which achieves results comparable to convolutional neural network (CNN) based frameworks on several ReID benchmarks. Furthermore, two modules are designed in consideration of the specialties of ReID data: (1) It is natural and simple for the Transformer to encode non-visual information such as camera or viewpoint into vector embedding representations; with these embeddings plugged in, ViT is able to mitigate the bias caused by diverse cameras or viewpoints. (2) We design a Jigsaw branch, parallel with the Global branch, to facilitate the training of the model in a two-branch learning framework. In the Jigsaw branch, a jigsaw patch module is designed to learn robust feature representations and to help the training of the transformer by shuffling the patches. With these novel modules, we propose a pure-transformer framework dubbed TransReID, which, to the best of our knowledge, is the first work to use a pure Transformer for ReID research. Experimental results of TransReID are promising, achieving state-of-the-art performance on both person and vehicle ReID benchmarks.
1. Introduction
Object re-identification (ReID), which includes person ReID and vehicle ReID, is a challenging task that aims to identify all images of a target object in a cross-camera system. CNN-based methods [16, 23, 31, 37] have achieved great success in both person and vehicle ReID. In recent years, as the transformer has become popular for language modeling, it has also shown promising performance in numerous computer-vision applications [9, 15]. On the one hand, 'CNN + Transformer' has become a popular paradigm for computer vision [1, 10, 2, 30, 18, 17, 41]; on the other hand, pure transformers [27, 3, 7, 34] are attracting more and more attention.

* This work was done while Shuting He was an intern at Alibaba, supervised by Hao Luo and Pichao Wang.
Figure 1: Comparison of performance vs. computational cost among TransReID, ViT-BoT, ResNet and ResNeSt on MSMT17. The computational cost of ResNet50 is taken as the baseline for the inference time comparison.

The Vision Transformer (ViT) [7] is the first work to show that a pure transformer architecture can be directly applied to image classification, by treating an image as a sequence of patches. ViT achieves impressive performance with large-scale pretraining datasets. By cutting the image into patches, transformers are capable of globally attending to all patches at every layer, which weakens the spatial correspondence between the input and intermediate features. However, in ReID, spatial alignment is critical for feature learning [32, 26]. This naturally raises the question of whether ViT can be fine-tuned for tasks that require more spatial alignment than image recognition does. We adopt the ViT model as the backbone for feature extraction and, with several adaptations, construct a reasonably strong baseline named ViT-BoT, inspired by Bag of Tricks (BoT) [23]. ViT-BoT achieves comparable performance with CNN-based frameworks including ResNet [12], ResNeSt [42] and MGN [37], which demonstrates the potential of a transformer-based model as a peer backbone for ReID.

Different from image classification, ReID data usually include some non-visual clues such as cameras and viewpoints, which have been verified effective in CNN-based works [51, 5, 50]. However, CNN-based methods usually require special designs to incorporate this useful information. For example, Camera-based Batch Normalization (CBN) [51] modifies the BN layers to learn camera-related knowledge; the Viewpoint-Aware Network (VANet) [5] designs a viewpoint-aware metric learning approach that handles similar and different viewpoints in two feature spaces to learn viewpoint-invariant features. Instead of treating each clue with a dedicated design, we propose a unified framework that simply incorporates different kinds of non-visual clues (side information) to learn invariant features in the context of the transformer. Compared with these CNN-based models, such side information can be easily encoded through vector embeddings in the transformer; the corresponding module is introduced as the Side Information Embedding (SIE).

For better training of ViT-BoT, we design a new branch called the Jigsaw branch. It resides in the last layer of ViT-BoT, parallel with the standard Global branch. Even though the Global branch encodes global information of all patches, only a few discriminative patches make the main contribution. Thus, in the Jigsaw branch, a new module named the jigsaw patch module (JPM) is designed, inspired by stripe-based methods [32, 37, 24, 28]. Different from those CNN-based methods, where the stripes are collected as local continuous patches, the patches in JPM are shuffled to form new larger patches. This design has two advantages. Firstly, the shuffled patches push the model to learn a representation that is robust to perturbations; secondly, the newly formed patches contain global information, which makes ID recognition easier. Combining the newly designed SIE and JPM modules, we propose the final model architecture, termed TransReID.
As shown in Figure 1, TransReID achieves a good speed-accuracy trade-off. In summary, the contributions of this paper are as follows:

1) We propose a pure transformer framework for ReID tasks for the first time, and construct a strong baseline ViT-BoT with several adaptations. ViT-BoT achieves comparable performance with state-of-the-art CNN-based frameworks.

2) The Side Information Embedding (SIE) is introduced as a unified framework to encode various kinds of side information for object ReID. It is demonstrated that SIE can reduce the feature bias caused by different cameras or object viewpoints.

3) The Jigsaw Patch Module (JPM) is proposed to take advantage of stripe-based ideas. With a shuffle operation, the JPM facilitates the training of a better and more robust feature representation in a two-branch learning framework.

4) TransReID achieves state-of-the-art performance on both person and vehicle ReID benchmarks, including MSMT17, Market-1501, DukeMTMC-reID, Occluded-Duke, VeRi-776 and VehicleID.
2. Related Work
2.1. Object Re-Identification

The studies of object ReID mainly focus on person ReID and vehicle ReID. The current state-of-the-art methods are mostly based on CNN structures. Summarizing these works, we find that representation learning, local features and invariant features are critical for successful ReID; the corresponding related works are presented below.
Representation Learning.
A popular pipeline for object ReID is to design suitable loss functions to train a CNN backbone (e.g. ResNet [12]) that extracts image features. The loss functions can be categorized into classification-based losses and metric losses. For the classification-based loss (a.k.a. ID loss), Zheng et al. [47] proposed the ID-discriminative embedding (IDE), which trains the ReID model as image classification and is fine-tuned from ImageNet [6] pre-trained models. Different from the ID loss, metric losses regard the ReID task as a clustering or ranking problem. The most widely used metric loss is the triplet loss [19]. Luo et al. [23] proposed the BNNeck to better combine the ID loss and the triplet loss. Sun et al. [31] proposed a unified perspective for the ID loss and the triplet loss.
Local Features.
Local feature learning aggregates fine-grained part/region features. The fine-grained parts are either generated by roughly horizontal stripes or by semantic parsing. Methods like PCB [32], MGN [37], AlignedReID++ [24] and SAN [28] divide an image into several stripes and extract local features for each stripe. Using parsing or keypoint estimation to align parts between two objects has also been proven effective for both person and vehicle ReID [22, 25, 39, 26].
Invariant Features.
In a cross-camera system, there exist pose, orientation, illumination and resolution variances caused by different camera setups and object viewpoints. Some works [51, 5] use side information such as camera IDs or viewpoint information to learn invariant features. For example, Camera-based Batch Normalization (CBN) [51] forces image data from different cameras to be projected onto the same subspace, so that the distribution gap between any camera pair is largely diminished. Viewpoint/orientation-invariant feature learning [5, 50] is also important for both person and vehicle ReID.

2.2. Transformer in Vision
The Transformer model was proposed in [35] to handle sequential data in the field of natural language processing (NLP). Many studies have also shown its effectiveness for computer-vision tasks; Han et al. [9] and Khan et al. [15] survey the application of the Transformer in the field of computer vision. The Transformer model was initially used to handle sequential features extracted by CNN models from videos. Girdhar et al. [8] use a variant of the transformer architecture to aggregate contextual cues in a video relevant to a particular person. Transformer models were then extended to popular computer-vision tasks including image processing [2], object detection [1], semantic segmentation [46, 41] and object tracking [30]. For example, the Image Processing Transformer (IPT) [2] takes advantage of transformers through large-scale pre-training and achieves state-of-the-art performance on several image processing tasks such as super-resolution, denoising and de-raining. The detection transformer (DETR) [1] redesigns the framework of object detection as a simple and fully end-to-end detector. Pure transformer models are becoming more and more popular. ViT [7], proposed recently, applies a pure transformer directly to sequences of image patches. However, ViT requires a large-scale dataset to pretrain the model. To overcome this shortcoming, Touvron et al. [34] propose DeiT, which introduces a teacher-student strategy specific to transformers to speed up ViT training without requiring large-scale pretraining data. In this paper, we extend ViT to the object ReID task and show its effectiveness.
3. Methodology
In this section, we first describe the adaptations made for the ReID task in Sec. 3.1; the resulting baseline is named ViT-BoT. Based on ViT-BoT, we then propose the TransReID framework in Sec. 3.2, which includes a Side Information Embedding (SIE) module and a Jigsaw Patch Module (JPM).
3.1. ViT-BoT Baseline

The framework of ViT-BoT is shown in Figure 2. Since the original training details of ViT are not directly applicable to the ReID task, several adaptations are made to obtain the strong baseline ViT-BoT.
Overlapping Patches.
As a preprocessing step, ViT splits an image into N non-overlapping patches, which does not preserve the local neighboring structures around patch borders well. Instead, we use a sliding window to generate patches with overlapping pixels. Assume the step size of the sliding window is S pixels and the patch size is P = 16 pixels; then the area where two adjacent patches overlap has shape (P − S) × P. Given an input image resized to a fixed resolution H × W, it is split into N patches:

$$N = N_H \times N_W = \left\lfloor \frac{H+S-P}{S} \right\rfloor \times \left\lfloor \frac{W+S-P}{S} \right\rfloor \quad (1)$$

where ⌊·⌋ is the floor function. The larger the overlapping area, the more patches the image is split into. More patches usually bring better performance but also more computation, so a trade-off between performance and computational cost has to be made. For clarity, ViT-BoT_{S=12} denotes that the image is split with step size S = 12; for S = P (the non-overlapping version) the subscript is omitted.

Figure 2: The framework of ViT-BoT: a linear projection of flattened patches with position embeddings, followed by stacked transformer layers; the output class token goes through a BN layer and is trained with the ID loss and the triplet loss. The preprocessing of the input image is the same as in ViT (a non-overlapping partition is shown). The final output class token is used as the global feature.
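Eq. (1) and the overlapping split can be checked with a short sketch (PyTorch assumed; the helper names are illustrative):

```python
import torch
import torch.nn.functional as F

def num_patches(h, w, p=16, s=12):
    # Eq. (1): floor((H + S - P) / S) patches along each spatial dimension.
    return ((h + s - p) // s) * ((w + s - p) // s)

def patchify(images, p=16, s=12):
    # Sliding-window split; stride S < P makes adjacent patches overlap
    # by (P - S) pixels. images: (B, C, H, W) -> (B, N, C*P*P).
    return F.unfold(images, kernel_size=p, stride=s).transpose(1, 2)

x = torch.randn(2, 3, 256, 128)   # a batch at the person ReID resolution
print(num_patches(256, 128))      # 21 * 10 = 210
print(patchify(x).shape)          # torch.Size([2, 210, 768])
```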
Position Embedding.

The position embedding ρ_i encodes the position information of the i-th patch p_i, which is important for the Transformer encoder to capture spatial information. The parameters of ViT pre-trained on ImageNet are loaded to facilitate training. However, as the image resolution for the ReID task differs from the one used in ViT, the position embedding pretrained on ImageNet cannot be directly imported. Therefore, a bilinear interpolation of the position embedding is introduced at runtime, allowing ViT-BoT to handle any given input size and shape. As in ViT, the position embedding of ViT-BoT is learnable.
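The interpolation can be sketched as follows (PyTorch assumed; a 14×14 pretraining grid, i.e. 224×224 inputs with P = 16, is used for illustration):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_hw, new_hw):
    """Bilinearly resize a pretrained position embedding to a new patch grid.

    pos_embed: (1, 1 + old_h * old_w, D), class-token embedding first.
    """
    cls_pe, grid_pe = pos_embed[:, :1], pos_embed[:, 1:]
    (old_h, old_w), (new_h, new_w) = old_hw, new_hw
    d = grid_pe.shape[-1]
    grid_pe = grid_pe.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)
    grid_pe = F.interpolate(grid_pe, size=(new_h, new_w),
                            mode="bilinear", align_corners=False)
    grid_pe = grid_pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)
    return torch.cat([cls_pe, grid_pe], dim=1)

pe = torch.randn(1, 1 + 14 * 14, 768)                 # pretrained 14x14 grid
print(resize_pos_embed(pe, (14, 14), (16, 8)).shape)  # (1, 129, 768)
```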
Feature Learning.

Given an image split into a sequence of patches, a learnable embedding (i.e. the class token) is prepended to the patch embeddings, and the class token of the last encoder layer (the final class token) serves as the global feature representation of the image. The final class token is denoted as f, and the remaining outputs corresponding to the input patches are denoted as P_o = {p_o^1, p_o^2, ..., p_o^N}, where N is the total number of patches. Inspired by [23], we introduce the BNNeck after the final class token. The ID loss L_ID is the cross-entropy loss without label smoothing. For a triplet set {a, p, n}, the triplet loss is the soft-margin version:

$$\mathcal{L}_T = \log\left[1 + \exp\left(\|f_a - f_p\|_2^2 - \|f_a - f_n\|_2^2\right)\right] \quad (2)$$
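The ID loss and Eq. (2) translate directly into code. Below is a minimal sketch (PyTorch assumed; the batch-hard triplet mining typically paired with [23] is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def soft_margin_triplet_loss(f_a, f_p, f_n):
    """Eq. (2): log(1 + exp(||f_a - f_p||^2 - ||f_a - f_n||^2)).

    f_a, f_p, f_n: (B, D) anchor, positive and negative features.
    """
    d_ap = (f_a - f_p).pow(2).sum(dim=1)   # squared distance to positive
    d_an = (f_a - f_n).pow(2).sum(dim=1)   # squared distance to negative
    return F.softplus(d_ap - d_an).mean()  # softplus(x) = log(1 + exp(x))

# ID loss on the BNNeck output: plain cross-entropy, no label smoothing.
def id_loss(logits, labels):
    return F.cross_entropy(logits, labels)
```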
3.2. TransReID

Although ViT-BoT achieves impressive performance on the object ReID task, it does not exploit the specialties of ReID data. To make better use of side information and fine-grained parts, we propose the Side Information Embedding (SIE) and the Jigsaw Patch Module (JPM). With SIE and JPM, the proposed framework TransReID is presented in Figure 3.

Figure 3: The framework of the proposed TransReID. The Side Information Embedding (light blue) encodes non-visual information such as camera or viewpoint into embedding representations; it is fed into the transformer encoder together with the patch embeddings and position embeddings. The last layer consists of two independent transformer layers. One is a standard layer that encodes the global feature (Global branch). The other contains the Jigsaw Patch Module (Jigsaw branch), which shuffles all patches and regroups them into several groups; all groups are fed into a shared transformer layer to learn local features. Both the global feature and the local features contribute to the ReID loss.

3.2.1. Side Information Embedding

In object ReID, a challenging problem is the appearance bias caused by different cameras, viewing angles and other factors. To tackle this problem, CNN-based frameworks usually need to modify the network structure or design specific loss functions to incorporate such non-visual clues (side information) as camera IDs and viewpoint predictions. The transformer is perfectly suited here, as it can easily fuse side information by encoding it into embedding representations. Similar to the position embedding, we apply learnable layers to encode side information. Specifically, if the camera ID of an image is C, its camera embedding is denoted S(C). Different from the position embedding, which varies between patches, the camera embedding S(C) is the same for all patches of an image. In addition, if the viewpoint of the object is available, either from a viewpoint estimation algorithm or from human annotations, we can also encode the viewpoint label V as S(V) for all patches of an image.

A natural question is how to integrate the two types of information. A trivial solution is to directly add the two embeddings, S(C) + S(V), but the two embeddings may then cancel each other out. Instead, we propose to encode the camera ID C and the viewpoint label V jointly as S(C, V). In other words, for C_N camera IDs and V_N viewpoint labels, S(C, V) takes C_N × V_N different values. Finally, the input embedding of the i-th patch is:

$$E_i = \mathcal{F}(p_i) + \rho_i + \lambda\,\mathcal{S}(C, V) \quad (3)$$

where F is the linear projection that learns the feature embedding and λ is a hyperparameter balancing the weight of S(C, V). Since the position embedding ρ_i differs between patches but is the same across images, while S(C, V) is the same for every patch but may differ between images, TransReID is able to learn two distinct embeddings which can then be added directly. The whole input is [E_0; E_1, E_2, ..., E_N], where E_0 is the class token.

Here we demonstrate the usage of SIE with camera and viewpoint information, which are both categorical variables; however, SIE can be extended to encode more kinds of information, both categorical and numerical. In our experiments on the different benchmarks, camera and/or viewpoint information is included wherever available.
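A minimal sketch of how the joint encoding of Eq. (3) could be realized (PyTorch assumed; the class name, the zero initialization and the default λ value are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn

class SideInformationEmbedding(nn.Module):
    """One learnable vector per (camera, viewpoint) pair, as in Eq. (3)."""

    def __init__(self, num_cams, num_views, dim, lam=2.0):
        super().__init__()
        # C_N x V_N jointly indexed embeddings S(C, V)
        self.sie = nn.Parameter(torch.zeros(num_cams * num_views, dim))
        self.num_views = num_views
        self.lam = lam  # the balancing weight lambda

    def forward(self, tokens, pos_embed, cam_id, view_id):
        # tokens: (B, N+1, D) patch embeddings with class token prepended.
        # cam_id, view_id: (B,) long tensors; one S(C, V) per image,
        # shared by all of its patches.
        idx = cam_id * self.num_views + view_id
        return tokens + pos_embed + self.lam * self.sie[idx].unsqueeze(1)
```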
3.2.2. Jigsaw Patch Module

We change the last layer of ViT-BoT into two parallel branches which learn global features and local features with two independent transformer layers. Suppose the hidden features input to the last layer are denoted as Z_{l−1} = [z^0_{l−1}; z^1_{l−1}, z^2_{l−1}, ..., z^N_{l−1}]. The Global branch is a standard transformer layer which encodes Z_{l−1} into Z_l = [f_g; z^1_l, z^2_l, ..., z^N_l], where f_g plays the role of the global feature in CNN-based methods. To learn fine-grained part-level features, a straightforward solution is to split [z^1_{l−1}, z^2_{l−1}, ..., z^N_{l−1}] into k groups in order, concatenate each group with the shared token z^0_{l−1}, and feed the k feature groups into a transformer layer to learn k local features {f_l^1, f_l^2, ..., f_l^k}, where f_l^k is the output token of the k-th group.

However, two recent works [40, 13] show that a token embedding is mainly determined by its nearby tokens; a group of nearby patch embeddings therefore mainly observes a limited continuous area. To address this issue, we propose the Jigsaw Patch Module (JPM), which shuffles the patches before grouping them. The shuffle is implemented by a shift operation and a patch shuffle operation, inspired by ShuffleNet [43], as follows:

• Step 1: The shift operation.
The first m patches (we recommend m < N_H) are moved to the end: [z^1_{l−1}, z^2_{l−1}, ..., z^N_{l−1}] is shifted by m steps to become [z^{m+1}_{l−1}, z^{m+2}_{l−1}, ..., z^N_{l−1}, z^1_{l−1}, z^2_{l−1}, ..., z^m_{l−1}].

• Step 2: The patch shuffle operation.
The shifted patches are further shuffled by a patch shuffle operation with k groups. The hidden features become [z^{x_1}_{l−1}, z^{x_2}_{l−1}, ..., z^{x_N}_{l−1}], with x_i ∈ [1, N].

We divide the shuffled features into k groups as described above. The JPM encodes them into k local features {f_l^1, f_l^2, ..., f_l^k} with a shared transformer layer. Thanks to the shuffle operation, a local feature f_l^i can cover patches from different body or vehicle parts. The global feature f_g and the k local features are trained with L_ID and L_T; the overall loss is:

$$\mathcal{L} = \mathcal{L}_{ID}(f_g) + \mathcal{L}_T(f_g) + \frac{1}{k}\sum_{i=1}^{k}\left(\mathcal{L}_{ID}(f_l^i) + \mathcal{L}_T(f_l^i)\right) \quad (4)$$

During inference, we concatenate the global feature and the local features, [f_g, f_l^1, f_l^2, ..., f_l^k], as the final feature representation. Using f_g only is a variant with lower computational cost and only slight performance degradation.
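The shift and patch-shuffle steps can be sketched as follows (PyTorch assumed; the helper name is illustrative, and N is assumed divisible by k, which in practice may require padding):

```python
import torch

def jigsaw_regroup(tokens, m=5, k=4):
    """Shift the first m patch tokens to the end, shuffle, split into k groups.

    tokens: (B, 1 + N, D) input to the Jigsaw branch, class token first.
    Returns k groups, each with the shared class token prepended, ready
    for the shared transformer layer that produces the k local features.
    """
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    b, n, d = patches.shape
    # Step 1: shift by m.
    patches = torch.cat([patches[:, m:], patches[:, :m]], dim=1)
    # Step 2: ShuffleNet-style shuffle over the patch axis, so each of the
    # k groups gathers patches spread across the whole image.
    patches = patches.reshape(b, k, n // k, d).transpose(1, 2).reshape(b, n, d)
    return [torch.cat([cls_tok, g], dim=1) for g in patches.chunk(k, dim=1)]
```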
4. Experiments
We evaluate our proposed method on four person ReID datasets, Market-1501 [45], DukeMTMC-reID [29], MSMT17 [38] and Occluded-Duke [26], and two vehicle ReID datasets, VeRi-776 [21] and VehicleID [20]. Note that, unlike the other datasets, the images in Occluded-Duke are selected from DukeMTMC-reID, and its training/query/gallery sets contain 9%/100%/10% occluded images, respectively. All datasets except VehicleID provide the camera ID for each image, while only VeRi-776 provides viewpoint labels. The details of these datasets are summarized in Table 1.
Table 1: Statistics of datasets used in the paper.
Unless otherwise specified, all person images are resized to 256 × 128 and all vehicle images to 256 × 256. The training images are augmented with random horizontal flipping, padding with 10 pixels, random cropping and random erasing [48]. The batch size is set to 64, with 4 images per ID. The SGD optimizer is employed with a weight decay of 1e-4. The learning rate is initialized to 0.01 with cosine learning rate decay. Unless otherwise specified, we set m = 5, k = 4 for person ReID datasets and m = 8, k = 4 for vehicle ReID datasets.

Backbone         Time    MSMT17 (mAP / R1)   VeRi-776 (mAP / R1)
ResNet50         1x      51.3 / 75.3         76.4 / 95.2
ResNet101        1.48x   53.8 / 77.0         76.9 / 95.2
ResNet152        1.96x   55.6 / 78.4         77.1 / 95.9
ResNeSt50        1.86x   61.2 / 82.0         77.6 / 96.2
ResNeSt200       4.72x   63.5 / 83.5         77.9 / 96.4
ViT-BoT          1.75x   61.0 / 81.8         78.2 / 96.5
ViT-BoT_{S=14}   -       - / -               - / -
ViT-BoT_{S=12}   2.64x   64.4 / 83.5         79.0 / 96.5

Table 2: Comparison between BoT with different CNN backbones and ViT-BoT on MSMT17 and VeRi-776, with FP16 training. Time is inference time relative to ResNet50.
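For reference, the recipe above might be written as follows (a sketch; the momentum value, the number of epochs and the placeholder backbone are assumptions, not values given in the text):

```python
import torch
import torchvision.transforms as T

# Augmentations for person images as described above; RandomErasing
# stands in for the random erasing of [48].
train_transform = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(),
    T.Pad(10),
    T.RandomCrop((256, 128)),
    T.ToTensor(),
    T.RandomErasing(),
])

model = torch.nn.Linear(768, 751)   # placeholder for the ViT-BoT backbone
num_epochs = 120                    # assumed; not specified above

# SGD, weight decay 1e-4, initial lr 0.01 with cosine decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max=num_epochs)
```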
Evaluation Protocols.

Following conventions in the ReID community, we evaluate all methods with Cumulative Matching Characteristic (CMC) curves and the mean Average Precision (mAP). All experiments are performed under the single-query setting.
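As a reference for the protocol, per-query AP and CMC can be computed as sketched below (NumPy; a simplified version that omits the same-camera junk filtering used by the standard evaluation code):

```python
import numpy as np

def evaluate_query(scores, positive):
    """Single-query evaluation for one probe image.

    scores:   (G,) similarity of the query to each gallery image.
    positive: (G,) boolean mask of gallery images with the same ID.
    Returns (AP, CMC curve); mAP and the dataset CMC curve are their
    means over all queries.
    """
    order = np.argsort(-scores)                 # gallery by decreasing score
    matches = positive[order].astype(np.float64)
    cmc = (np.cumsum(matches) > 0).astype(np.float64)  # rank-k hit indicator
    hits = np.cumsum(matches)
    precision = hits / (np.arange(len(matches)) + 1)
    ap = (precision * matches).sum() / max(matches.sum(), 1.0)
    return ap, cmc
```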
In this section, ViT-BoT is compared with BoT to demonstrate its effectiveness as a baseline. To show the trade-off between computation and performance, several different backbones are chosen for BoT, and different choices of the step size S for overlapping patches are also presented. For a comprehensive comparison, in addition to the performance on ReID benchmarks, the inference time consumption of all backbones is included as well.

As shown in Table 2, BoT with a larger backbone consistently achieves better performance on both MSMT17 and VeRi-776. There exists a large gap in model capacity between the ResNet series and ViT. Nevertheless, ViT-BoT achieves performance similar to the ResNeSt50 [42] backbone on MSMT17 and VeRi-776, with less inference time (1.75x vs 1.86x). When we reduce the step size S of the sliding window, the performance of ViT-BoT improves while the inference time increases. ViT-BoT_{S=12} is faster than ResNeSt200 (2.64x vs 4.75x) and performs slightly better on the ReID benchmarks; it therefore achieves a better speed-accuracy trade-off than ResNeSt200. In addition, we believe ViT still has plenty of room for improvement in terms of computational efficiency.

1. http://pytorch.org

Model                   Cam   View   MSMT17 (mAP / R1)   VeRi-776 (mAP / R1)
ViT-BoT                 -     -      61.0 / 81.8         78.2 / 96.5
ViT-BoT + SIE           ✓     -      62.4 / 81.9         - / -
ViT-BoT + SIE           -     ✓      - / -               78.5 / 96.9
ViT-BoT + SIE           ✓     ✓      - / -               79.6 / 96.9
ViT-BoT_{S=12}          -     -      64.4 / 83.5         79.0 / 96.5
ViT-BoT_{S=12} + SIE    -     ✓      - / -               79.3 / 97.0
ViT-BoT_{S=12} + SIE    ✓     ✓      65.9 / -            80.3 / -

Table 3: Ablation study of SIE. Since the person ReID datasets do not provide viewpoint annotations, viewpoint information can only be encoded on VeRi-776.
Performance Analysis.
In Table 3, we evaluate the effectiveness of SIE on MSMT17 and VeRi-776. MSMT17 does not provide viewpoint annotations, so for MSMT17 we only report results with SIE encoding camera information. VeRi-776 not only has a camera ID for each image but is also annotated with 8 different viewpoints according to vehicle orientation; therefore, its results are shown with SIE encoding various combinations of camera IDs and/or viewpoint information.

When SIE encodes only the camera IDs, ViT-BoT and ViT-BoT_{S=12} obtain 1.4% and 1.1% mAP improvements on MSMT17, respectively. A similar conclusion holds on VeRi-776. ViT-BoT obtains 78.5% mAP when SIE encodes viewpoint information, and the accuracy increases to 79.6% mAP when camera IDs and viewpoint labels are encoded at the same time. If the encoding is changed to S(C) + S(V), which is sub-optimal as discussed in Section 3.2.1, ViT-BoT only achieves 78.3% mAP on VeRi-776. The proposed joint encoding S(C, V) is therefore the better choice.

As shown in the bottom half of Table 3, SIE is also effective when added to the stronger baseline ViT-BoT_{S=12}. The observations are similar to those for ViT-BoT, and with all available information encoded the mAP improves to 65.9% and 80.3% on MSMT17 and VeRi-776, respectively.

Appearance Feature Bias.
To verify that SIE can reduce the appearance bias, we visualize the distributions of pairwise similarities for inter-camera, intra-camera, inter-viewpoint and intra-viewpoint pairs on VeRi-776 in Figure 4. The distribution gaps between cameras and between viewpoints are obvious in Figure 4a and Figure 4b, respectively. When we introduce the SIE module into ViT-BoT, the gaps between the inter- and intra-camera/viewpoint distributions are reduced (Figure 4c and Figure 4d), which shows that the SIE module reduces the appearance bias caused by various cameras and viewpoints.
Figure 4: Distributions of inter-camera, intra-camera, inter-viewpoint and intra-viewpoint pairwise similarities on VeRi-776. (a) and (c) show inter-camera and intra-camera similarities without and with SIE; (b) and (d) show inter-viewpoint and intra-viewpoint similarities without and with SIE.
Figure 5: Impact of the hyper-parameter λ on mAP and Rank-1 accuracy on (a) MSMT17 and (b) VeRi-776.

Ablation Study of λ.
The balancing weight λ is a hyper-parameter of the SIE module. We analyze its influence on performance in Figure 5. With λ = 0, the baseline achieves 61.0% mAP and 78.2% mAP on MSMT17 and VeRi-776, respectively. As λ increases, the mAP improves to 63.0% on MSMT17 and 79.9% on VeRi-776 (both at λ values around 2), which means the SIE module is beneficial for learning invariant features. Increasing λ further degrades performance, because the feature embedding and the position embedding become under-weighted.

Ablation Study of JPM.
The effectiveness of the proposed JPM module is validated in Table 4. On the baseline ViT-BoT, JPM provides +2.6% mAP and +1.0% mAP improvements on MSMT17 and VeRi-776, respectively. Increasing the number of groups k improves performance while slightly increasing inference time; in our experience, k = 4 is a good trade-off between speed and performance. Comparing JPM with JPM w/o shuffle, we observe that the shuffle operation helps the model learn more discriminative features, with +0.5% mAP and +0.2% mAP improvements on MSMT17 and VeRi-776, respectively. We also observe that, if only the global feature f_g is used at inference (while still training with the full JPM), the performance (denoted as 'w/o local') is nearly comparable to that of the full feature set, suggesting that the global feature alone can serve as an efficient variant with lower storage and computational cost at inference. For the stronger baseline ViT-BoT_{S=12}, similar conclusions hold, and the JPM module improves performance by +2.1% mAP and +1.0% mAP on MSMT17 and VeRi-776, respectively.

Backbone            k   MSMT17 (mAP / R1)   VeRi-776 (mAP / R1)
ViT-BoT             -   61.0 / 81.8         78.2 / 96.5
+JPM                4   63.6 / 82.5         79.2 / 96.8
+JPM w/o shuffle    4   63.1 / 82.4         79.0 / 96.7
+JPM w/o local      4   63.5 / 82.5         79.1 / 96.6
ViT-BoT_{S=12}      -   64.4 / 83.5         79.0 / 96.5
+JPM                4   66.5 / -            80.0 / -
+JPM w/o shuffle    4   66.1 / 84.5         79.6 / 96.7
+JPM w/o local      4   66.3 / 84.5         79.8 / 96.8

Table 4: Ablation study of the jigsaw patch module. 'w/o shuffle' means the patch features are split into parts without shuffling. 'w/o local' means we evaluate the global feature without concatenating the local features.

Finally, we evaluate the benefits of introducing SIE and JPM in Table 5. For ViT-BoT, SIE and JPM improve performance by +1.4%/+2.6% mAP on MSMT17 and +1.4%/+1.0% mAP on VeRi-776, respectively. With the two modules used together, TransReID achieves 64.9% (+3.9%) mAP and 80.6% (+2.4%) mAP on MSMT17 and VeRi-776, respectively.

Backbone          MSMT17 (mAP / R1)   VeRi-776 (mAP / R1)
ViT-BoT           61.0 / 81.8         78.2 / 96.5
+SIE              62.4 / 81.9         79.6 / 96.9
+JPM              63.6 / 82.5         79.2 / 96.8
TransReID         64.9 / 83.3         80.6 / 96.9
ViT-BoT_{S=12}    64.4 / 83.5         79.0 / 96.5
TransReID*        67.4 / -            81.7 / -

Table 5: Ablation study of TransReID. The star * means the backbone is ViT-BoT_{S=12}.
For ViT-BoT_{S=12}, SIE and JPM provide a similar performance boost, i.e. 67.4% (+3.0%) mAP and 81.7% (+2.6%) mAP on MSMT17 and VeRi-776, respectively. These results demonstrate the effectiveness of the proposed SIE and JPM modules and of the overall framework.

In Table 6, our TransReID is compared with state-of-the-art methods on six benchmarks covering person ReID, occluded ReID and vehicle ReID.

Person ReID datasets (mAP / R1):

Method          Size      MSMT17        Market-1501   DukeMTMC-reID   Occluded-Duke
CBN ⓒ [51]      256×128   42.9 / 72.8   77.3 / 91.3   67.3 / 82.5     - / -
OSNet [49]      256×128   52.9 / 78.7   84.9 / 94.8   73.5 / 88.6     - / -
MGN [37]        384×128   52.1 / 76.9   86.9 / 95.7   78.4 / 88.7     - / -
RGA-SC [44]     256×128   57.5 / 80.3   - / -         - / -           - / -
ABDNet [4]      384×128   - / -         - / -         - / -           - / -
PGFA [26]       256×128   - / -         76.8 / 91.2   65.5 / 82.6     37.3 / 51.4
HOReID [36]     256×128   - / -         84.9 / 94.2   75.6 / 86.9     - / -
TransReID ⓒ     256×128   64.9 / 83.3   88.2 / 95.0   80.6 / 89.6     55.7 / 64.2
TransReID* ⓒ    256×128   - / -         - / -         - / -           - / -
TransReID ⓒ     384×128   66.6 / 84.6   88.8 / 95.0   81.8 / 90.4     57.2 / 64.0
TransReID* ⓒ    384×128   - / -         - / -         - / -           - / -

Vehicle ReID datasets (VeRi-776 mAP / R1; VehicleID R1 / R5):

Method           Size      VeRi-776      VehicleID
PRReID [11]      256×256   72.5 / 93.3   72.6 / 88.6
SAN [28]         256×256   72.5 / 93.3   79.7 / 94.3
UMTS [14]        256×256   75.9 / 95.8   80.9 / 87.0
VANet ⓥ [5]      224×224   66.3 / 89.8   83.3 / 96.0
PVEN ⓥ [25]      256×256   79.5 / 95.6   - / -
SAVER [16]       256×256   79.6 / -      - / -
CFVMNet [33]     256×256   77.1 / 95.3   81.4 / 94.1
TransReID ⓥ      256×256   79.6 / 97.0   83.6 / 97.1
TransReID ⓑ      256×256   80.6 / 96.9   - / -
TransReID* ⓥ     256×256   80.5 / 96.8   - / -
TransReID* ⓑ     256×256   81.7 / -      - / -

Table 6: Comparison with state-of-the-art methods. The star * in the superscript means the backbone is ViT-BoT_{S=12}. Results are shown for person ReID datasets (top) and vehicle ReID datasets (bottom). Only the small subset of VehicleID is used in this paper. ⓒ and ⓥ indicate methods using camera IDs and viewpoint labels, respectively; ⓑ means both are used. Viewpoint and camera information is only used where available. Best results for previous methods and best of our methods are labeled in bold.
Person ReID.
We compare TransReID with other methods on MSMT17, Market-1501 and DukeMTMC-reID. Since the image resolution is a critical factor for model performance, we evaluate our method at two different resolutions. On MSMT17 and DukeMTMC-reID, TransReID* outperforms the previous state-of-the-art ABDNet by a large margin (+8.6%/+4.0% mAP). On Market-1501, TransReID (256×128) reaches 88.2% mAP, surpassing the compared CNN-based methods in mAP.
Compared to PGFA and HOReID, TransReID achieves 55.7% mAP on Occluded-Duke, a large improvement (+11.9% mAP), without requiring any semantic information to align body parts, which shows the ability of TransReID to generate robust feature representations. Furthermore, TransReID* improves the performance to 59.2% mAP with the help of overlapping patches.

Vehicle ReID.
On VeRi-776, TransReID* reaches 81.7% mAP, surpassing SAVER by 2.1% mAP. When using only viewpoint annotations, TransReID* still outperforms VANet and SAVER on both VeRi-776 and VehicleID. Additionally, our method achieves state-of-the-art performance of 85.2% mAP on the larger VehicleID dataset.
5. Conclusion
In this paper, we investigate a pure transformer framework for the object ReID task. The CNN-based baseline BoT is extended to ViT-BoT with several adaptations, and ViT-BoT achieves comparable performance on both person and vehicle ReID benchmarks. Based on ViT-BoT, we propose two novel modules for the Transformer framework, i.e., the side information embedding (SIE) and the jigsaw patch module (JPM). Experiments conducted on MSMT17, Market-1501, DukeMTMC-reID, Occluded-Duke, VeRi-776 and VehicleID under various settings validate the effectiveness of our framework TransReID, which achieves state-of-the-art results on all six benchmarks. Even though ViT has only just opened the door for pure transformer-based models in image classification, the promising results achieved by TransReID make us believe that the transformer has great potential for ReID. We expect ViT-BoT and TransReID to serve as starting points for more research dedicated to transformer-based ReID frameworks. In the future, we plan to explore more efficient transformer-based frameworks for vision tasks, especially regarding the trade-off between representation capability and computational cost.
References

[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[2] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364, 2020.
[3] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
[4] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8351–8361, 2019.
[5] Ruihang Chu, Yifan Sun, Yadong Li, Zheng Liu, Chi Zhang, and Yichen Wei. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8282–8291, 2019.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[8] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.
[9] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
[10] Liang Han, Pichao Wang, Zhaozheng Yin, Fan Wang, and Hao Li. Exploiting better feature aggregation for video object detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1469–1477, 2020.
[11] Bing He, Jia Li, Yifan Zhao, and Yonghong Tian. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3997–4005, 2019.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. ConvBERT: Improving BERT with span-based dynamic convolution. Advances in Neural Information Processing Systems, 33, 2020.
[14] Xin Jin, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Uncertainty-aware multi-shot knowledge distillation for image-based object re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11165–11172, 2020.
[15] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
[16] Pirazh Khorramshahi, Neehar Peri, Jun-cheng Chen, and Rama Chellappa. The devil is in the details: Self-supervised attention for vehicle re-identification. In European Conference on Computer Vision, pages 369–386. Springer, 2020.
[17] Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, and Wanqing Li. Transformer guided geometry model for flow-based unsupervised visual odometry. Neural Computing and Applications, pages 1–12, 2021.
[18] Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, and Wanqing Li. Trear: Transformer-based RGB-D egocentric action recognition. IEEE Transactions on Cognitive and Developmental Systems, pages 1–7, 2021.
[19] Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 26(7):3492–3506, 2017.
[20] Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2167–2175, 2016.
[21] Xinchen Liu, Wu Liu, Huadong Ma, and Huiyuan Fu. Large-scale vehicle re-identification in urban surveillance videos. In IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2016.
[22] Xinchen Liu, Wu Liu, Jinkai Zheng, Chenggang Yan, and Tao Mei. Beyond the parts: Learning multi-view cross-part correlation for vehicle re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, pages 907–915, 2020.
[23] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[24] Hao Luo, Wei Jiang, Xuan Zhang, Xing Fan, Jingjing Qian, and Chi Zhang. AlignedReID++: Dynamically matching local information for person re-identification. Pattern Recognition, 94:53–61, 2019.
[25] Dechao Meng, Liang Li, Xuejing Liu, Yadong Li, Shijie Yang, Zheng-Jun Zha, Xingyu Gao, Shuhui Wang, and Qingming Huang. Parsing-based view-aware embedding network for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7103–7112, 2020.
[26] Jiaxu Miao, Yu Wu, Ping Liu, Yuhang Ding, and Yi Yang. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 542–551, 2019.
[27] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
[28] Jingjing Qian, Wei Jiang, Hao Luo, and Hongyan Yu. Stripe-based and attribute-aware network: A two-branch deep model for vehicle re-identification. Measurement Science and Technology, 31(9):095401, 2020.
[29] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.
[30] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[31] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6398–6407, 2020.
[32] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
[33] Ziruo Sun, Xiushan Nie, Xiaoming Xi, and Yilong Yin. CFVMNet: A multi-branch network for vehicle re-identification based on common field of view. In Proceedings of the 28th ACM International Conference on Multimedia, pages 3523–3531, 2020.
[34] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, 2017.
[36] Guan'an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6449–6458, 2020.
[37] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pages 274–282, 2018.
[38] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
[39] Longhui Wei, Shiliang Zhang, Hantao Yao, Wen Gao, and Qi Tian. GLAD: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, pages 420–428. ACM, 2017.
[40] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. In International Conference on Learning Representations, 2019.
[41] Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, and Ping Luo. Segmenting transparent object in the wild with transformer. arXiv preprint arXiv:2101.08461, 2021.
[42] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, et al. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
[43] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[44] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020.
[45] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
[46] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840, 2020.
[47] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(1):13, 2018.
[48] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
[49] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3702–3712, 2019.
[50] Zhihui Zhu, Xinyang Jiang, Feng Zheng, Xiaowei Guo, Feiyue Huang, Xing Sun, and Weishi Zheng. Viewpoint-aware loss with angular regularization for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13114–13121, 2020.
[51] Zijie Zhuang, Longhui Wei, Lingxi Xie, Tianyu Zhang, Hengheng Zhang, Haozhe Wu, Haizhou Ai, and Qi Tian. Rethinking the distribution gap of person re-identification with camera-based batch normalization. In European Conference on Computer Vision, 2020.