Lightweight Multi-Branch Network for Person Re-Identification
Fabian Herzog, Xunbo Ji, Torben Teepe, Stefan Hörmann, Johannes Gilg, Gerhard Rigoll
Technical University of Munich
ABSTRACT
Person Re-Identification aims to retrieve person identities from images captured by multiple cameras, or by the same cameras at different time instances and locations. Because of its importance in many vision applications, from surveillance to human-machine interaction, person re-identification methods need to be reliable and fast. While more and more deep architectures are proposed to increase performance, these methods also increase overall model complexity. This paper proposes a lightweight network that combines global, part-based, and channel features in a unified multi-branch architecture that builds on the resource-efficient OSNet backbone. Using a well-founded combination of training techniques and design choices, our final model achieves state-of-the-art results on CUHK03 labeled, CUHK03 detected, and Market-1501 with 85.1% mAP / 87.2% rank-1, 82.4% mAP / 84.9% rank-1, and 91.5% mAP / 96.3% rank-1, respectively.
Index Terms — Person Re-Identification, Deep Learning, Image Processing
1. INTRODUCTION
Person Re-Identification (PREID) is an important computer vision task for video surveillance applications. Formally, the problem can be stated as follows [1]. Given a probe image P and a gallery of M images G = {G_i}_{i=1}^M, all of which are annotated with an associated identity id(G_i) ∈ N, the goal is to find a similarity measure sim(·) such that

    i* = argmax_{i=1,...,M} sim(P, G_i)  ⇒  id(P) = id(G_{i*}).    (1)

While it is no surprise that the success of deep learning and the need for PREID as a processing step for person tracking have resulted in numerous approaches, the problem remains challenging, especially when it comes to balancing performance against low model complexity.

Recently, multi-branch architectures in particular have been proposed [2–6]. These methods allow the network to focus on different person features in individual branches, e.g., on distinct spatial parts or channels. Although branching generally increases model performance, it comes with higher computational costs, especially if the number of branches or the total number of operations in them is increased. We claim that this additional model complexity is not necessary and propose a network that outperforms other multi-branch

∗ Equal contribution. Correspondence to: [email protected]
We gratefully acknowledge the financial support from Deutsche Forschungsgemeinschaft (DFG) under grant number RI 658/25-2.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
approaches by using a suitable feature extractor and the right combination of training techniques. The resulting network consists of three branches that optimize the global, part-based, and channel-wise representations, respectively, using simple computations. Despite this branching, we succeed in keeping the number of parameters low by using OSNet [7], a lightweight feature extractor that has recently proven to be more efficient and accurate than other backbones for PREID tasks. Our deep neural network achieves state-of-the-art results on two important benchmark datasets, Market-1501 [8] and CUHK03 [9]. In detailed ablation studies, we demonstrate how the respective branches increase model performance, why our network performs better than other multi-branch approaches, and which training techniques are necessary to train a multi-branch architecture with an OSNet backbone. Code and pretrained models of our research are publicly available.
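The retrieval rule in Eq. (1) can be sketched as follows; this is a minimal NumPy version which assumes embeddings have already been extracted by a network and uses cosine similarity as sim(·) (the function name `retrieve` is illustrative, not part of our released code):

```python
import numpy as np

def retrieve(probe_emb, gallery_embs, gallery_ids):
    """Return the identity of the best-matching gallery image, cf. Eq. (1)."""
    # Cosine similarity: L2-normalize, then take inner products.
    p = probe_emb / np.linalg.norm(probe_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ p                   # sim(P, G_i) for all i = 1, ..., M
    i_star = int(np.argmax(sims))  # i* = argmax_i sim(P, G_i)
    return gallery_ids[i_star]     # predicted id(P) = id(G_{i*})
```

In practice the gallery embeddings are precomputed once, so a query reduces to one matrix-vector product and an argmax.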
2. RELATED WORK
While PREID has been studied as a computer vision task for a long time [10], deep learning accelerated the research progress and model performance significantly, dominating the scene ever since [1, 2, 7, 11–13]. PREID approaches can be categorized as follows. First, several methods focus on improving feature extraction for the global input images [7, 12, 14]. Luo et al. [12] contributed comprehensive research on many training techniques and were able to find combinations that boost the overall performance. Zhou et al. [7], on the other hand, concentrated on the feature extraction itself, proposing OSNet, a multi-scale network designed explicitly for the PREID task that outperforms standard ResNet50 [15] backbones despite a much lower number of parameters. Another important research direction is finding spatial partitions of the persons' images [3, 13, 16]. Usually, the input image is divided into disjoint parts, often horizontal stripes, to obtain partitioned features that are discriminative for person matching. Sun et al. [13] utilized the idea of part pooling, where the partitioning is done via spatial pooling after the convolutional layers of the backbone. This idea has since been used in other architectures [2, 3, 16]. In this context, many multi-branch or multi-stage approaches have been developed [2, 3, 6]. They mostly try to learn global and spatial part features in individual branches, or combine part, channel, and global features, either through pooling [2, 3, 16] or attention [4, 5, 17, 18].
3. METHODOLOGY

3.1. Network Architecture
Like all recent works on the problem, we design an end-to-end neural network architecture based on strong image feature extraction backbones pretrained on ImageNet [19]. In this subsection, we describe the architecture and training of the proposed network to solve Eq. (1). Code and pretrained models are available at https://github.com/jixunbo/LightMBN.

Fig. 1: Structure of our network.
After forwarding images through the first three blocks of an OSNet backbone, our network continues in three distinct branches to learn global, channel-based, and part-based features. All volumes are forwarded to BNNeck layers to produce final embeddings suited for different loss functions.

Our goal is to utilize a multi-branch architecture similar to MGN [2] and SCR [3] that leverages global, part-based, and channel-based features, while keeping the overall number of parameters and embeddings low. Consequently, and as illustrated in Fig. 1, our network consists of three branches: the global branch, the part branch, and the channel branch.

Let X ∈ R^{H×W×3} be an input image. Before separating into distinct branches, the image X is passed through a truncated OSNet [7] backbone, up until the first layer of the third block, i.e., conv3_0, as in [6]. This concept has been employed before with ResNet50 [2, 3], using the first blocks up to conv4_0. We chose OSNet over ResNet due to its superior performance and lower complexity for PREID tasks [6, 7]. After forwarding X through the initial layers, the network forms the three branches, which comprise the remaining layers of OSNet up to the fifth block. By this design, only the layers up to conv3_0 are shared by all the branches, and for each individual branch, we obtain a feature tensor of dimension h × w × c.

In the global branch, we obtain two global representations as follows: First, we aggregate the information by applying 2D average pooling to the tensor, obtaining the c-dimensional vector g. For the second global representation, the initial h × w × c tensor is used as input for a drop block, inspired by [20]. The drop block removes the most highly activated horizontal regions from the tensor, forcing the network to emphasize less discriminative regions, which increases the robustness of the resulting representation.
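The dropping step can be sketched as follows; this is a simplified single-sample NumPy version (the actual batch drop block of [20] is applied to whole training batches, and `drop_ratio` here is an illustrative parameter, not the exact value from our configuration):

```python
import numpy as np

def drop_highest_stripe(feat, drop_ratio=0.33):
    """Zero out the most highly activated horizontal stripe of a feature map.

    feat: array of shape (C, H, W). Returns a copy with the stripe removed.
    """
    c, h, w = feat.shape
    drop_h = max(1, int(h * drop_ratio))
    # Activation strength per row, summed over channels and width.
    row_act = np.abs(feat).sum(axis=(0, 2))
    # Total activation of every contiguous stripe of height drop_h.
    stripe_act = np.convolve(row_act, np.ones(drop_h), mode="valid")
    top = int(np.argmax(stripe_act))          # most discriminative stripe
    out = feat.copy()
    out[:, top:top + drop_h, :] = 0.0         # suppress the dominant region
    return out
```

Because the dominant stripe is zeroed, the subsequent max pooling is forced to pick up activations from the remaining, less discriminative regions.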
Having removed the regions of highest activity, we apply 2D max pooling to the resulting tensor, obtaining another c-dimensional vector g_drop.

In the channel branch, the initial h × w × c tensor is reduced to a c-dimensional vector and then partitioned into two vectors of length c/2 each. We use 1 × 1 convolutions to scale the representations back up, obtaining two c-dimensional vectors c_1 and c_2. Here, the parameters of the 1 × 1 convolutions are shared among both channel parts.

Finally, in the part branch, we transform the initial h × w × c tensor into three representations. We use average pooling to obtain a volume of size 2 × 1 × c that we split into two c-dimensional part-based representations p_1 and p_2, representing the upper and lower body, respectively. Additionally, we use max pooling on the initial volume, obtaining another c-dimensional global representation p_g within the part branch.

We use a BNNeck [12] for all branch vector representations calculated in this way. Each BNNeck block consists of batch normalization and a fully connected layer with one output per identity class. The aim of this block is to optimize embeddings for two different metric spaces at the same time. Embeddings obtained before the batch normalization layer are used for optimization with respect to a ranking loss (e.g., triplet loss [21]), while embeddings obtained after the fully connected layer are used for optimization with respect to an identity loss (e.g., Cross-Entropy (CE) loss). Embeddings obtained after the batch normalization but before the fully connected layer strike a balance between the representations of the two metric spaces (i.e., ranking space and identity space) and are therefore used for inference. From the resulting embeddings we form two sets, given by

    I := {ĝ, ĝ_drop, p̂_1, p̂_2, p̂_g, ĉ_1, ĉ_2},    (2)
    R := {g, g_drop, p_g},                           (3)

for training in identity and rank spaces, respectively, where ·̂ denotes the tensors of the BNNeck representations after the fully connected layer.
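The pooling operations of the three branches described above can be sketched at the shape level as follows (NumPy, one feature map of shape C × H × W; the shared 1 × 1 convolutions of the channel branch are stood in for by a hypothetical shared projection matrix `proj`, just to illustrate the weight sharing):

```python
import numpy as np

def branch_pooling(feat, proj=None):
    """Illustrate global, part, and channel pooling on one feature map."""
    c, h, w = feat.shape
    # Global branch: 2D average pooling -> vector g of length C.
    g = feat.mean(axis=(1, 2))
    # Part branch: average-pool into two horizontal stripes (upper/lower
    # body), plus an extra max-pooled global vector within the branch.
    p_1 = feat[:, : h // 2, :].mean(axis=(1, 2))
    p_2 = feat[:, h // 2 :, :].mean(axis=(1, 2))
    p_g = feat.max(axis=(1, 2))
    # Channel branch: pool globally, split the channels in half, then map
    # each half back to C dimensions with SHARED weights (stand-in for the
    # shared 1x1 convolutions).
    if proj is None:
        proj = np.eye(c, c // 2)    # hypothetical shared projection weights
    pooled = feat.mean(axis=(1, 2))
    c_1 = proj @ pooled[: c // 2]
    c_2 = proj @ pooled[c // 2 :]
    return g, (p_1, p_2, p_g), (c_1, c_2)
```

Every returned vector is then passed through its own BNNeck before entering the loss computation.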
For training, we use a combination of CE loss and Multi-Similarity (MS) loss [22]. The latter was designed to take advantage of existing pair-wise methods and sampling strategies by exploiting a soft weighting scheme that considers both self-similarity and relative similarity. We compute the MS loss L_MS for the global embeddings R obtained before batch normalization, and the CE loss L_CE on all embeddings I obtained after applying softmax activation to the fully

Table 1: Comparison of our method with state-of-the-art.
The table lists our results on the two most used benchmarks, Market-1501 and CUHK03. The latter was evaluated on the labeled set (CUHK03-L) and the detected set (CUHK03-D) in the multi-gallery-shot setting (cf. [28]). Note that all results are reported without re-ranking (cf. [28]).

                                                            Market-1501   CUHK03-L     CUHK03-D
Type             Method              Publication  Backbone  r1     mAP    r1     mAP   r1     mAP
Global feature   BagOfTricks [12]    CVPRW'19     ResNet50  94.5   85.9   –      –     –      –
                 OSNet [7]           ICCV'19      OSNet     94.8   84.9   –      –     72.3   67.8
                 BDB [14]            ICCV'19      ResNet50  95.3   86.7   79.4   76.7  76.4   73.5
Part-based       PCB+RPP [13]        ECCV'18      ResNet50  93.8   81.6   –      –     –      –
                 MGN [2]             ACM MM'18    ResNet50  95.7   86.9   68.0   67.4  66.8   66.0
                 Pyramid [16]        CVPR'19      ResNet101 95.7   88.2   78.9   76.9  79.9   74.8
                 SCR [3]             WACV'20      ResNet50  95.7   89.0   83.8   80.4  82.2   77.6
Attention-based  MHN [4]             ICCV'19      ResNet50  95.1   85.0   77.2   72.4  71.7   65.4
                 ABD [5]             ICCV'19      ResNet50  95.6   88.3   –      –     –      –
                 PLR-OSNet [6]       PRCV'20      OSNet     95.6   88.9   84.6   80.5  80.4   77.2
                 SCSN [17]           CVPR'20      ResNet50  95.7   88.5   86.8   84.0  84.7   81.0
                 Compact Re-ID [18]  ACM ICMR'20  other     96.2   89.7   –      –     –      –
Ours             LightMBN            –            OSNet     96.3   91.5   87.2   85.1  84.9   82.4
Ours             LightMBN (mAP via [8])  –        OSNet     –      –      –      –     –      –

connected layer, i.e.,

    L_MS(f(X), y) := Σ_{r ∈ R} L_MS(r, y),    (4)
    L_CE(f(X), y) := Σ_{i ∈ I} L_CE(i, y),    (5)

where f(X) is our network's output when forwarding X. For the CE loss L_CE, we further use label smoothing [12, 23], a regularization technique that encourages the model not to be too confident on the training data. It adds a uniform noise distribution to the CE calculation to soften the ground-truth labels, which helps to improve model generalization. Thus, the overall objective loss function is

    L = λ_CE L_CE + λ_MS L_MS,    (6)

where λ_CE and λ_MS are suitable weights. Additionally, we use random erasing augmentation (REA) [24], which substitutes a random rectangle with the image's mean value. It has been demonstrated to improve model generalization and to produce training data of higher variance. Cosine annealing strategies are common in PREID networks [7, 25]. To further boost performance, we use warm-up cosine annealing [26, 27] as our learning rate strategy rather than a traditional step learning rate schedule. The learning rate first grows linearly to its maximum lr_max during the first 10 epochs, after which cosine decay towards the minimum lr_min is applied in the remaining epochs. The learning rate lr(t) at epoch t with T total epochs is given by

    lr(t) = lr_max · t / 10,                                                   if t ≤ 10,
    lr(t) = lr_min + ½ (lr_max − lr_min) (1 + cos(π (t − 10) / (T − 10))),     if 10 < t ≤ T.
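The warm-up cosine annealing schedule above can be implemented in a few lines. The following is a pure-Python sketch; the default `lr_max` and `lr_min` values are placeholders for illustration, not the exact values used in our experiments:

```python
import math

def warmup_cosine_lr(t, total_epochs, lr_max=6e-4, lr_min=6e-7, t_warm=10):
    """Learning rate at epoch t: linear warm-up, then cosine decay.

    Linear warm-up to lr_max over the first t_warm epochs, then cosine
    decay down to lr_min over the remaining epochs (cf. [26, 27]).
    """
    if t <= t_warm:
        return lr_max * t / t_warm
    progress = (t - t_warm) / (total_epochs - t_warm)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

In PyTorch, the same shape can be obtained by chaining a linear warm-up `LambdaLR` with `CosineAnnealingLR`.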
4. EXPERIMENTAL RESULTS

Datasets.
We evaluated the model on two of the most widely used large-scale datasets, Market-1501 [8] and CUHK03 [9]. The Market-1501 dataset contains 32,668 images of 1,501 persons across 6 cameras, whereas the CUHK03 dataset comprises 13,164 images of 1,360 persons across 6 cameras. For CUHK03, we use the new 767-split protocol [28], obtaining results for the labeled (CUHK03-L) and detected (CUHK03-D) configurations separately. We did not evaluate on DukeMTMC-ReID, since use of this dataset has been prohibited by the authors.
Training Details.
For training, input images are normalized to channel-wise zero mean and unit standard deviation at a spatial resolution of 384 × 128. Data augmentation is performed by resizing images to 105% of their width and height followed by random cropping, as well as random horizontal flipping with a probability of 0.5. Models are trained for 140 epochs on Market-1501 and 180 epochs on CUHK03 with a batch size of 48; each batch consists of 8 samples for each of 6 identities. The parameters are optimized using the Adam optimizer [29] with ε = 10⁻⁸, β₁ = 0.9, and β₂ = 0.999. The backbones are pretrained on ImageNet [19], and all experiments are implemented in PyTorch [30]. To balance the losses, we chose λ_CE = λ_MS = 0.5.

Evaluation Details.
Cosine distance is utilized to compute cumulative matching characteristics (CMC) [31]. Query and gallery images are resized to 384 × 128 pixels and normalized. For a fair comparison with other existing methods, the CMC rank-1 accuracy (r1) and mean Average Precision (mAP) are reported as evaluation metrics. Results with the same identity and the same camera ID as the query image are not counted. The authors of [32] state in their official code repository that mAP values computed with recent PREID frameworks are about one percentage point higher than those computed by the original Matlab evaluation code of Market-1501 [8]. We were able to reproduce this. For completeness and fair comparison, we also state the mAP values for our final models as computed by the original evaluation script. We hope to raise more awareness of this issue by providing both results.

https://github.com/VisualComputingInstitute/triplet-reid

Comparison with State-of-the-Art. Table 1 compares the performance of our model with that of other recent methods. Our model achieves state-of-the-art results on Market-1501, CUHK03-L, and CUHK03-D, both in terms of rank-1 accuracy and mAP. The large difference in performance with regard to mAP on all datasets is particularly noticeable. Interestingly, despite its simplicity, our architecture achieves better performance than other multi-branch approaches. Architecturally, our model is closely related to previous work such as MGN [2], PLR-OSNet [6], and, in particular, SCR [3]. All of these approaches use a truncated backbone followed by branching. MGN relies on ResNet50 and only uses spatial partitions, whereas our model builds upon OSNet and also better exploits the PREID problem by additionally using channel partitions. In this regard, SCR is the most similar architecture, since both spatial and channel partitions are used for multi-loss training. However, for good performance, SCR requires nearly twice as many embeddings as our model and creates part and channel partitions in the same branch, which could theoretically impede the branches' specialization.

Table 2: Ablation study of branch influences. We investigate our model's performance under the specified branch configurations, where G+C+P refers to our original model.

                 Market-1501      CUHK03-D
Branch           rank1   mAP      rank1   mAP
Global (G)       95.4    89.3     80.8    77.3
Channel (C)      95.9    88.8     74.7    71.2
Part (P)         95.9    90.2     80.3    77.9
C+P              96.1    91.2     82.7    79.8
G+C              96.0    90.9     82.0    79.7
G+P              96.1    91.2     83.4    81.3
G+C+P            96.3    91.5     84.9    82.4
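The ranking protocol described under Evaluation Details (cosine distance, with gallery entries sharing the query's identity and camera ID excluded) can be sketched as follows; `rank1_accuracy` is a hypothetical helper, a minimal NumPy version of the standard CMC rank-1 computation:

```python
import numpy as np

def rank1_accuracy(q_emb, q_ids, q_cams, g_emb, g_ids, g_cams):
    """CMC rank-1 under the standard single-query protocol."""
    # Cosine distance = 1 - cosine similarity on L2-normalized embeddings.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    g = g_emb / np.linalg.norm(g_emb, axis=1, keepdims=True)
    dist = 1.0 - q @ g.T
    hits = 0
    for i in range(len(q)):
        # Exclude gallery images with the same identity AND same camera.
        valid = ~((g_ids == q_ids[i]) & (g_cams == q_cams[i]))
        order = np.argsort(dist[i][valid])
        hits += g_ids[valid][order[0]] == q_ids[i]   # correct top-1 match?
    return hits / len(q)
```

The mAP computation follows the same masking logic but averages precision over the full ranked list instead of checking only the top match.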
Influence of Branches. When introducing branches to a neural network architecture, the parameter count can rise substantially. Thus, any such introduction has to be well justified. Table 2 depicts our network's performance for different branch combinations. The results suggest that the single branches perform similarly when the other two respective branches are deactivated. Among all branches, the channel branch has the lowest performance on CUHK03-D, indicating that global features are very important for generalization on this dataset. As can be seen from the pairwise combinations of branches, the part branch influences the performance on CUHK03-D significantly. By using all three branches together, our model achieves state-of-the-art results on both datasets.
Influence of Backbones.
Table 3 shows some examples of the different performance of ResNet50 and OSNet. The raw model with ResNet50 (i.e., the one without any beneficial additions) has the weakest performance among all models. Only with all possible additions is it able to achieve performance similar to that of a raw model with OSNet backbone. The best configuration that can be achieved with ResNet50 is still inferior to our final model. Our model with OSNet backbone has only about 9 million parameters, compared to about 23 million with a ResNet50 backbone.
Influence of Learning Rate Schedule.
As can be seen in Table 3,when substituting the cosine warmup annealing schedule with a con-stant schedule, performance decreases. For the constant schedule,we have reduced the initial learning rate of × − three times bya factor of in the 50th, 80th and 110th epoch, respectively. Theresults indicate the importance of a suitable learning rate strategy forPRID on both datasets. Influence of Drop Block.
The results in Table 3 suggest that the drop block has hardly any influence on the performance on Market-1501. On the other hand, the results on the CUHK03 dataset clearly show that the drop block can lead to better generalization on the test set and increases both metrics.
Table 3: Ablation study of training techniques. We investigate our model's performance under the specified training modifications. Here, WCA indicates the use of warm-up cosine annealing, MS the use of MS loss over triplet loss, DB the use of the drop block, and OSNet the use of OSNet over ResNet50 as backbone, respectively.

       Configuration          Market-1501     CUHK03-D
OSNet   WCA   MS    DB        r1      mAP     r1      mAP
✗       ✗     ✗     ✗
✗       ✓     ✓     ✓
✓       ✗     ✗     ✗
✓       ✓     ✗     ✗
✓       ✓     ✗     ✓
✓       ✗     ✓     ✓
✓       ✓     ✓     ✗
✓       ✓     ✓     ✓
Influence of Loss Functions.
We trained various modifications of our model with triplet loss instead of MS loss. Using MS loss in the final model slightly increases the rank-1 and mAP performance on CUHK03, but not on Market-1501. Thus, the choice of ranking loss function can be important for generalization on smaller datasets.
5. CONCLUSION
We have presented a multi-branch neural network that achieves state-of-the-art results on Market-1501 and CUHK03. Although branches increase the overall parameter count, we can keep the overall model complexity low by utilizing a lightweight OSNet backbone and suitable training techniques. The distinct branches of our network can capture the essential person features. Overall, our research suggests that learning rate schedules and the backbone choice heavily influence the model performance, and that drop blocks and MS loss assist the model in generalizing on the smaller CUHK03 dataset. We conclude that multi-branch architectures should focus on the right combination of training techniques and OSNet feature extraction instead of adding model complexity.
6. REFERENCES

[1] Liang Zheng, Yi Yang, and Alexander G. Hauptmann, "Person re-identification: Past, present and future," arXiv preprint arXiv:1610.02984, 2016.
[2] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou, "Learning discriminative features with multiple granularities for person re-identification," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 274–282.
[3] Hao Chen, Benoit Lagadec, and Francois Bremond, "Learning discriminative and generalizable representations by spatial-channel partition for person re-identification," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2483–2492.
[4] Binghui Chen, Weihong Deng, and Jiani Hu, "Mixed high-order attention network for person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 371–381.
[5] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang, "ABD-Net: Attentive but diverse person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8351–8361.
[6] Ben Xie, Xiaofu Wu, Suofei Zhang, Shiliang Zhao, and Ming Li, "Learning diverse features with part-level resolution for person re-identification," arXiv preprint arXiv:2001.07442, 2020.
[7] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang, "Omni-scale feature learning for person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3702–3712.
[8] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian, "Scalable person re-identification: A benchmark," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[9] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang, "DeepReID: Deep filter pairing neural network for person re-identification," in CVPR, 2014.
[10] Niloofar Gheissari, Thomas B. Sebastian, and Richard Hartley, "Person reidentification using spatiotemporal appearance," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2006, vol. 2, pp. 1528–1535.
[11] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi, "Deep learning for person re-identification: A survey and outlook," arXiv preprint arXiv:2001.04193, 2020.
[12] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang, "Bag of tricks and a strong baseline for deep person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[13] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 480–496.
[14] Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, and Ping Tan, "Batch DropBlock network for person re-identification and beyond," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3691–3701.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[16] Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji, "Pyramidal person re-identification via multi-loss dynamic training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8514–8522.
[17] Xuesong Chen, Canmiao Fu, Yong Zhao, Feng Zheng, Jingkuan Song, Rongrong Ji, and Yi Yang, "Salience-guided cascaded suppression network for person re-identification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3300–3310.
[18] Hussam Lawen, Avi Ben-Cohen, Matan Protter, Itamar Friedman, and Lihi Zelnik-Manor, "Compact network training for person ReID," in Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 164–171.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[20] Rodolfo Quispe and Helio Pedrini, "Top-DB-Net: Top DropBlock for activation enhancement in person re-identification," arXiv preprint arXiv:2010.05435, 2020.
[21] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[22] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott, "Multi-similarity loss with general pair weighting for deep metric learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5022–5030.
[23] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[24] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, "Random erasing data augmentation," arXiv preprint arXiv:1708.04896, 2017.
[25] Xiangyu Zhu, Zhenbo Luo, Pei Fu, and Xiang Ji, "VOC-ReID: Vehicle re-identification based on vehicle-orientation-camera," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 602–603.
[26] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[27] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li, "Bag of tricks for image classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 558–567.
[28] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li, "Re-ranking person re-identification with k-reciprocal encoding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1318–1327.
[29] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, "Automatic differentiation in PyTorch," 2017.
[31] Hyeonjoon Moon and P. Jonathon Phillips, "Computational and performance aspects of PCA-based face-recognition algorithms," Perception, vol. 30, no. 3, pp. 303–321, 2001.
[32] Alexander Hermans, Lucas Beyer, and Bastian Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.