Deep Learning for Person Re-identification: A Survey and Outlook
Mang Ye, Jianbing Shen, Senior Member, IEEE, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi, Fellow, IEEE
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
Abstract—Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and the increasing demand for intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the components involved in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis of closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With performance saturation under the closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost of finding all the correct matches, which provides an additional criterion for evaluating a Re-ID system in real applications. Finally, some important yet under-investigated open issues are discussed.
Index Terms—Person Re-Identification, Pedestrian Retrieval, Literature Survey, Evaluation Metric, Deep Learning
1 INTRODUCTION

PERSON re-identification (Re-ID) has been widely studied as a specific person retrieval problem across non-overlapping cameras [1], [2]. Given a query person-of-interest, the goal of Re-ID is to determine whether this person has appeared in another place at a distinct time captured by a different camera, or even the same camera at a different time instant [3]. The query person can be represented by an image [4], [5], [6], a video sequence [7], [8], or even a text description [9], [10]. Due to the urgent demand for public safety and the increasing number of surveillance cameras, person Re-ID is imperative in intelligent surveillance systems, with significant research impact and practical importance.

Re-ID is a challenging task due to the presence of different viewpoints [11], [12], varying low image resolutions [13], [14], illumination changes [15], unconstrained poses [16], [17], [18], occlusions [19], [20], heterogeneous modalities [10], [21], complex camera environments, background clutter [22], unreliable bounding box generation, etc. These result in large variations and uncertainty. In addition, for practical model deployment, dynamically updated camera networks [23], [24], large-scale galleries with efficient retrieval [25], group uncertainty [26], significant domain shift [27], unseen testing scenarios [28], incremental model updating [29] and changing clothes [30] also greatly increase the difficulties. These challenges mean that Re-ID remains an unsolved problem. Early research efforts mainly focus on hand-crafted feature construction with body structures [31], [32], [33], [34], [35] or distance metric learning [36], [37], [38], [39], [40], [41]. With the advancement of deep learning, person Re-ID has achieved inspiring performance on the widely used benchmarks [5], [42], [43], [44]. However, there is still a large gap between the research-oriented scenarios and practical applications [45]. This motivates us to conduct a comprehensive survey, develop a powerful baseline for different Re-ID tasks, and discuss several future directions.

Though some surveys have also summarized the deep learning techniques [2], [46], [47], our survey makes three major differences: 1) We provide an in-depth and comprehensive analysis of existing deep learning methods by discussing their advantages and limitations, analyzing the state of the art. This provides insights for future algorithm design and new topic exploration. 2) We design a new powerful baseline (AGW: Attention Generalized mean pooling with Weighted triplet loss) and a new evaluation metric (mINP: mean Inverse Negative Penalty) for future developments. AGW achieves state-of-the-art performance on twelve datasets for four different Re-ID tasks. mINP provides a supplement to the existing CMC/mAP metrics, indicating the cost to find all the correct matches. 3) We make an attempt to discuss several important research directions with under-investigated open issues to narrow the gap between the closed-world and open-world applications, taking a step towards real-world Re-ID system design.

• M. Ye is with the School of Computer Science, Wuhan University, China, and the Inception Institute of Artificial Intelligence, UAE.
• J. Shen and L. Shao are with the Inception Institute of Artificial Intelligence, UAE. E-mail: {mangye16, shenjianbingcg}@gmail.com
• G. Lin is with the School of Computer Science, Beijing Institute of Technology, China.
• T. Xiang is with the Centre for Vision Speech and Signal Processing, University of Surrey, UK. Email: [email protected]
• S. C. H. Hoi is with Singapore Management University, and Salesforce Research Asia, Singapore. Email: [email protected]
Fig. 1: The flow of designing a practical person Re-ID system, including five main steps: 1) Raw Data Collection, 2) Bounding Box Generation, 3) Training Data Annotation, 4) Model Training and 5) Pedestrian Retrieval.

TABLE 1: Closed-world vs. Open-world Person Re-ID.

Closed-world (Section 2)    | Open-world (Section 3)
----------------------------|------------------------------------
Single-modality Data        | Heterogeneous Data (§ 3.1)
Bounding Boxes Generation   | Raw Images/Videos (§ 3.2)
Sufficient Annotated Data   | Unavailable/Limited Labels (§ 3.3)
Correct Annotation          | Noisy Annotation (§ 3.4)
Query Exists in Gallery     | Open-set (§ 3.5)
Unless otherwise specified, person Re-ID in this survey refers to the pedestrian retrieval problem across multiple surveillance cameras, from a computer vision perspective. Generally, building a person Re-ID system for a specific scenario requires five main steps (as shown in Fig. 1):

1) Step 1: Raw Data Collection: Obtaining raw video data from surveillance cameras is the primary requirement of practical video investigation. These cameras are usually located in different places under varying environments [48]. Most likely, this raw data contains a large amount of complex and noisy background clutter.

2) Step 2: Bounding Box Generation: Extracting the bounding boxes which contain the person images from the raw video data. Generally, it is impossible to manually crop all the person images in large-scale applications. The bounding boxes are usually obtained by person detection [49], [50] or tracking algorithms [51], [52].

3) Step 3: Training Data Annotation: Annotating the cross-camera labels. Training data annotation is usually indispensable for discriminative Re-ID model learning due to the large cross-camera variations. In the existence of large domain shift [53], we often need to annotate the training data in every new scenario.

4) Step 4: Model Training: Training a discriminative and robust Re-ID model with the previously annotated person images/videos. This step is the core of developing a Re-ID system and it is also the most widely studied paradigm in the literature. Extensive models have been developed to handle the various challenges, concentrating on feature representation learning [54], [55], distance metric learning [56], [57] or their combinations.

5) Step 5: Pedestrian Retrieval: The testing phase conducts the pedestrian retrieval. Given a person-of-interest (query) and a gallery set, we extract the feature representations using the Re-ID model learned in the previous stage. A retrieved ranking list is obtained by sorting the calculated query-to-gallery similarity. Some methods have also investigated ranking optimization to improve the retrieval performance [58], [59].

According to the five steps mentioned above, we categorize existing Re-ID methods into two main trends: closed-world and open-world settings, as summarized in Table 1. A step-by-step comparison is made in the following five aspects: 1)
Single-modality vs. Heterogeneous Data: For the raw data collection in Step 1, all the persons are represented by images/videos captured by single-modality visible cameras in the closed-world setting [5], [8], [31], [42], [43], [44]. However, in practical open-world applications, we might also need to process heterogeneous data, such as infrared images [21], [60], sketches [61], depth images [62], or even text descriptions [63]. This motivates the heterogeneous Re-ID in § 3.1.

2) Bounding Box Generation vs. Raw Images/Videos: For the bounding box generation in Step 2, closed-world person Re-ID usually performs the training and testing based on the generated bounding boxes, where the bounding boxes mainly contain the person appearance information. In contrast, some practical open-world applications require end-to-end person search from the raw images or videos [55], [64]. This leads to another open-world topic, i.e., end-to-end person search in § 3.2.

3) Sufficient Annotated Data vs. Unavailable/Limited Labels: For the training data annotation in Step 3, closed-world person Re-ID usually assumes that we have enough annotated training data for supervised Re-ID model training. However, label annotation for each camera pair in every new environment is time-consuming and labor-intensive, incurring high costs. In open-world scenarios, we might not have enough annotated data (i.e., limited labels) [65] or even any label information [66]. This inspires the discussion of unsupervised and semi-supervised Re-ID in § 3.3.

4) Correct Annotation vs. Noisy Annotation: For Step 4, existing closed-world person Re-ID systems usually assume that all the annotations are correct, with clean labels. However, annotation noise is usually unavoidable due to annotation error (i.e., label noise) or imperfect detection/tracking results (i.e., sample noise, partial Re-ID [67]).
This leads to the analysis of noise-robust person Re-ID under different noise types in § 3.4.

5) Query Exists in Gallery vs. Open-set: In the pedestrian retrieval stage (Step 5), most existing closed-world person Re-ID works assume that the query must occur in the gallery set, evaluated by the CMC [68] and mAP [5]. However, in many scenarios, the query person may not appear in the gallery set [69], [70], or we need to perform verification rather than retrieval [26]. This brings us to the open-set person Re-ID in § 3.5.

This survey first introduces the widely studied person Re-ID under closed-world settings in § 2. A detailed review of the datasets and the state of the art is conducted in § 2.4. We then introduce the open-world person Re-ID in § 3. An outlook for future Re-ID is presented in § 4, including a new evaluation metric (§ 4.1) and a new powerful AGW baseline (§ 4.2). We discuss several under-investigated open issues for future study (§ 4.3). Conclusions are drawn in § 5. A structure overview is shown in the supplementary.
2 CLOSED-WORLD PERSON RE-IDENTIFICATION
This section provides an overview of closed-world person Re-ID. As discussed in § 1, this setting usually has the following assumptions: 1) person appearances are captured by single-modality visible cameras, either as images or video; 2) the persons are represented by bounding boxes, where most of the bounding box area belongs to the same identity; 3) sufficient annotated training data is available for supervised discriminative Re-ID model learning; 4) the annotations are generally correct; and 5) the query person must appear in the gallery set. Typically, a standard closed-world Re-ID system contains three main components: Feature Representation Learning (§ 2.1), which focuses on developing the feature construction strategies; Deep Metric Learning (§ 2.2), which aims at designing the training objectives with different loss functions or sampling strategies; and Ranking Optimization (§ 2.3), which concentrates on optimizing the retrieved ranking list. An overview of the datasets and state of the art with in-depth analysis is provided in § 2.4.

2.1 Feature Representation Learning
We first discuss the feature learning strategies in closed-world person Re-ID. There are four main categories (as shown in Fig. 2): a) Global Feature (§ 2.1.1), which extracts a global feature representation vector for each person image without additional annotation cues [55]; b) Local Feature (§ 2.1.2), which aggregates part-level local features to formulate a combined representation for each person image [75], [76], [77]; c) Auxiliary Feature (§ 2.1.3), which improves the feature representation learning using auxiliary information, e.g., attributes [71], [72], [78], GAN-generated images [42], etc.; and d) Video Feature (§ 2.1.4), which learns a video representation for video-based Re-ID [7] using multiple image frames and temporal information [73], [74]. We also review several specific architecture designs for person Re-ID in § 2.1.5.
2.1.1 Global Feature Representation Learning

Global feature representation learning extracts a global feature vector for each person image, as shown in Fig. 2(a). Since deep neural networks were originally applied to image classification [79], [80], global feature learning was the primary choice when integrating advanced deep learning techniques into the person Re-ID field in the early years.

To capture fine-grained cues in global feature learning, a joint learning framework consisting of a single-image representation (SIR) and a cross-image representation (CIR) is developed in [81], trained with a triplet loss using specific sub-networks. The widely used ID-discriminative Embedding (IDE) model [55] constructs the training process as a multi-class classification problem by treating each identity as a distinct class. It is now widely used in the Re-ID community [42], [58], [77], [82], [83]. Qian et al. [84] develop a multi-scale deep representation learning model to capture discriminative cues at different scales.
Attention Information. Attention schemes have been widely studied in the literature to enhance representation learning [85]. 1) Group 1: Attention within the person image. Typical strategies include pixel-level attention [86] and channel-wise feature response re-weighting [86], [87], [88], [89], or background suppression [22]. The spatial information is integrated in [90]. 2) Group 2: Attention across multiple person images. A context-aware attentive feature learning method is proposed in [91], incorporating both intra-sequence and inter-sequence attention for pair-wise feature alignment and refinement. The attention consistency property is added in [92], [93]. Group similarity [94], [95] is another popular approach to leverage cross-image attention, which involves multiple images for local and global similarity modeling. The first group mainly enhances the robustness against misalignment/imperfect detection, and the second improves the feature learning by mining the relations across multiple images.
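As a rough, dependency-free illustration of the channel-wise re-weighting idea in Group 1 (a sketch in the style of squeeze-and-excitation, not the exact formulation of any cited method; the weights `w1`/`w2` stand in for learned parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, w1, w2):
    """Re-weight each channel of a feature map by a learned gate.

    feature_maps: list of C channels, each a 2-D list (H x W) of floats.
    w1, w2: toy C x C fully-connected weights of a squeeze-excite MLP.
    Returns the channel-wise re-weighted feature maps.
    """
    C = len(feature_maps)
    # Squeeze: global average pooling per channel -> C-dim descriptor.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]
    # Excite: two FC layers (ReLU, then sigmoid) give per-channel gates in (0, 1).
    hidden = [max(0.0, sum(w1[i][j] * desc[j] for j in range(C))) for i in range(C)]
    gates = [sigmoid(sum(w2[i][j] * hidden[j] for j in range(C))) for i in range(C)]
    # Re-weight: scale every spatial location of channel c by its gate.
    return [[[v * gates[c] for v in row] for row in feature_maps[c]]
            for c in range(C)]
```

Channels with stronger average responses receive gates closer to 1 and are emphasized, which is the basic mechanism behind the channel-wise feature response re-weighting discussed above.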
2.1.2 Local Feature Representation Learning

Local feature learning learns part/region aggregated features, making it robust against misalignment [77], [96]. The body parts are either automatically generated by human parsing/pose estimation (Group 1) or obtained by roughly horizontal division (Group 2).

With automatic body part detection, the popular solution is to combine the full-body representation and local part features [97], [98]. Specifically, multi-channel aggregation [99], multi-scale context-aware convolutions [100], multi-stage feature decomposition [17] and bilinear pooling [97] are designed to improve the local feature learning. Rather than feature-level fusion, part-level similarity combination is also studied in [98]. Another popular solution is to enhance the robustness against background clutter, using pose-driven matching [101], a pose-guided part attention module [102], or semantic part alignment [103], [104].

For horizontally divided region features, multiple part-level classifiers are learned in the Part-based Convolutional Baseline (PCB) [77], which now serves as a strong part feature learning baseline in the current state of the art [28], [105], [106]. To capture the relations across multiple body parts, the Siamese Long Short-Term Memory (LSTM) architecture [96], second-order non-local attention [107] and Interaction-and-Aggregation (IA) [108] are designed to reinforce the feature learning.

The first group uses human parsing techniques to obtain semantically meaningful body parts, which provides well-aligned part features. However, it requires an additional pose detector and is prone to noisy pose detections [77]. The second group uses a uniform partition to obtain the horizontal stripe parts, which is more flexible, but is sensitive to heavy occlusions and large background clutter.
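The uniform horizontal partition of Group 2 can be sketched in a few lines (a toy stand-in for the PCB-style stripe pooling, assuming the stripe count evenly divides the feature-map height; in the actual method each stripe feature then feeds its own part-level classifier):

```python
def stripe_features(feature_map, num_stripes=6):
    """Split a feature map into horizontal stripes and average-pool each.

    feature_map: H x W x C tensor as nested lists (feature_map[h][w] is a
    C-dim list). Returns one C-dim part feature per stripe.
    """
    H = len(feature_map)
    C = len(feature_map[0][0])
    assert H % num_stripes == 0, "toy sketch assumes H divisible by num_stripes"
    rows_per = H // num_stripes
    parts = []
    for s in range(num_stripes):
        rows = feature_map[s * rows_per:(s + 1) * rows_per]
        count = sum(len(r) for r in rows)
        # Average over all spatial positions within the stripe, per channel.
        part = [sum(cell[c] for r in rows for cell in r) / count
                for c in range(C)]
        parts.append(part)
    return parts
```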
2.1.3 Auxiliary Feature Representation Learning

Auxiliary feature representation learning usually requires additional annotated information (e.g., semantic attributes [71]) or generated/augmented training samples to reinforce the feature representation [19], [42].

Fig. 2: Four different feature learning strategies. a) Global Feature, learning a global representation for each person image in § 2.1.1; b) Local Feature, learning part-aggregated local features in § 2.1.2; c) Auxiliary Feature, learning the feature representation using auxiliary information, e.g., attributes [71], [72], in § 2.1.3; and d) Video Feature, learning the video representation using multiple image frames and temporal information [73], [74], in § 2.1.4.

Semantic Attributes. A joint identity and attribute learning baseline is introduced in [72]. Su et al. [71] propose a deep attribute learning framework by incorporating the predicted semantic attribute information, enhancing the generalizability and robustness of the feature representation in a semi-supervised learning manner. Both the semantic attributes and the attention scheme are incorporated to improve part feature learning [109]. Semantic attributes are also adopted in [110] for video Re-ID feature representation learning. They are also leveraged as auxiliary supervision information in unsupervised learning [111].
Viewpoint Information. The viewpoint information is also leveraged to enhance the feature representation learning [112], [113]. Multi-Level Factorisation Net (MLFN) [112] also tries to learn identity-discriminative and view-invariant feature representations at multiple semantic levels. Liu et al. [113] learn a combination of view-generic and view-specific features. An angular regularization is incorporated in [114] for viewpoint-aware feature learning.
Domain Information. A Domain Guided Dropout (DGD) algorithm [54] is designed to adaptively mine the domain-sharable and domain-specific neurons for multi-domain deep feature representation learning. Treating each camera as a distinct domain, Lin et al. [115] propose a multi-camera consistent matching constraint to obtain a globally optimal representation in a deep learning framework. Similarly, the camera view information or the detected camera location is also applied in [18] to improve the feature representation with camera-specific information modeling.
GAN Generation. This section discusses the use of GAN-generated images as auxiliary information. Zheng et al. [42] make the first attempt to apply the GAN technique to person Re-ID, improving supervised feature representation learning with the generated person images. Pose constraints are incorporated in [116] to improve the quality of the generated person images, producing person images with new pose variants. A pose-normalized image generation approach is designed in [117], which enhances the robustness against pose variations. Camera style information [118] is also integrated into the image generation process to address cross-camera variations. A joint discriminative and generative learning model [119] separately learns the appearance and structure codes to improve the image generation quality. Using GAN-generated images is also a widely used approach in unsupervised domain adaptation Re-ID [120], [121], approximating the target distribution.
Data Augmentation. For Re-ID, commonly used operations include random resizing, cropping and horizontal flipping [122]. Besides, adversarially occluded samples [19] are generated to augment the variation of the training data. A similar random erasing strategy is proposed in [123], adding random noise to the input images. A batch DropBlock [124] randomly drops a region block in the feature map to reinforce attentive feature learning. Bak et al. [125] generate virtual humans rendered under different illumination conditions. These methods enrich the supervision with the augmented samples, improving the generalizability on the testing set.
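The random erasing idea above can be sketched as follows (a minimal grayscale version, not the exact formulation of [123], which additionally randomizes the erased region's area and aspect ratio within configurable ranges):

```python
import random

def random_erasing(image, erase_prob=0.5, area_frac=0.25, rng=None):
    """Randomly blank out a rectangular region of an H x W image.

    image: 2-D list of pixel values (grayscale for simplicity).
    With probability erase_prob, a sub-rectangle covering roughly
    area_frac of the image is replaced by random noise, simulating
    occlusion at training time. The input image is left untouched.
    """
    rng = rng or random.Random()
    if rng.random() > erase_prob:
        return image
    H, W = len(image), len(image[0])
    eh = max(1, int(H * area_frac ** 0.5))
    ew = max(1, int(W * area_frac ** 0.5))
    top = rng.randrange(H - eh + 1)
    left = rng.randrange(W - ew + 1)
    out = [row[:] for row in image]
    for i in range(top, top + eh):
        for j in range(left, left + ew):
            out[i][j] = rng.random()  # fill the erased region with noise
    return out
```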
2.1.4 Video Feature Representation Learning

Video-based Re-ID is another popular topic [126], where each person is represented by a video sequence with multiple frames. Due to the rich appearance and temporal information, it has gained increasing interest in the Re-ID community. This also brings additional challenges for video feature representation learning with multiple images.

The primary challenge is to accurately capture the temporal information. A recurrent neural network architecture is designed for video-based person Re-ID [127], which jointly optimizes the final recurrent layer for temporal information propagation and the temporal pooling layer. A weighted scheme for spatial and temporal streams is developed in [128]. Yan et al. [129] present a progressive/sequential fusion framework to aggregate the frame-level human region representations. Semantic attributes are also adopted in [110] for video Re-ID with feature disentangling and frame re-weighting. Jointly aggregating the frame-level features and spatio-temporal appearance information is crucial for video representation learning [130], [131], [132].

Another major challenge is the unavoidable outlier tracking frames within the videos. Informative frames are selected in a joint Spatial and Temporal Attention Pooling Network (ASTPN) [131], and the contextual information is integrated in [130]. A co-segmentation inspired attention model [132] detects salient features across multiple video frames with mutual consensus estimation. A diversity regularization [133] is employed to mine multiple discriminative body parts in each video sequence. An affine hull is adopted to handle the outlier frames within the video sequence [83]. An interesting work [20] utilizes the multiple video frames to auto-complete occluded regions. These works demonstrate that handling the noisy frames can greatly improve video representation learning.

It is also challenging to handle the varying lengths of video sequences. Chen et al. [134] divide the long video sequences into multiple short snippets, aggregating the top-ranked snippets to learn a compact embedding. A clip-level learning strategy [135] exploits both spatial and temporal attention cues to produce a robust clip-level representation. Both the short- and long-term relations [136] are integrated in a self-attention scheme.
2.1.5 Specific Architecture Design

Framing person Re-ID as a specific pedestrian retrieval problem, most existing works adopt the network architectures [79], [80] designed for image classification as the backbone. Some works have tried to modify the backbone architecture to achieve better Re-ID features. For the widely used ResNet50 backbone [80], the important modifications include changing the stride of the last convolutional stage to 1 [77], employing adaptive average pooling in the last pooling layer [77], and adding a bottleneck layer with batch normalization after the pooling layer [82].

Accuracy is a major concern in specific Re-ID network architecture design. Li et al. [43] make the first attempt by designing a filter pairing neural network (FPNN), which jointly handles misalignment and occlusions with part discriminative information mining. Wang et al. [89] propose BraidNet with a specially designed WConv layer and Channel Scaling layer. The WConv layer extracts the difference information of two images to enhance the robustness against misalignment, and the Channel Scaling layer optimizes the scaling factor of each input channel. A Multi-Level Factorisation Net (MLFN) [112] contains multiple stacked blocks to model various latent factors at a specific level, and the factors are dynamically selected to formulate the final representation. An efficient fully convolutional Siamese network [137] with a convolution similarity module is developed to optimize multi-level similarity measurement. The similarity is efficiently captured and optimized using depth-wise convolution.

Efficiency is another important factor in Re-ID architecture design. An efficient small-scale network, namely the Omni-Scale Network (OSNet) [138], is designed by incorporating point-wise and depth-wise convolutions. To achieve multi-scale feature learning, a residual block composed of multiple convolutional streams is introduced.

With the increasing interest in automated machine learning, an Auto-ReID [139] model is proposed. Auto-ReID provides an efficient and effective automated neural architecture design based on a set of basic architecture components, using a part-aware module to capture the discriminative local Re-ID features. This provides a potential research direction for exploring powerful domain-specific architectures.
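A related architectural tweak, generalized-mean (GeM) pooling, appears in the AGW baseline mentioned in § 1; a minimal per-channel sketch (the real layer learns the exponent `p` end-to-end, which is omitted here):

```python
def gem_pool(channel_values, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over one channel's spatial activations.

    channel_values: flat list of non-negative activations for one channel.
    p = 1 reduces to average pooling; as p grows, the result approaches
    max pooling, so a (learnable) p interpolates between the two.
    """
    n = len(channel_values)
    clamped = [max(v, eps) for v in channel_values]  # avoid 0 ** (1/p) issues
    return (sum(v ** p for v in clamped) / n) ** (1.0 / p)
```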
2.2 Deep Metric Learning

Metric learning was extensively studied before the deep learning era, by learning a Mahalanobis distance function [36], [37] or a projection matrix [40]. In deep learning, the role of metric learning has been replaced by loss function designs that guide the feature representation learning. We first review the widely used loss functions in § 2.2.1 and then summarize the training strategies with specific sampling designs in § 2.2.2.
2.2.1 Loss Function Design

This survey only focuses on the loss functions designed for deep learning [56]. An overview of the distance metric learning designed for hand-crafted systems can be found in [2], [143]. There are three widely studied loss functions, with their variants, in the person Re-ID literature: the identity loss, the verification loss and the triplet loss. An illustration of the three loss functions is shown in Fig. 3.
Fig. 3: Three kinds of widely used loss functions in the literature. (a) Identity Loss [42], [82], [118], [140]; (b) Verification Loss [94], [141]; and (c) Triplet Loss [14], [22], [57]. Many works employ their combinations [87], [137], [141], [142].
Identity Loss. It treats the training process of person Re-ID as an image classification problem [55], i.e., each identity is a distinct class. In the testing phase, the output of the pooling layer or embedding layer is adopted as the feature extractor. Given an input image $x_i$ with label $y_i$, the predicted probability of $x_i$ being recognized as class $y_i$ is encoded with a softmax function, represented by $p(y_i|x_i)$. The identity loss is then computed by the cross-entropy

$$\mathcal{L}_{id} = -\frac{1}{n}\sum_{i=1}^{n} \log\left(p(y_i|x_i)\right), \qquad (1)$$

where $n$ represents the number of training samples within each batch. The identity loss has been widely used in existing methods [19], [42], [82], [92], [95], [106], [118], [120], [140], [144]. Generally, it is easy to train and automatically mines hard samples during the training process, as demonstrated in [145]. Several works have also investigated softmax variants [146], such as the sphere loss in [147] and AM softmax in [95]. Another simple yet effective strategy, i.e., label smoothing [42], [122], is generally integrated into the standard softmax cross-entropy loss. Its basic idea is to avoid the model fitting to over-confident annotated labels, improving the generalizability [148].

Verification Loss. It optimizes the pairwise relationship, either with a contrastive loss [96], [120] or a binary verification loss [43], [141]. The contrastive loss improves the relative pairwise distance comparison, formulated by

$$\mathcal{L}_{con} = (1-\delta_{ij})\left\{\max(0, \rho - d_{ij})\right\}^2 + \delta_{ij} d_{ij}^2, \qquad (2)$$

where $d_{ij}$ represents the Euclidean distance between the embedding features of two input samples $x_i$ and $x_j$. $\delta_{ij}$ is a binary label indicator ($\delta_{ij} = 1$ when $x_i$ and $x_j$ belong to the same identity, and $\delta_{ij} = 0$ otherwise). $\rho$ is a margin parameter. There are several variants, e.g., the pairwise comparison with ranking SVM in [81].

Binary verification [43], [141] discriminates the positive and negative of an input image pair. Generally, a differential feature $f_{ij}$ is obtained by $f_{ij} = (f_i - f_j)^2$ [141], where $f_i$ and $f_j$ are the embedding features of two samples $x_i$ and $x_j$. The verification network classifies the differential feature into positive or negative. We use $p(\delta_{ij}|f_{ij})$ to represent the probability of an input pair ($x_i$ and $x_j$) being recognized as $\delta_{ij}$ (0 or 1). The verification loss with cross-entropy is

$$\mathcal{L}_{veri}(i,j) = -\delta_{ij}\log\left(p(\delta_{ij}|f_{ij})\right) - (1-\delta_{ij})\log\left(1-p(\delta_{ij}|f_{ij})\right). \qquad (3)$$

The verification loss is often combined with the identity loss to improve the performance [94], [96], [120], [141].

Triplet Loss.
It treats the Re-ID model training process as a retrieval ranking problem. The basic idea is that the distance between the positive pair should be smaller than that of the negative pair by a pre-defined margin [57]. Typically, a triplet contains one anchor sample $x_i$, one positive sample $x_j$ with the same identity, and one negative sample $x_k$ from a different identity. The triplet loss with a margin parameter $\rho$ is represented by

$$\mathcal{L}_{tri}(i,j,k) = \max(\rho + d_{ij} - d_{ik},\ 0), \qquad (4)$$

where $d(\cdot)$ measures the Euclidean distance between two samples. The large proportion of easy triplets will dominate the training process if we directly optimize the above loss function, resulting in limited discriminability. To alleviate this issue, various informative triplet mining methods have been designed [14], [22], [57], [97]. The basic idea is to select the informative triplets [57], [149]. Specifically, a moderate positive mining with a weight constraint is introduced in [149], which directly optimizes the feature difference. Hermans et al. [57] demonstrate that online hardest positive and negative mining within each training batch is beneficial for discriminative Re-ID model learning. Some methods also studied point-to-set similarity strategies for informative triplet mining [150], [151]. This enhances robustness against outlier samples with a soft hard-mining scheme.

To further enrich the triplet supervision, a quadruplet deep network is developed in [152], where each quadruplet contains one anchor sample, one positive sample and two mined negative samples. The quadruplets are formulated with a margin-based online hard negative mining. Optimizing the quadruplet relationship results in smaller intra-class variation and larger inter-class variation.

The combination of triplet loss and identity loss is one of the most popular solutions for deep Re-ID model learning [28], [87], [90], [93], [103], [104], [116], [137], [142], [153], [154].
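As a rough, framework-free sketch of this popular combination (plain Python over toy feature lists; the batch-hard mining follows the spirit of the hardest positive/negative strategy discussed above, not any one cited implementation):

```python
import math

def identity_loss(logits, label):
    """Softmax cross-entropy for one sample (Eq. (1) without batch averaging)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -(logits[label] - log_z)

def batch_hard_triplet(features, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, use its hardest positive
    (farthest same-identity sample) and hardest negative (closest
    different-identity sample) within the batch."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    losses = []
    for i, (f, y) in enumerate(zip(features, labels)):
        pos = [dist(f, g) for j, (g, z) in enumerate(zip(features, labels))
               if z == y and j != i]
        neg = [dist(f, g) for g, z in zip(features, labels) if z != y]
        if pos and neg:
            losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses)
```

In practice the two terms are summed (often with a weighting factor) and minimized jointly, so the classifier shapes identity-discriminative features while the triplet term enforces the ranking margin.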
The triplet loss and identity loss are mutually beneficial for discriminative feature representation learning.

OIM loss.
In addition to the above three kinds of loss functions, an Online Instance Matching (OIM) loss [64] is designed with a memory bank scheme. A memory bank {v_k, k = 1, 2, · · · , c} contains the stored instance features, where c denotes the class number. The OIM loss is then formulated by

L_oim = −(1/n) Σ_{i=1}^{n} log [ exp(v_{y_i}^T f_i / τ) / Σ_{k=1}^{c} exp(v_k^T f_i / τ) ],   (5)

where v_{y_i} represents the stored memory feature corresponding to class y_i, and τ is a temperature parameter that controls the similarity space [145]. v_{y_i}^T f_i measures the online instance matching score. The comparison with a memorized feature set of unlabelled identities is further included in the denominator [64], handling the large number of instances from non-targeted identities. This memory scheme is also adopted in unsupervised domain adaptive Re-ID [106].
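A minimal sketch of Eq. (5) follows. The moving-average memory update and the L2 normalization of the stored features are common choices but are assumptions here, not fixed by Eq. (5) itself; the extra set of unlabelled-identity features in [64] is omitted for brevity.

```python
import numpy as np

def oim_loss(feats, labels, memory, tau=0.1, momentum=0.5):
    """Sketch of the OIM loss (Eq. 5). `memory` is the bank of per-class
    features v_k (c x d, assumed L2-normalized), updated with a simple
    moving average after each sample."""
    losses = []
    for f, y in zip(feats, labels):
        logits = memory @ f / tau                  # v_k^T f_i / tau
        logits -= logits.max()                     # numerical stability
        p = np.exp(logits) / np.exp(logits).sum()  # softmax over c classes
        losses.append(-np.log(p[y]))
        # update the stored feature for class y and re-normalize it
        memory[y] = momentum * memory[y] + (1 - momentum) * f
        memory[y] /= np.linalg.norm(memory[y])
    return float(np.mean(losses))
```

When a feature aligns with its own class prototype the loss is near zero, and it grows large when the feature matches a different stored prototype, mirroring the online instance matching score interpretation above.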
Fig. 4: An illustration of re-ranking in person Re-ID. Given a query example, an initial rank list is retrieved, where the hard matches are ranked at the bottom. Using the top-ranked easy positive match (1) as a query to search the gallery, we can retrieve the hard matches (2) and (3) via similarity propagation in the gallery set.
Training Strategy. The batch sampling strategy plays an important role in discriminative Re-ID model learning. It is challenging since the number of annotated training images for each identity varies significantly [5]. Meanwhile, the severely imbalanced positive and negative sample pairs add further difficulty to the training strategy design [40].

The most commonly used strategy for handling the imbalance issue is identity sampling [57], [122]. For each training batch, a certain number of identities are randomly selected, and then several images are sampled from each selected identity. This batch sampling strategy guarantees informative positive and negative mining.

To handle the imbalance between positive and negative samples, adaptive sampling is a popular approach to adjust their contributions, such as Sample Rate Learning (SRL) [89] and curriculum sampling [87]. Another approach is sample re-weighting, using the sample distribution [87] or similarity difference [52] to adjust the sample weight. An efficient reference constraint is designed in [155] to transform the pairwise/triplet similarity into a sample-to-reference similarity, addressing the imbalance issue and enhancing the discriminability; it is also robust to outliers.

To adaptively combine multiple loss functions, a multi-loss dynamic training strategy [156] adaptively reweights the identity loss and triplet loss, extracting the appropriate component shared between them. This multi-loss training strategy leads to consistent performance gains.
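Identity sampling as described above is often called P×K sampling. A minimal sketch is given below; the names and the with-replacement fallback for identities with fewer than K images are illustrative choices, not a specific published implementation.

```python
import random
from collections import defaultdict

def pk_batches(labels, p=4, k=4, seed=0):
    """Identity (P x K) sampling sketch: each batch draws P identities and
    K images per identity, guaranteeing both positives and negatives for
    informative mining. Identities with fewer than K images are resampled
    with replacement."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = list(by_id)
    rng.shuffle(ids)
    batches = []
    for start in range(0, len(ids) - p + 1, p):
        batch = []
        for pid in ids[start:start + p]:
            pool = by_id[pid]
            picks = (rng.sample(pool, k) if len(pool) >= k
                     else [rng.choice(pool) for _ in range(k)])
            batch.extend(picks)
        batches.append(batch)
    return batches
```

Each resulting batch contains exactly P distinct identities with K images apiece, so every anchor has both same-identity and different-identity samples available, which is precisely what batch-hard triplet mining requires.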
Ranking optimization plays a crucial role in improving retrieval performance at the testing stage. Given an initial ranking list, it optimizes the ranking order, either by automatic gallery-to-gallery similarity mining [58], [157] or by human interaction [158], [159]. Rank/metric fusion [160], [161] is another popular approach that improves the ranking performance with multiple ranking lists as input.
The basic idea of re-ranking is to utilize the gallery-to-gallery similarity to optimize the initial ranking list, as shown in Fig. 4. Top-ranked similarity pulling and bottom-ranked dissimilarity pushing are proposed in [157]. The widely used k-reciprocal re-ranking [58] mines the contextual information. A similar idea for contextual information modeling is applied in [25]. Bai et al. [162] utilize the geometric structure of the underlying manifold. An expanded cross neighborhood re-ranking method [18] is introduced by integrating the cross-neighborhood distance. A local blurring re-ranking [95] employs the clustering structure to improve the neighborhood similarity measurement.

Query Adaptive.
Considering the query difference, some methods design a query-adaptive retrieval strategy to replace the uniform searching engine, improving the performance [163], [164]. Andy et al. [163] propose a query-adaptive re-ranking method using locality preserving projections. An efficient online local metric adaptation method is presented in [164], which learns a strictly local metric with mined negative samples for each probe.
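The similarity-propagation idea of Fig. 4 can be sketched very simply: blend the direct query-gallery distance with distances propagated through the query's top-ranked gallery neighbors. This is a deliberately simplified stand-in for the k-reciprocal method of [58]; the parameters k and alpha are illustrative.

```python
import numpy as np

def rerank_by_propagation(d_qg, d_gg, k=2, alpha=0.5):
    """Toy re-ranking: refine query-gallery distances (d_qg) using
    gallery-to-gallery distances (d_gg) of the query's top-k neighbors.
    Hard matches close to an easy match in the gallery are pulled up."""
    refined = np.empty_like(d_qg)
    for q in range(d_qg.shape[0]):
        nbrs = np.argsort(d_qg[q])[:k]        # easy matches of the query
        propagated = d_gg[nbrs].mean(axis=0)  # distances seen from neighbors
        refined[q] = alpha * d_qg[q] + (1 - alpha) * propagated
    return refined
```

In the example below, a hard match that is far from the query but close to the query's easy match overtakes a distractor after propagation, exactly the effect illustrated in Fig. 4.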
Human Interaction.
It involves using human feedback to optimize the ranking list [158], which provides reliable supervision during the re-ranking process. A hybrid human-computer incremental learning model is presented in [159], which cumulatively learns from human feedback, improving the Re-ID ranking performance on-the-fly.
Rank fusion exploits multiple ranking lists obtained with different methods to improve the retrieval performance [59]. Zheng et al. [165] propose a query-adaptive late fusion method on top of an "L"-shaped observation to fuse methods. A rank aggregation method employing both similarity and dissimilarity is developed in [59]. The rank fusion process in person Re-ID is formulated as a consensus-based decision problem with graph theory [166], mapping the similarity scores obtained by multiple algorithms into a graph with path searching. A Unified Ensemble Diffusion (UED) method [161] is recently designed for metric fusion. UED maintains the advantages of three existing fusion algorithms, optimized by a new objective function and derivation. Metric ensemble learning is also studied in [160].
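As a toy illustration of rank aggregation, a Borda-count baseline is sketched below: each ranking list awards an item points by position and items are re-ranked by total points. This is only a minimal consensus scheme, far simpler than the graph-based fusion of [166] or UED [161].

```python
def borda_fusion(rank_lists):
    """Fuse several ranking lists of the same gallery items: each list
    awards (len - position) points to an item; re-rank by total points."""
    scores = {}
    for ranking in rank_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])
```

An item consistently near the top of most lists wins even if no single method ranks it first everywhere, which is the basic intuition behind consensus-based fusion.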
Datasets.
We first review the widely used datasets for the closed-world setting, including 11 image datasets (VIPeR [31], iLIDS [167], GRID [168], PRID2011 [126], CUHK01-03 [43], Market-1501 [5], DukeMTMC [42], Airport [169] and MSMT17 [44]) and 7 video datasets (PRID-2011 [126], iLIDS-VID [7], MARS [8], Duke-Video [144], Duke-Tracklet [170], LPW [171] and LS-VID [136]). The statistics of these datasets are shown in Table 2. This survey only focuses on the general large-scale datasets for deep learning methods. A comprehensive summarization of the Re-ID datasets can be found in [169] and their website. Several observations can be made in terms of the dataset collection over recent years:
1. https://github.com/NEU-Gou/awesome-reid-dataset
TABLE 2: Statistics of some commonly used datasets for closed-world person Re-ID. "both" means that it contains both hand-cropped and detected bounding boxes. "C&M" means both CMC and mAP are evaluated.
[Table 2 body omitted: per-dataset statistics, grouped into image datasets and video datasets, with Dataset and Time columns.]
1) The dataset scale (both
Evaluation Metrics.
To evaluate a Re-ID system, Cumulative Matching Characteristics (CMC) [68] and mean Average Precision (mAP) [5] are two widely used measurements.

CMC-k (a.k.a., Rank-k matching accuracy) [68] represents the probability that a correct match appears in the top-k ranked retrieved results. CMC is accurate when only one ground truth exists for each query, since it only considers the first match in the evaluation process. However, the gallery set usually contains multiple ground truths in a large camera network, and CMC cannot completely reflect the discriminability of a model across multiple cameras.

Another metric, i.e., mean Average Precision (mAP) [5], measures the average retrieval performance with multiple ground truths. It was originally widely used in image retrieval. For Re-ID evaluation, it can address the issue of two systems performing equally well in finding the first ground truth (which might be an easy match, as in Fig. 4), but having different retrieval abilities for the other hard matches.

Considering the efficiency and complexity of training a Re-ID model, some recent works [138], [139] also report the number of FLoating-point OPerations (FLOPs) and the network parameter size as evaluation metrics. These two metrics are crucial when the training/testing device has limited computational resources.

We review the state of the art from both image-based and video-based perspectives, covering methods published in top computer vision venues over the past three years.
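A minimal sketch of both metrics from a query-gallery distance matrix follows. It assumes every query has at least one gallery match and omits the same-camera/junk filtering applied on real benchmarks such as Market-1501.

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, topk=5):
    """Compute CMC (Rank-1..Rank-topk) and mAP from a distance matrix.
    CMC credits only the first correct match per query; AP averages the
    precision at the rank of every ground truth."""
    cmc = np.zeros(topk)
    aps = []
    for q in range(len(q_ids)):
        order = np.argsort(dist[q])               # gallery sorted by distance
        matches = (g_ids[order] == q_ids[q])
        hit_ranks = np.flatnonzero(matches)       # 0-based ranks of all hits
        first = hit_ranks[0]
        if first < topk:
            cmc[first:] += 1                      # first hit fills Rank-k onward
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())             # AP over all ground truths
    return cmc / len(q_ids), float(np.mean(aps))
```

Note how the two metrics differ: a query whose first match is at rank 2 gets full Rank-2 CMC credit regardless of where the remaining hard matches land, while mAP is penalized by every late ground truth.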