Deep Learning for Person Re-identification: A Survey and Outlook
Mang Ye, Jianbing Shen, Senior Member, IEEE, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi, Fellow, IEEE
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
Abstract—Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and the increasing demand for intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the components involved in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis of closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With performance saturation under the closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost of finding all the correct matches, which provides an additional criterion for evaluating a Re-ID system in real applications. Finally, some important yet under-investigated open issues are discussed.
Index Terms—Person Re-Identification, Pedestrian Retrieval, Literature Survey, Evaluation Metric, Deep Learning
1 INTRODUCTION

PERSON re-identification (Re-ID) has been widely studied as a specific person retrieval problem across non-overlapping cameras [1], [2]. Given a query person-of-interest, the goal of Re-ID is to determine whether this person has appeared in another place at a distinct time captured by a different camera, or even the same camera at a different time instant [3]. The query person can be represented by an image [4], [5], [6], a video sequence [7], [8], or even a text description [9], [10]. Due to the urgent demand for public safety and the increasing number of surveillance cameras, person Re-ID is imperative in intelligent surveillance systems, with significant research impact and practical importance.

Re-ID is a challenging task due to the presence of different viewpoints [11], [12], varying low image resolutions [13], [14], illumination changes [15], unconstrained poses [16], [17], [18], occlusions [19], [20], heterogeneous modalities [10], [21], complex camera environments, background clutter [22], unreliable bounding box generation, etc. These result in large variations and uncertainty. In addition, for practical model deployment, dynamically updated camera networks [23], [24], large-scale galleries with efficient retrieval [25], group uncertainty [26], significant domain shift [27], unseen testing scenarios [28], incremental model updating [29] and changing clothes [30] also greatly increase the difficulties. These challenges mean that Re-ID remains an unsolved problem. Early research efforts mainly focus on hand-crafted feature construction with body structures [31], [32], [33], [34], [35] or distance metric learning [36], [37], [38], [39], [40], [41]. With the advancement of deep learning, person Re-ID has achieved inspiring performance on the widely used benchmarks [5], [42], [43], [44]. However, there is still a large gap between the research-oriented scenarios and practical applications [45]. This motivates us to conduct a comprehensive survey, develop a powerful baseline for different Re-ID tasks, and discuss several future directions.

Though some surveys have also summarized the deep learning techniques [2], [46], [47], our survey makes three major differences: 1) We provide an in-depth and comprehensive analysis of existing deep learning methods by discussing their advantages and limitations, analyzing the state of the art. This provides insights for future algorithm design and new topic exploration. 2) We design a new powerful baseline (AGW: Attention Generalized mean pooling with Weighted triplet loss) and a new evaluation metric (mINP: mean Inverse Negative Penalty) for future developments. AGW achieves state-of-the-art performance on twelve datasets for four different Re-ID tasks. mINP provides a supplement to the existing CMC/mAP metrics, indicating the cost to find all the correct matches. 3) We make an attempt to discuss several important research directions with under-investigated open issues to narrow the gap between the closed-world and open-world applications, taking a step towards real-world Re-ID system design.

• M. Ye is with the School of Computer Science, Wuhan University, China, and the Inception Institute of Artificial Intelligence, UAE.
• J. Shen and L. Shao are with the Inception Institute of Artificial Intelligence, UAE. E-mail: {mangye16, shenjianbingcg}@gmail.com
• G. Lin is with the School of Computer Science, Beijing Institute of Technology, China.
• T. Xiang is with the Centre for Vision Speech and Signal Processing, University of Surrey, UK. Email: [email protected]
• S. C. H. Hoi is with Singapore Management University, and Salesforce Research Asia, Singapore. Email: [email protected]
Fig. 1: The flow of designing a practical person Re-ID system, including five main steps: 1) Raw Data Collection, 2) Bounding Box Generation, 3) Training Data Annotation, 4) Model Training and 5) Pedestrian Retrieval.

TABLE 1: Closed-world vs. Open-world Person Re-ID.

Closed-world (Section 2)    | Open-world (Section 3)
----------------------------|------------------------------------
Single-modality Data        | Heterogeneous Data (§ 3.1)
Bounding Boxes Generation   | Raw Images/Videos (§ 3.2)
Sufficient Annotated Data   | Unavailable/Limited Labels (§ 3.3)
Correct Annotation          | Noisy Annotation (§ 3.4)
Query Exists in Gallery     | Open-set (§ 3.5)
Unless otherwise specified, person Re-ID in this survey refers to the pedestrian retrieval problem across multiple surveillance cameras, from a computer vision perspective. Generally, building a person Re-ID system for a specific scenario requires five main steps (as shown in Fig. 1):

1) Step 1: Raw Data Collection: Obtaining raw video data from surveillance cameras is the primary requirement of practical video investigation. These cameras are usually located in different places under varying environments [48]. Most likely, this raw data contains a large amount of complex and noisy background clutter.

2) Step 2: Bounding Box Generation: Extracting the bounding boxes which contain the person images from the raw video data. Generally, it is impossible to manually crop all the person images in large-scale applications. The bounding boxes are usually obtained by person detection [49], [50] or tracking algorithms [51], [52].

3) Step 3: Training Data Annotation: Annotating the cross-camera labels. Training data annotation is usually indispensable for discriminative Re-ID model learning due to the large cross-camera variations. In the existence of large domain shift [53], we often need to annotate the training data in every new scenario.

4) Step 4: Model Training: Training a discriminative and robust Re-ID model with the previously annotated person images/videos. This step is the core of developing a Re-ID system and it is also the most widely studied paradigm in the literature. Extensive models have been developed to handle the various challenges, concentrating on feature representation learning [54], [55], distance metric learning [56], [57] or their combinations.

5) Step 5: Pedestrian Retrieval: The testing phase conducts the pedestrian retrieval. Given a person-of-interest (query) and a gallery set, we extract the feature representations using the Re-ID model learned in the previous stage. A retrieved ranking list is obtained by sorting the calculated query-to-gallery similarity. Some methods have also investigated ranking optimization to improve the retrieval performance [58], [59].

According to the five steps mentioned above, we categorize existing Re-ID methods into two main trends: closed-world and open-world settings, as summarized in Table 1. A step-by-step comparison is made in the following five aspects: 1)
Single-modality vs. Heterogeneous Data: For the raw data collection in Step 1, all the persons are represented by images/videos captured by single-modality visible cameras in the closed-world setting [5], [8], [31], [42], [43], [44]. However, in practical open-world applications, we might also need to process heterogeneous data, such as infrared images [21], [60], sketches [61], depth images [62], or even text descriptions [63]. This motivates the heterogeneous Re-ID in § 3.1.

2) Bounding Box Generation vs. Raw Images/Videos: For the bounding box generation in Step 2, closed-world person Re-ID usually performs the training and testing based on the generated bounding boxes, where the bounding boxes mainly contain the person appearance information. In contrast, some practical open-world applications require end-to-end person search from the raw images or videos [55], [64]. This leads to another open-world topic, i.e., end-to-end person search in § 3.2.

3) Sufficient Annotated Data vs. Unavailable/Limited Labels: For the training data annotation in Step 3, closed-world person Re-ID usually assumes that we have enough annotated training data for supervised Re-ID model training. However, label annotation for each camera pair in every new environment is time-consuming and labor-intensive, incurring high costs. In open-world scenarios, we might not have enough annotated data (i.e., limited labels) [65] or even any label information [66]. This inspires the discussion of unsupervised and semi-supervised Re-ID in § 3.3.

4) Correct Annotation vs. Noisy Annotation: For Step 4, existing closed-world person Re-ID systems usually assume that all the annotations are correct, with clean labels. However, annotation noise is usually unavoidable due to annotation error (i.e., label noise) or imperfect detection/tracking results (i.e., sample noise, partial Re-ID [67]).
This leads to the analysis of noise-robust person Re-ID under different noise types in § 3.4.

5) Query Exists in Gallery vs. Open-set: In the pedestrian retrieval stage (Step 5), most existing closed-world person Re-ID works assume that the query must occur in the gallery set, evaluated by the CMC [68] and mAP [5]. However, in many scenarios, the query person may not appear in the gallery set [69], [70], or we need to perform verification rather than retrieval [26]. This brings us to the open-set person Re-ID in § 3.5.

This survey first introduces the widely studied person Re-ID under closed-world settings in § 2. A detailed review of the datasets and the state of the art is conducted in § 2.4. We then introduce the open-world person Re-ID in § 3. An outlook for future Re-ID is presented in § 4, including a new evaluation metric (§ 4.1) and a new powerful AGW baseline (§ 4.2). We discuss several under-investigated open issues for future study (§ 4.3). Conclusions are drawn in § 5. A structure overview is shown in the supplementary.
2 CLOSED-WORLD PERSON RE-IDENTIFICATION
This section provides an overview of closed-world person Re-ID. As discussed in § 1, this setting usually has the following assumptions: 1) person appearances are captured by single-modality visible cameras, either as images or video; 2) the persons are represented by bounding boxes, where most of the bounding box area belongs to the same identity; 3) sufficient annotated training data is available for supervised discriminative Re-ID model learning; 4) the annotations are generally correct; and 5) the query person must appear in the gallery set. Typically, a standard closed-world Re-ID system contains three main components: Feature Representation Learning (§ 2.1), which focuses on developing the feature construction strategies; Deep Metric Learning (§ 2.2), which aims at designing the training objectives with different loss functions or sampling strategies; and Ranking Optimization (§ 2.3), which concentrates on optimizing the retrieved ranking list. An overview of the datasets and state of the art with in-depth analysis is provided in § 2.4.

2.1 Feature Representation Learning
We first discuss the feature learning strategies in closed-world person Re-ID. There are four main categories (as shown in Fig. 2): a) Global Feature (§ 2.1.1), which extracts a global feature representation vector for each person image without additional annotation cues [55]; b) Local Feature (§ 2.1.2), which aggregates part-level local features to formulate a combined representation for each person image [75], [76], [77]; c) Auxiliary Feature (§ 2.1.3), which improves the feature representation learning using auxiliary information, e.g., attributes [71], [72], [78], GAN-generated images [42], etc.; and d) Video Feature (§ 2.1.4), which learns a video representation for video-based Re-ID [7] using multiple image frames and temporal information [73], [74]. We also review several specific architecture designs for person Re-ID in § 2.1.5.
2.1.1 Global Feature Representation Learning

Global feature representation learning extracts a global feature vector for each person image, as shown in Fig. 2(a). Since deep neural networks were originally applied to image classification [79], [80], global feature learning was the primary choice when integrating advanced deep learning techniques into the person Re-ID field in the early years.

To capture fine-grained cues in global feature learning, a joint learning framework consisting of a single-image representation (SIR) and a cross-image representation (CIR) is developed in [81], trained with a triplet loss using specific sub-networks. The widely used ID-discriminative Embedding (IDE) model [55] constructs the training process as a multi-class classification problem by treating each identity as a distinct class. It is now widely used in the Re-ID community [42], [58], [77], [82], [83]. Qian et al. [84] develop a multi-scale deep representation learning model to capture discriminative cues at different scales.
Attention Information. Attention schemes have been widely studied in the literature to enhance representation learning [85]. 1) Group 1: Attention within the person image. Typical strategies include pixel-level attention [86] and channel-wise feature response re-weighting [86], [87], [88], [89], or background suppression [22]. The spatial information is integrated in [90]. 2) Group 2: Attention across multiple person images. A context-aware attentive feature learning method is proposed in [91], incorporating both intra-sequence and inter-sequence attention for pair-wise feature alignment and refinement. The attention consistency property is added in [92], [93]. Group similarity [94], [95] is another popular approach to leverage cross-image attention, which involves multiple images for local and global similarity modeling. The first group mainly enhances the robustness against misalignment/imperfect detection, and the second improves the feature learning by mining the relations across multiple images.
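As a rough, dependency-free illustration of the channel-wise re-weighting idea in Group 1 (a sketch in the style of squeeze-and-excitation, not the exact formulation of any cited method; the weights `w1`/`w2` stand in for learned parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, w1, w2):
    """Re-weight each channel of a feature map by a learned gate.

    feature_maps: list of C channels, each a 2-D list (H x W) of floats.
    w1, w2: toy C x C fully-connected weights of a squeeze-excite MLP.
    Returns the channel-wise re-weighted feature maps.
    """
    C = len(feature_maps)
    # Squeeze: global average pooling per channel -> C-dim descriptor.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]
    # Excite: two FC layers (ReLU, then sigmoid) give per-channel gates in (0, 1).
    hidden = [max(0.0, sum(w1[i][j] * desc[j] for j in range(C))) for i in range(C)]
    gates = [sigmoid(sum(w2[i][j] * hidden[j] for j in range(C))) for i in range(C)]
    # Re-weight: scale every spatial location of channel c by its gate.
    return [[[v * gates[c] for v in row] for row in feature_maps[c]]
            for c in range(C)]
```

Channels with stronger average responses receive gates closer to 1 and are emphasized, which is the basic mechanism behind the channel-wise feature response re-weighting discussed above.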
2.1.2 Local Feature Representation Learning

Local feature learning learns part/region aggregated features, making it robust against misalignment [77], [96]. The body parts are either automatically generated by human parsing/pose estimation (Group 1) or obtained by roughly horizontal division (Group 2).

With automatic body part detection, the popular solution is to combine the full-body representation and local part features [97], [98]. Specifically, multi-channel aggregation [99], multi-scale context-aware convolutions [100], multi-stage feature decomposition [17] and bilinear pooling [97] are designed to improve the local feature learning. Rather than feature-level fusion, part-level similarity combination is also studied in [98]. Another popular solution is to enhance the robustness against background clutter, using pose-driven matching [101], a pose-guided part attention module [102], or semantic part alignment [103], [104].

For horizontally divided region features, multiple part-level classifiers are learned in the Part-based Convolutional Baseline (PCB) [77], which now serves as a strong part feature learning baseline in the current state of the art [28], [105], [106]. To capture the relations across multiple body parts, the Siamese Long Short-Term Memory (LSTM) architecture [96], second-order non-local attention [107] and Interaction-and-Aggregation (IA) [108] are designed to reinforce the feature learning.

The first group uses human parsing techniques to obtain semantically meaningful body parts, which provides well-aligned part features. However, it requires an additional pose detector and is prone to noisy pose detections [77]. The second group uses a uniform partition to obtain the horizontal stripe parts, which is more flexible, but is sensitive to heavy occlusions and large background clutter.
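The uniform horizontal partition of Group 2 can be sketched in a few lines (a toy stand-in for the PCB-style stripe pooling, assuming the stripe count evenly divides the feature-map height; in the actual method each stripe feature then feeds its own part-level classifier):

```python
def stripe_features(feature_map, num_stripes=6):
    """Split a feature map into horizontal stripes and average-pool each.

    feature_map: H x W x C tensor as nested lists (feature_map[h][w] is a
    C-dim list). Returns one C-dim part feature per stripe.
    """
    H = len(feature_map)
    C = len(feature_map[0][0])
    assert H % num_stripes == 0, "toy sketch assumes H divisible by num_stripes"
    rows_per = H // num_stripes
    parts = []
    for s in range(num_stripes):
        rows = feature_map[s * rows_per:(s + 1) * rows_per]
        count = sum(len(r) for r in rows)
        # Average over all spatial positions within the stripe, per channel.
        part = [sum(cell[c] for r in rows for cell in r) / count
                for c in range(C)]
        parts.append(part)
    return parts
```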
2.1.3 Auxiliary Feature Representation Learning

Auxiliary feature representation learning usually requires additional annotated information (e.g., semantic attributes [71]) or generated/augmented training samples to reinforce the feature representation [19], [42].

Fig. 2: Four different feature learning strategies. a) Global Feature, learning a global representation for each person image in § 2.1.1; b) Local Feature, learning part-aggregated local features in § 2.1.2; c) Auxiliary Feature, learning the feature representation using auxiliary information, e.g., attributes [71], [72], in § 2.1.3; and d) Video Feature, learning the video representation using multiple image frames and temporal information [73], [74], in § 2.1.4.

Semantic Attributes. A joint identity and attribute learning baseline is introduced in [72]. Su et al. [71] propose a deep attribute learning framework by incorporating the predicted semantic attribute information, enhancing the generalizability and robustness of the feature representation in a semi-supervised learning manner. Both the semantic attributes and the attention scheme are incorporated to improve part feature learning [109]. Semantic attributes are also adopted in [110] for video Re-ID feature representation learning. They are also leveraged as auxiliary supervision information in unsupervised learning [111].
Viewpoint Information. The viewpoint information is also leveraged to enhance the feature representation learning [112], [113]. Multi-Level Factorisation Net (MLFN) [112] also tries to learn identity-discriminative and view-invariant feature representations at multiple semantic levels. Liu et al. [113] learn a combination of view-generic and view-specific features. An angular regularization is incorporated in [114] for viewpoint-aware feature learning.
Domain Information. A Domain Guided Dropout (DGD) algorithm [54] is designed to adaptively mine the domain-sharable and domain-specific neurons for multi-domain deep feature representation learning. Treating each camera as a distinct domain, Lin et al. [115] propose a multi-camera consistent matching constraint to obtain a globally optimal representation in a deep learning framework. Similarly, the camera view information or the detected camera location is also applied in [18] to improve the feature representation with camera-specific information modeling.
GAN Generation. This section discusses the use of GAN-generated images as auxiliary information. Zheng et al. [42] make the first attempt to apply the GAN technique to person Re-ID, improving supervised feature representation learning with the generated person images. Pose constraints are incorporated in [116] to improve the quality of the generated person images, producing person images with new pose variants. A pose-normalized image generation approach is designed in [117], which enhances the robustness against pose variations. Camera style information [118] is also integrated into the image generation process to address cross-camera variations. A joint discriminative and generative learning model [119] separately learns the appearance and structure codes to improve the image generation quality. Using GAN-generated images is also a widely used approach in unsupervised domain adaptation Re-ID [120], [121], approximating the target distribution.
Data Augmentation. For Re-ID, commonly used operations include random resizing, cropping and horizontal flipping [122]. Besides, adversarially occluded samples [19] are generated to augment the variation of the training data. A similar random erasing strategy is proposed in [123], adding random noise to the input images. A batch DropBlock [124] randomly drops a region block in the feature map to reinforce attentive feature learning. Bak et al. [125] generate virtual humans rendered under different illumination conditions. These methods enrich the supervision with the augmented samples, improving the generalizability on the testing set.
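The random erasing idea above can be sketched as follows (a minimal grayscale version, not the exact formulation of [123], which additionally randomizes the erased region's area and aspect ratio within configurable ranges):

```python
import random

def random_erasing(image, erase_prob=0.5, area_frac=0.25, rng=None):
    """Randomly blank out a rectangular region of an H x W image.

    image: 2-D list of pixel values (grayscale for simplicity).
    With probability erase_prob, a sub-rectangle covering roughly
    area_frac of the image is replaced by random noise, simulating
    occlusion at training time. The input image is left untouched.
    """
    rng = rng or random.Random()
    if rng.random() > erase_prob:
        return image
    H, W = len(image), len(image[0])
    eh = max(1, int(H * area_frac ** 0.5))
    ew = max(1, int(W * area_frac ** 0.5))
    top = rng.randrange(H - eh + 1)
    left = rng.randrange(W - ew + 1)
    out = [row[:] for row in image]
    for i in range(top, top + eh):
        for j in range(left, left + ew):
            out[i][j] = rng.random()  # fill the erased region with noise
    return out
```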
2.1.4 Video Feature Representation Learning

Video-based Re-ID is another popular topic [126], where each person is represented by a video sequence with multiple frames. Due to the rich appearance and temporal information, it has gained increasing interest in the Re-ID community. This also brings additional challenges for video feature representation learning with multiple images.

The primary challenge is to accurately capture the temporal information. A recurrent neural network architecture is designed for video-based person Re-ID [127], which jointly optimizes the final recurrent layer for temporal information propagation and the temporal pooling layer. A weighted scheme for spatial and temporal streams is developed in [128]. Yan et al. [129] present a progressive/sequential fusion framework to aggregate the frame-level human region representations. Semantic attributes are also adopted in [110] for video Re-ID with feature disentangling and frame re-weighting. Jointly aggregating the frame-level features and spatio-temporal appearance information is crucial for video representation learning [130], [131], [132].

Another major challenge is the unavoidable outlier tracking frames within the videos. Informative frames are selected in a joint Spatial and Temporal Attention Pooling Network (ASTPN) [131], and the contextual information is integrated in [130]. A co-segmentation inspired attention model [132] detects salient features across multiple video frames with mutual consensus estimation. A diversity regularization [133] is employed to mine multiple discriminative body parts in each video sequence. An affine hull is adopted to handle the outlier frames within the video sequence [83]. An interesting work [20] utilizes the multiple video frames to auto-complete occluded regions. These works demonstrate that handling the noisy frames can greatly improve video representation learning.

It is also challenging to handle the varying lengths of video sequences. Chen et al. [134] divide the long video sequences into multiple short snippets, aggregating the top-ranked snippets to learn a compact embedding. A clip-level learning strategy [135] exploits both spatial and temporal attention cues to produce a robust clip-level representation. Both the short- and long-term relations [136] are integrated in a self-attention scheme.
2.1.5 Specific Architecture Design

Framing person Re-ID as a specific pedestrian retrieval problem, most existing works adopt the network architectures [79], [80] designed for image classification as the backbone. Some works have tried to modify the backbone architecture to achieve better Re-ID features. For the widely used ResNet50 backbone [80], the important modifications include changing the stride of the last convolutional stage to 1 [77], employing adaptive average pooling in the last pooling layer [77], and adding a bottleneck layer with batch normalization after the pooling layer [82].

Accuracy is a major concern in specific Re-ID network architecture design. Li et al. [43] make the first attempt by designing a filter pairing neural network (FPNN), which jointly handles misalignment and occlusions with part discriminative information mining. Wang et al. [89] propose BraidNet with a specially designed WConv layer and Channel Scaling layer. The WConv layer extracts the difference information of two images to enhance the robustness against misalignment, and the Channel Scaling layer optimizes the scaling factor of each input channel. A Multi-Level Factorisation Net (MLFN) [112] contains multiple stacked blocks to model various latent factors at a specific level, and the factors are dynamically selected to formulate the final representation. An efficient fully convolutional Siamese network [137] with a convolution similarity module is developed to optimize multi-level similarity measurement. The similarity is efficiently captured and optimized using depth-wise convolution.

Efficiency is another important factor in Re-ID architecture design. An efficient small-scale network, namely the Omni-Scale Network (OSNet) [138], is designed by incorporating point-wise and depth-wise convolutions. To achieve multi-scale feature learning, a residual block composed of multiple convolutional streams is introduced.

With the increasing interest in automated machine learning, an Auto-ReID [139] model is proposed. Auto-ReID provides an efficient and effective automated neural architecture design based on a set of basic architecture components, using a part-aware module to capture the discriminative local Re-ID features. This provides a potential research direction for exploring powerful domain-specific architectures.
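A related architectural tweak, generalized-mean (GeM) pooling, appears in the AGW baseline mentioned in § 1; a minimal per-channel sketch (the real layer learns the exponent `p` end-to-end, which is omitted here):

```python
def gem_pool(channel_values, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over one channel's spatial activations.

    channel_values: flat list of non-negative activations for one channel.
    p = 1 reduces to average pooling; as p grows, the result approaches
    max pooling, so a (learnable) p interpolates between the two.
    """
    n = len(channel_values)
    clamped = [max(v, eps) for v in channel_values]  # avoid 0 ** (1/p) issues
    return (sum(v ** p for v in clamped) / n) ** (1.0 / p)
```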
2.2 Deep Metric Learning

Metric learning was extensively studied before the deep learning era, by learning a Mahalanobis distance function [36], [37] or a projection matrix [40]. In deep learning, the role of metric learning has been replaced by loss function designs that guide the feature representation learning. We first review the widely used loss functions in § 2.2.1 and then summarize the training strategies with specific sampling designs in § 2.2.2.
2.2.1 Loss Function Design

This survey only focuses on the loss functions designed for deep learning [56]. An overview of the distance metric learning designed for hand-crafted systems can be found in [2], [143]. There are three widely studied loss functions, with their variants, in the person Re-ID literature: the identity loss, the verification loss and the triplet loss. An illustration of the three loss functions is shown in Fig. 3.
Fig. 3: Three kinds of widely used loss functions in the literature. (a) Identity Loss [42], [82], [118], [140]; (b) Verification Loss [94], [141]; and (c) Triplet Loss [14], [22], [57]. Many works employ their combinations [87], [137], [141], [142].
Identity Loss. It treats the training process of person Re-ID as an image classification problem [55], i.e., each identity is a distinct class. In the testing phase, the output of the pooling layer or embedding layer is adopted as the feature extractor. Given an input image $x_i$ with label $y_i$, the predicted probability of $x_i$ being recognized as class $y_i$ is encoded with a softmax function, represented by $p(y_i|x_i)$. The identity loss is then computed by the cross-entropy

$$\mathcal{L}_{id} = -\frac{1}{n}\sum_{i=1}^{n} \log\left(p(y_i|x_i)\right), \qquad (1)$$

where $n$ represents the number of training samples within each batch. The identity loss has been widely used in existing methods [19], [42], [82], [92], [95], [106], [118], [120], [140], [144]. Generally, it is easy to train and automatically mines hard samples during the training process, as demonstrated in [145]. Several works have also investigated softmax variants [146], such as the sphere loss in [147] and AM softmax in [95]. Another simple yet effective strategy, i.e., label smoothing [42], [122], is generally integrated into the standard softmax cross-entropy loss. Its basic idea is to avoid the model fitting to over-confident annotated labels, improving the generalizability [148].

Verification Loss. It optimizes the pairwise relationship, either with a contrastive loss [96], [120] or a binary verification loss [43], [141]. The contrastive loss improves the relative pairwise distance comparison, formulated by

$$\mathcal{L}_{con} = (1-\delta_{ij})\left\{\max(0, \rho - d_{ij})\right\}^2 + \delta_{ij} d_{ij}^2, \qquad (2)$$

where $d_{ij}$ represents the Euclidean distance between the embedding features of two input samples $x_i$ and $x_j$. $\delta_{ij}$ is a binary label indicator ($\delta_{ij} = 1$ when $x_i$ and $x_j$ belong to the same identity, and $\delta_{ij} = 0$ otherwise). $\rho$ is a margin parameter. There are several variants, e.g., the pairwise comparison with ranking SVM in [81].

Binary verification [43], [141] discriminates the positive and negative of an input image pair. Generally, a differential feature $f_{ij}$ is obtained by $f_{ij} = (f_i - f_j)^2$ [141], where $f_i$ and $f_j$ are the embedding features of two samples $x_i$ and $x_j$. The verification network classifies the differential feature into positive or negative. We use $p(\delta_{ij}|f_{ij})$ to represent the probability of an input pair ($x_i$ and $x_j$) being recognized as $\delta_{ij}$ (0 or 1). The verification loss with cross-entropy is

$$\mathcal{L}_{veri}(i,j) = -\delta_{ij}\log\left(p(\delta_{ij}|f_{ij})\right) - (1-\delta_{ij})\log\left(1-p(\delta_{ij}|f_{ij})\right). \qquad (3)$$

The verification loss is often combined with the identity loss to improve the performance [94], [96], [120], [141].

Triplet Loss.
It treats the Re-ID model training process as a retrieval ranking problem. The basic idea is that the distance between the positive pair should be smaller than that of the negative pair by a pre-defined margin [57]. Typically, a triplet contains one anchor sample $x_i$, one positive sample $x_j$ with the same identity, and one negative sample $x_k$ from a different identity. The triplet loss with a margin parameter $\rho$ is represented by

$$\mathcal{L}_{tri}(i,j,k) = \max(\rho + d_{ij} - d_{ik},\ 0), \qquad (4)$$

where $d(\cdot)$ measures the Euclidean distance between two samples. The large proportion of easy triplets will dominate the training process if we directly optimize the above loss function, resulting in limited discriminability. To alleviate this issue, various informative triplet mining methods have been designed [14], [22], [57], [97]. The basic idea is to select the informative triplets [57], [149]. Specifically, a moderate positive mining with a weight constraint is introduced in [149], which directly optimizes the feature difference. Hermans et al. [57] demonstrate that online hardest positive and negative mining within each training batch is beneficial for discriminative Re-ID model learning. Some methods also studied point-to-set similarity strategies for informative triplet mining [150], [151]. This enhances robustness against outlier samples with a soft hard-mining scheme.

To further enrich the triplet supervision, a quadruplet deep network is developed in [152], where each quadruplet contains one anchor sample, one positive sample and two mined negative samples. The quadruplets are formulated with a margin-based online hard negative mining. Optimizing the quadruplet relationship results in smaller intra-class variation and larger inter-class variation.

The combination of triplet loss and identity loss is one of the most popular solutions for deep Re-ID model learning [28], [87], [90], [93], [103], [104], [116], [137], [142], [153], [154].
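As a rough, framework-free sketch of this popular combination (plain Python over toy feature lists; the batch-hard mining follows the spirit of the hardest positive/negative strategy discussed above, not any one cited implementation):

```python
import math

def identity_loss(logits, label):
    """Softmax cross-entropy for one sample (Eq. (1) without batch averaging)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -(logits[label] - log_z)

def batch_hard_triplet(features, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, use its hardest positive
    (farthest same-identity sample) and hardest negative (closest
    different-identity sample) within the batch."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    losses = []
    for i, (f, y) in enumerate(zip(features, labels)):
        pos = [dist(f, g) for j, (g, z) in enumerate(zip(features, labels))
               if z == y and j != i]
        neg = [dist(f, g) for g, z in zip(features, labels) if z != y]
        if pos and neg:
            losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses)
```

In practice the two terms are summed (often with a weighting factor) and minimized jointly, so the classifier shapes identity-discriminative features while the triplet term enforces the ranking margin.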
The triplet loss and identity loss are mutually beneficial for discriminative feature representation learning.

OIM loss.
In addition to the above three kinds of loss functions, an Online Instance Matching (OIM) loss [64] is designed with a memory bank scheme. A memory bank {v_k, k = 1, 2, · · · , c} contains the stored instance features, where c denotes the class number. The OIM loss is then formulated by

L_oim = −(1/n) Σ_{i=1}^{n} log [ exp(v_{y_i}^T f_i / τ) / Σ_{k=1}^{c} exp(v_k^T f_i / τ) ],   (5)

where v_{y_i} represents the stored memory feature corresponding to class y_i, and τ is a temperature parameter that controls the similarity space [145]. v_{y_i}^T f_i measures the online instance matching score. The comparison with a memorized feature set of unlabelled identities is further included in the denominator [64], handling the large number of instances from non-targeted identities. This memory scheme is also adopted in unsupervised domain adaptive Re-ID [106].
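A minimal sketch of Eq. (5) follows. The moving-average memory update and the L2 normalization of the stored features are common choices but are assumptions here, not fixed by Eq. (5) itself; the extra set of unlabelled-identity features in [64] is omitted for brevity.

```python
import numpy as np

def oim_loss(feats, labels, memory, tau=0.1, momentum=0.5):
    """Sketch of the OIM loss (Eq. 5). `memory` is the bank of per-class
    features v_k (c x d, assumed L2-normalized), updated with a simple
    moving average after each sample."""
    losses = []
    for f, y in zip(feats, labels):
        logits = memory @ f / tau                  # v_k^T f_i / tau
        logits -= logits.max()                     # numerical stability
        p = np.exp(logits) / np.exp(logits).sum()  # softmax over c classes
        losses.append(-np.log(p[y]))
        # update the stored feature for class y and re-normalize it
        memory[y] = momentum * memory[y] + (1 - momentum) * f
        memory[y] /= np.linalg.norm(memory[y])
    return float(np.mean(losses))
```

When a feature aligns with its own class prototype the loss is near zero, and it grows large when the feature matches a different stored prototype, mirroring the online instance matching score interpretation above.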
Fig. 4: An illustration of re-ranking in person Re-ID. Given a query example, an initial rank list is retrieved, where the hard matches are ranked at the bottom. Using the top-ranked easy positive match (1) as a query to search the gallery, we can retrieve the hard matches (2) and (3) via similarity propagation in the gallery set.
Training Strategy. The batch sampling strategy plays an important role in discriminative Re-ID model learning. It is challenging since the number of annotated training images for each identity varies significantly [5]. Meanwhile, the severely imbalanced positive and negative sample pairs add further difficulty to the training strategy design [40].

The most commonly used strategy for handling the imbalance issue is identity sampling [57], [122]. For each training batch, a certain number of identities are randomly selected, and then several images are sampled from each selected identity. This batch sampling strategy guarantees informative positive and negative mining.

To handle the imbalance between positive and negative samples, adaptive sampling is a popular approach to adjust their contributions, such as Sample Rate Learning (SRL) [89] and curriculum sampling [87]. Another approach is sample re-weighting, using the sample distribution [87] or similarity difference [52] to adjust the sample weight. An efficient reference constraint is designed in [155] to transform the pairwise/triplet similarity into a sample-to-reference similarity, addressing the imbalance issue and enhancing the discriminability; it is also robust to outliers.

To adaptively combine multiple loss functions, a multi-loss dynamic training strategy [156] adaptively reweights the identity loss and triplet loss, extracting the appropriate component shared between them. This multi-loss training strategy leads to consistent performance gains.
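Identity sampling as described above is often called P×K sampling. A minimal sketch is given below; the names and the with-replacement fallback for identities with fewer than K images are illustrative choices, not a specific published implementation.

```python
import random
from collections import defaultdict

def pk_batches(labels, p=4, k=4, seed=0):
    """Identity (P x K) sampling sketch: each batch draws P identities and
    K images per identity, guaranteeing both positives and negatives for
    informative mining. Identities with fewer than K images are resampled
    with replacement."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = list(by_id)
    rng.shuffle(ids)
    batches = []
    for start in range(0, len(ids) - p + 1, p):
        batch = []
        for pid in ids[start:start + p]:
            pool = by_id[pid]
            picks = (rng.sample(pool, k) if len(pool) >= k
                     else [rng.choice(pool) for _ in range(k)])
            batch.extend(picks)
        batches.append(batch)
    return batches
```

Each resulting batch contains exactly P distinct identities with K images apiece, so every anchor has both same-identity and different-identity samples available, which is precisely what batch-hard triplet mining requires.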
Ranking optimization plays a crucial role in improving retrieval performance at the testing stage. Given an initial ranking list, it optimizes the ranking order, either by automatic gallery-to-gallery similarity mining [58], [157] or by human interaction [158], [159]. Rank/metric fusion [160], [161] is another popular approach that improves the ranking performance with multiple ranking lists as input.
The basic idea of re-ranking is to utilize the gallery-to-gallery similarity to optimize the initial ranking list, as shown in Fig. 4. Top-ranked similarity pulling and bottom-ranked dissimilarity pushing are proposed in [157]. The widely used k-reciprocal re-ranking [58] mines the contextual information. A similar idea for contextual information modeling is applied in [25]. Bai et al. [162] utilize the geometric structure of the underlying manifold. An expanded cross neighborhood re-ranking method [18] is introduced by integrating the cross-neighborhood distance. A local blurring re-ranking [95] employs the clustering structure to improve the neighborhood similarity measurement.

Query Adaptive.
Considering the query difference, some methods design a query-adaptive retrieval strategy to replace the uniform searching engine, improving the performance [163], [164]. Andy et al. [163] propose a query-adaptive re-ranking method using locality preserving projections. An efficient online local metric adaptation method is presented in [164], which learns a strictly local metric with mined negative samples for each probe.
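The similarity-propagation idea of Fig. 4 can be sketched very simply: blend the direct query-gallery distance with distances propagated through the query's top-ranked gallery neighbors. This is a deliberately simplified stand-in for the k-reciprocal method of [58]; the parameters k and alpha are illustrative.

```python
import numpy as np

def rerank_by_propagation(d_qg, d_gg, k=2, alpha=0.5):
    """Toy re-ranking: refine query-gallery distances (d_qg) using
    gallery-to-gallery distances (d_gg) of the query's top-k neighbors.
    Hard matches close to an easy match in the gallery are pulled up."""
    refined = np.empty_like(d_qg)
    for q in range(d_qg.shape[0]):
        nbrs = np.argsort(d_qg[q])[:k]        # easy matches of the query
        propagated = d_gg[nbrs].mean(axis=0)  # distances seen from neighbors
        refined[q] = alpha * d_qg[q] + (1 - alpha) * propagated
    return refined
```

In the example below, a hard match that is far from the query but close to the query's easy match overtakes a distractor after propagation, exactly the effect illustrated in Fig. 4.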
Human Interaction.
It involves using human feedback to optimize the ranking list [158], which provides reliable supervision during the re-ranking process. A hybrid human-computer incremental learning model is presented in [159], which cumulatively learns from human feedback, improving the Re-ID ranking performance on-the-fly.
Rank fusion exploits multiple ranking lists obtained with different methods to improve the retrieval performance [59]. Zheng et al. [165] propose a query-adaptive late fusion method on top of an "L"-shaped observation to fuse methods. A rank aggregation method employing both similarity and dissimilarity is developed in [59]. The rank fusion process in person Re-ID is formulated as a consensus-based decision problem with graph theory [166], mapping the similarity scores obtained by multiple algorithms into a graph with path searching. A Unified Ensemble Diffusion (UED) method [161] is recently designed for metric fusion. UED maintains the advantages of three existing fusion algorithms, optimized by a new objective function and derivation. Metric ensemble learning is also studied in [160].
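As a toy illustration of rank aggregation, a Borda-count baseline is sketched below: each ranking list awards an item points by position and items are re-ranked by total points. This is only a minimal consensus scheme, far simpler than the graph-based fusion of [166] or UED [161].

```python
def borda_fusion(rank_lists):
    """Fuse several ranking lists of the same gallery items: each list
    awards (len - position) points to an item; re-rank by total points."""
    scores = {}
    for ranking in rank_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])
```

An item consistently near the top of most lists wins even if no single method ranks it first everywhere, which is the basic intuition behind consensus-based fusion.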
Datasets.
We first review the widely used datasets for the closed-world setting, including 11 image datasets (VIPeR [31], iLIDS [167], GRID [168], PRID2011 [126], CUHK01-03 [43], Market-1501 [5], DukeMTMC [42], Airport [169] and MSMT17 [44]) and 7 video datasets (PRID-2011 [126], iLIDS-VID [7], MARS [8], Duke-Video [144], Duke-Tracklet [170], LPW [171] and LS-VID [136]). The statistics of these datasets are shown in Table 2. This survey only focuses on the general large-scale datasets for deep learning methods. A comprehensive summarization of the Re-ID datasets can be found in [169] and their website. Several observations can be made in terms of the dataset collection over recent years:
1. https://github.com/NEU-Gou/awesome-reid-dataset
TABLE 2: Statistics of some commonly used datasets for closed-world person Re-ID. "both" means that it contains both hand-cropped and detected bounding boxes. "C&M" means both CMC and mAP are evaluated.
[Table 2 body omitted: per-dataset statistics, grouped into image datasets and video datasets, with Dataset and Time columns.]
1) The dataset scale (both
Evaluation Metrics.
To evaluate a Re-ID system, Cumulative Matching Characteristics (CMC) [68] and mean Average Precision (mAP) [5] are two widely used measurements.

CMC-k (a.k.a., Rank-k matching accuracy) [68] represents the probability that a correct match appears in the top-k ranked retrieved results. CMC is accurate when only one ground truth exists for each query, since it only considers the first match in the evaluation process. However, the gallery set usually contains multiple ground truths in a large camera network, and CMC cannot completely reflect the discriminability of a model across multiple cameras.

Another metric, i.e., mean Average Precision (mAP) [5], measures the average retrieval performance with multiple ground truths. It was originally widely used in image retrieval. For Re-ID evaluation, it can address the issue of two systems performing equally well in finding the first ground truth (which might be an easy match, as in Fig. 4), but having different retrieval abilities for the other hard matches.

Considering the efficiency and complexity of training a Re-ID model, some recent works [138], [139] also report the number of FLoating-point OPerations (FLOPs) and the network parameter size as evaluation metrics. These two metrics are crucial when the training/testing device has limited computational resources.

We review the state of the art from both image-based and video-based perspectives, covering methods published in top computer vision venues over the past three years.
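A minimal sketch of both metrics from a query-gallery distance matrix follows. It assumes every query has at least one gallery match and omits the same-camera/junk filtering applied on real benchmarks such as Market-1501.

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, topk=5):
    """Compute CMC (Rank-1..Rank-topk) and mAP from a distance matrix.
    CMC credits only the first correct match per query; AP averages the
    precision at the rank of every ground truth."""
    cmc = np.zeros(topk)
    aps = []
    for q in range(len(q_ids)):
        order = np.argsort(dist[q])               # gallery sorted by distance
        matches = (g_ids[order] == q_ids[q])
        hit_ranks = np.flatnonzero(matches)       # 0-based ranks of all hits
        first = hit_ranks[0]
        if first < topk:
            cmc[first:] += 1                      # first hit fills Rank-k onward
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())             # AP over all ground truths
    return cmc / len(q_ids), float(np.mean(aps))
```

Note how the two metrics differ: a query whose first match is at rank 2 gets full Rank-2 CMC credit regardless of where the remaining hard matches land, while mAP is penalized by every late ground truth.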