Weakly Supervised Person Re-ID: Differentiable Graphical Learning and A New Benchmark
Guangrun Wang, Guangcong Wang, Xujie Zhang, Jianhuang Lai, Zhengtao Yu, Liang Lin
Abstract—Person re-identification (Re-ID) benefits greatly from the accurate annotations of existing datasets (e.g., CUHK03 [1] and Market-1501 [2]), which are quite expensive because each image in these datasets has to be assigned a proper label. In this work, we ease the annotation of Re-ID by replacing accurate annotation with inaccurate annotation, i.e., we group the images into bags in terms of time and assign a bag-level label to each bag. This greatly reduces the annotation effort and leads to the creation of a large-scale Re-ID benchmark called SYSU-30k. The new benchmark contains 30k categories of persons, which is about 20 times larger than CUHK03 (1.3k categories) and Market-1501 (1.5k categories), and 30 times larger than ImageNet (1k categories). It sums up to 29,606,918 images. Learning a Re-ID model with bag-level annotation is called the weakly supervised Re-ID problem. To solve this problem, we introduce a differentiable graphical model to capture the dependencies among all images in a bag and generate a reliable pseudo label for each person image. The pseudo label is further used to supervise the learning of the Re-ID model. When compared with the fully supervised Re-ID models, our method achieves state-of-the-art performance on SYSU-30k and other datasets. The code, dataset, and pretrained model will be available at https://github.com/wanggrun/SYSU-30k.

Index Terms—Weakly Supervised Learning, Person Re-identification, Deep Learning, Differentiable Graphical Learning
I. INTRODUCTION

PERSON re-identification (Re-ID) has been extensively studied in recent years [3]–[7]; it refers to the problem of recognizing persons across cameras. Solving the Re-ID problem has many applications in video surveillance for public safety. Existing attempts mainly focus on learning to extract robust and discriminative representations [8], [9], and on learning matching functions or metrics [8], [10]–[12] in a supervised manner. In the past four years, deep learning [13], [14] has been introduced to the Re-ID community and has achieved promising results.

However, a crucial bottleneck in building deep-learning-based models is that they typically require strongly annotated images during training. In the context of Re-ID, strong annotation refers to assigning a clear category label (i.e., person ID) to each person image, which is very expensive because it is difficult for annotators to remember persons who are strangers to them, particularly when the crowd is massive. Moreover, due to the wide range of human activities, many
The first two authors contributed equally and share first authorship. The authors are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, P. R. China. Email: [email protected]; [email protected]; [email protected]; [email protected]. Corresponding author: Liang Lin.
Fig. 1:
Problem definition for the weakly supervised Re-ID. (a) is an example of strong annotation, while (b) is an example of weak annotation. During testing, there is no difference between the fully and weakly supervised Re-ID problems, i.e., they both aim at finding the best-matching image for a given person image, as shown in (c).

images must be annotated in a short amount of time (see Fig. 1 (a)).

An alternative way to create a Re-ID benchmark is to replace image-level annotations with bag-level annotations. Suppose that there is a short video containing many person images; we do not need to know who is in each image. A cast of characters is enough. Here, the clear ID of each image is called the image-level label (Fig. 1 (a)), and the cast of characters is called the bag-level label (Fig. 1 (b)). Based on our experience, collecting bag-level annotations is approximately three times faster/cheaper than collecting image-level annotations. Once the dataset has been collected, the goal is to train a weakly supervised Re-ID model that is as powerful as the fully supervised one. We call this the weakly supervised Re-ID problem.

Formally, with strong supervision, the supervised learning task is to learn f : X → Y from a training set {(x_1, y_1), · · · , (x_i, y_i), · · · , (x_m, y_m)}, where x_i ∈ X is a person image and y_i ∈ Y is its exact person ID. By contrast, the weakly supervised learning task here is to learn f : B → L from a training set {(b_1, l_1), · · · , (b_j, l_j), · · · , (b_n, l_n)}, where b_j ∈ B is a bag of person images, i.e., b_j = {x_{j1}, x_{j2}, x_{j3}, · · · , x_{jp}}, and l_j ∈ L is its bag-level label, i.e., l_j = {y_{j1}, y_{j2}, · · · , y_{jq}}. Note that the mappings between {x_{j1}, x_{j2}, x_{j3}, · · · , x_{jp}} and {y_{j1}, y_{j2}, · · · , y_{jq}} are unknown. Furthermore, it is not necessary for the labels in {y_{j1}, y_{j2}, · · · , y_{jq}} to be accurate; i.e., they may be insufficient, redundant, or even incorrect.
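The two supervision regimes above can be sketched as data structures. This is an illustrative sketch only; the class and field names are our own, not from the paper's code:

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class StrongSample:
    """Fully supervised training: one (x_i, y_i) pair."""
    image: np.ndarray   # a person image x_i
    person_id: int      # its exact image-level label y_i

@dataclass
class WeakBag:
    """Weakly supervised training: one (b_j, l_j) pair."""
    images: List[np.ndarray] = field(default_factory=list)  # b_j = {x_j1, ..., x_jp}
    id_set: Set[int] = field(default_factory=set)           # l_j = {y_j1, ..., y_jq}
    # The mapping between images and id_set is unknown, and id_set
    # may be insufficient, redundant, or even incorrect.

# Example bag: three detected person images annotated only with the ID set {5, 9}
bag = WeakBag(images=[np.zeros((128, 64, 3))] * 3, id_set={5, 9})
```

Note that a `WeakBag` carries strictly less information than a list of `StrongSample`s: the per-image IDs are never stored.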
During testing, there is no difference between the fully and weakly supervised Re-ID problems (see Fig. 1 (c)).
Fig. 2:
An illustration of the proposed method for weakly supervised Re-ID. (a) shows a bag of images and their bag-level label. (b) represents the process of differentiable graphical learning. Using graphical modeling, we can obtain the pseudo image-level label for each image, as shown in (c).
Solving the weakly supervised Re-ID problem is challenging, because without the help of strongly labeled data, it is rather difficult to model the dramatic variances across camera views, such as the variances in illumination and occlusion conditions, which makes it very challenging to learn a discriminative representation. Existing Re-ID approaches cannot solve the weakly supervised Re-ID problem. Regardless of whether they are designed for computing cross-view-invariant features or distance metrics [1], [3], [4], [15]–[19], the existing models all assume that a strong annotation of each person image is available. This is also reflected in the existing benchmark Re-ID datasets, most of which provide a precise person category label for each image. None of them are designed to train a weakly supervised model.

Although weak annotations lack detailed clues for directly recognizing each person image, they usually contain global dependencies among images, which are very useful for modeling the variances of images across camera views. Hence, the weak annotations are as powerful as the strong annotations. Specifically, we introduce a differentiable graphical model to address the weakly supervised Re-ID problem, which includes several steps.
First, the person images are fed into the DNNs in terms of bags (Fig. 2 (a)) to obtain rough categorization probabilities. These categorization probabilities are modeled as the unary terms in a discriminative undirected probabilistic graphical model; see Fig. 2 (b). Second, we further model the relations between person images as the pairwise terms in the graph by considering their feature similarity, their apparent similarity, and their indices in different bags (representing the spatiotemporal information); see Fig. 2 (b). Note that both the unary term and the pairwise term are formulated as probabilities. These two terms are summed to form the refined categorization probability. Third, we maximize the refined categorization probabilities and obtain the pseudo image-level label for each image. Fourth, we use the generated pseudo labels to supervise the learning of the deep Re-ID model. Note that, different from traditional non-differentiable graphical models (e.g., CRFs), our proposed model is differentiable and thus can be integrated into DNNs and optimized by using stochastic gradient descent (SGD). All of the above steps are trained in an end-to-end fashion.

We summarize the contributions of this work in the following three aspects.

We take the first step to define the unexplored weakly
supervised Re-ID problem by replacing the image-level annotations in conventional Re-ID systems with bag-level annotations. This new problem is worth exploring because it significantly reduces the labor of annotation and offers the potential to obtain large-scale training data.

Fig. 3: Examples from our SYSU-30k dataset. (a) are person images in terms of bags and (b) are their bag-level weak annotations.

Since existing benchmarks largely ignore this weakly supervised Re-ID problem, we contribute a newly dedicated dataset called SYSU-30k to facilitate further research on the Re-ID problem. SYSU-30k contains 30k categories of persons, which is about 20 times larger than CUHK03 (1.3k categories) and Market-1501 (1.5k categories), and 30 times larger than ImageNet (1k categories). SYSU-30k contains 29,606,918 images. Moreover, SYSU-30k provides not only a large platform for the weakly supervised Re-ID problem but also a more challenging test set that is consistent with the realistic setting for standard evaluation. Fig. 3 shows some samples from the SYSU-30k dataset.

We introduce a differentiable graphical model to tackle the unreliable annotation dilemma in the weakly supervised Re-ID problem. When compared with the fully supervised Re-ID models, our method achieves state-of-the-art performance on SYSU-30k and other datasets.

The remainder of this work is organized as follows. Section II provides a brief review of the related work. Then, we introduce the annotation of SYSU-30k in Section III, followed by the weakly supervised Re-ID model in Section IV. The experimental results and comparisons are presented in Section V. Section VI concludes the work and presents some outlooks for future work.

II. RELATED WORKS
Re-ID has been widely investigated in the literature. Most recent works can be categorized into three groups: (1) extracting invariant and discriminative features [1], [3]–[5], [17], [20]–[24]; (2) learning a robust metric or subspace for matching [4], [9], [13], [15], [25], [26]; and (3) jointly learning both of the above [27]–[29]. Recently, there have been many works on the generalization of Re-ID, such as video-based Re-ID [30], image-to-video Re-ID [31], spatio-temporal Re-ID [32], partial/occluded Re-ID [33], [34], and natural language Re-ID [35]. However, all these methods assume that the training labels are strong. They are thus ineffective for solving the weakly supervised learning problem in our scenario.

Another approach that is free from the prohibitively high cost of manual labeling is unsupervised Re-ID [36]–[41]. These methods either use local saliency matching [40], [41] or resort to clustering models [36]. However, without the help of labeled data, it is difficult to model the dramatic variances across camera views, e.g., in representation learning and metric learning. Therefore, it is difficult for these pipelines to achieve high accuracies [42]–[46]. In contrast, the proposed weakly supervised Re-ID problem admits a good solution. Note that, compared to unsupervised Re-ID, the annotation effort of weakly supervised Re-ID is also very inexpensive.

Beyond Re-ID, although training deep models with weak annotations is a challenging problem, it has been partially investigated in the literature, e.g., for image classification [47], [48], semantic segmentation [49]–[51], and object detection [47], [52], [53]. Take semantic segmentation as an example: it has exploited various forms of weak annotation, including bounding-box labels [54], image-level labels [49], scribble labels [55], and language labels [56], [57]. Our method is related to these works in that our model is also based on the generation of pseudo labels.
However, the weakly supervised Re-ID problem has two unique characteristics that distinguish it from other weakly supervised learning tasks. (1) We cannot find a representative image for a permanent ID because people change their clothes at short intervals. The same person wearing different clothes may be regarded as two different persons. This results in an enormous number of person IDs. Therefore, the label for a weakly supervised Re-ID sample is fuzzier than in other tasks. (2) The entropy of the weakly supervised Re-ID problem is larger than that of other tasks. In the weakly supervised segmentation task, pixels in images share a certain rigidity and stability of motion, increasing the correctness rate of prediction, whereas in the weakly supervised Re-ID task, persons in video bags are more unordered and irregular. For these two reasons, it is considerably more challenging to re-identify a person in a weakly supervised scenario.

Apart from our model, there have been some uncertain-label learning models, among which one-shot/one-example Re-ID [58], [59] is the most related to ours. The main differences between their methods and ours are two-fold. First, in one-shot Re-ID, at least one accurate label for each person category is still required, while in our weakly supervised Re-ID, no accurate label is needed. Second, in our method there are bag-level labels as constraints to guide the estimation of the pseudo labels, ensuring that our generated pseudo labels are more reliable than those generated by one-shot Re-ID. Besides, [60] also proposes to cope with the uncertain-label Re-ID problem using multiple-instance multiple-label learning. However, similar to [59], at least one accurate label for each person category is still required to form the probe set in [60].
Note that, mathematically, [58]–[60] are all semi-supervised Re-ID but NOT weakly supervised Re-ID.

To address the weakly supervised Re-ID problem, we propose to generate the pseudo label for each image by introducing differentiable graphical learning [61], which is inspired by advances in semantic image segmentation [62], [63]. Recently, one classical graphical model, namely the conditional random field (CRF), has also been introduced to the Re-ID problem for deep similarity learning [64]. However, our method differs from [64] in two aspects. First, like all existing methods, [64] uses the CRF as a post-processing tool to refine the predictions in fully supervised learning, while our method fully exploits the supervision-independent property of graphical learning [62] to generate pseudo labels for our weakly supervised Re-ID learning. Second, different from traditional non-differentiable graphical models and [64], our proposed model directly formulates the graphical learning as an additional loss, which is differentiable with respect to the neural network parameters and thus can be optimized by using stochastic gradient descent (SGD).

Another problem that is closely related to ours is person search [35], [70], which aims to fuse the processes of person detection and Re-ID. There are two significant differences between weakly supervised Re-ID and person search. First, weakly supervised Re-ID only focuses on visual matching, which is reasonable because current human detectors are competent enough to detect persons. Second, the weakly supervised Re-ID problem enjoys the inexpensive effort of weak annotation, while person search still needs a strong annotation for each person image.

III. SYSU-30k DATASET
Data Collection.
No weakly supervised Re-ID dataset is publicly available. To fill this gap, we contribute a new Re-ID dataset named SYSU-30k, collected in the wild, to facilitate such studies. We download many short program videos from the Internet. TV programs are considered as our video source for two reasons.

TABLE I: A comparison of different Re-ID benchmarks. Categories: each person identity is a category. Scene: whether the video is taken indoors or outdoors. Annotation: whether image-level labels are provided. Images: the person images obtained by applying a human detector to the video frames; the person images in this work refer to the bounding boxes.

(a) Comparison with existing Re-ID datasets.

Dataset      CUHK03 [1]  Market-1501 [2]  Duke [65]  MSMT17 [66]      CUHK01 [67]  PRID [68]  VIPeR [4]  CAVIAR [69]  SYSU-30k
Categories   1,467       1,501            1,812      4,101            971          934        632        72           30,508
Scene        Indoor      Outdoor          Outdoor    Indoor, Outdoor  Indoor       Outdoor    Outdoor    Indoor       Indoor, Outdoor
Annotation   Strong      Strong           Strong     Strong           Strong       Strong     Strong     Strong       Weak
Cameras      2           6                8          15               10           2          2          2            Countless
Images       28,192      32,668           36,411     126,441          3,884        1,134      1,264      610          29,606,918

(b) Comparison with the ImageNet-1k dataset.

Dataset      ImageNet-1k  SYSU-30k
Categories   1,000        30,508
Images       1,280,000    29,606,918
Annotation   Strong       Weak
First, the pedestrians in a TV program video are often cross-view and cross-camera, because the scenes in TV program videos are generally recorded by many cameras for post-processing, and the cameras in a program are generally movable for following shots. Therefore, identifying the pedestrians in a TV program video is exactly a Re-ID problem in the wild. Second, the number of pedestrians in a program is suitable for annotation, i.e., neither too many nor too few. On average, each video contains 30.5 pedestrians walking around.

Our final raw video set contains 1,000 videos. The annotators are then asked to annotate the persons in the videos in a weak fashion. In particular, the videos are divided into 84,924 bags of arbitrary length. Then, the annotators record the pedestrians' identities for each bag. YOLO-v2 [71] is utilized for pedestrian bounding-box detection. Three annotators reviewed the detected bounding boxes and annotated person category labels for 20 days. Finally, 29,606,918 (≈ 30M) bounding boxes of 30,508 (≈ 30k) person categories are annotated. We then select 2,198 identities as the test set, leaving the rest as the training set. There is no overlap between the training set and the test set.

Dataset Statistics.
SYSU-30k contains 29,606,918 person images with 30,508 categories in total, which are further divided into 84,930 bags (for the training set only). Fig. 4 (a) summarizes the number of bags with respect to the number of images per bag, showing that each bag has 2,885 images on average. This histogram reveals the person image distribution of these bags in the real world, without any manual cleaning or refinement. Each bag is provided with an annotation of bag-level labels.

Comparison with Existing Re-ID Benchmarks.
We compare SYSU-30k with existing Re-ID datasets, including CUHK03 [1], Market-1501 [2], Duke [65], MSMT17 [66], CUHK01 [67], PRID [68], VIPeR [4], and CAVIAR [69]. Fig. 4 (b) and (c) plot the number of images and the person categories, respectively, indicating that SYSU-30k is much larger than the existing datasets. To evaluate the performance of the weakly supervised Re-ID approach, we randomly choose 2,198 person categories from SYSU-30k as the test set. These person categories are not used in training. We annotate an accurate person ID for each person image in the test set. We also compare the test set of SYSU-30k with existing Re-ID datasets. From Fig. 4 (b) and (c), we can observe that the test set of SYSU-30k is more challenging than those of the competitors in terms of both image number and person categories. Thanks to the above annotation fashion, the SYSU-30k test set adequately reflects the real-world setting and is consequently more challenging than existing Re-ID datasets. Therefore, SYSU-30k is not only a large benchmark for the weakly supervised Re-ID problem but also a significant standard platform for evaluating existing fully supervised Re-ID methods in the wild.

A further comparison of SYSU-30k with existing Re-ID benchmarks is shown in Table I (a), covering categories, scene, annotation, cameras, and image numbers (bounding boxes). From this comparison, we summarize the new features of SYSU-30k in the following aspects. First, SYSU-30k is the first weakly annotated dataset for Re-ID. Second, SYSU-30k is the largest Re-ID dataset in terms of both person categories and image number. Third, SYSU-30k is more challenging due to its many cameras, realistic indoor and outdoor scenes, and occasionally incorrect annotations. Fourth, the test set of SYSU-30k is not only suitable for the weakly supervised Re-ID problem but is also a significant standard platform for evaluating existing fully supervised Re-ID methods in the wild. Fig. 3 shows some training samples from the SYSU-30k dataset, and Fig.
5 shows some testing samples.

Comparison with ImageNet-1k. Beyond the Re-ID family, we also compare SYSU-30k with the well-known ImageNet-1k benchmark for general image recognition. As shown in Table I (b), SYSU-30k has several appealing advantages over ImageNet-1k. First, SYSU-30k has more object categories than ImageNet-1k, i.e., 30k vs. 1k. Second, SYSU-30k has a greater number of images, by one to two orders of magnitude, than ImageNet-1k. Third, SYSU-30k saves annotation effort thanks to the effective weak annotation.

Evaluation Protocol.
The evaluation protocol of SYSU-30k is similar to that of previous datasets [2]. As the SYSU-30k dataset is quite large, we do not need to repeat the random partitioning of the dataset into a training set and a test set ten times [1]. Instead, we fix the train/test partitioning. In the test set, each person category has one probe, resulting in 1,000 probes. As scalability is of the utmost importance for the practicability of Re-ID systems, we propose to challenge the scalability of a Re-ID model by providing a gallery set containing a vast volume of distractors for validation. Specifically, for each probe, there is only one matching person image as the correct answer in the gallery, while there are 478,730 mismatching person images as wrong answers in the gallery. Thus, the evaluation protocol amounts to searching for a needle in the ocean. This is consistent with the practicability of Re-ID tasks, because the police usually need to search a massive amount of video for a criminal. Given a query image sequence, all gallery items are assigned a similarity score. We then rank the gallery according to similarity to the query, based on which we calculate the CMC metric, which represents the expectation of the true match being found within the first n ranks, following [2].

Fig. 4: The statistics of SYSU-30k. (a) summarizes the number of bags with respect to the number of images per bag. (b) and (c) compare SYSU-30k with the existing datasets in terms of image number and person categories, respectively, for both the whole dataset and the test set.

IV. WEAKLY SUPERVISED RE-ID MODEL
We aim at learning a Re-ID model using weak supervision by exploiting the dependencies among the person images. We first discuss the supervision in traditional supervised Re-ID and in weakly supervised Re-ID (Section IV-A), then present a solution for weakly supervised Re-ID using differentiable graphical modeling (Section IV-B). The network architecture and implementation details are presented in Section IV-C and Section IV-D. Next, the computational complexity of our method is presented in Section IV-E. Finally, we discuss the relationship of our work to previous works in Section IV-F.

Fig. 5: Examples in the test set of SYSU-30k. Each pair represents a pair of images belonging to the same person category, but taken by different cameras. Left: query images; Right: gallery images.
A. From Supervised Re-ID to Weakly Supervised Re-ID
The training data for DNNs are usually organized in batches, which allows us to organize several bags of person images in a batch. Each bag has a flexible number of images. Hence, abundant inter-image relations and dependencies can be fully exploited to discover useful supervision information.

Let b denote a bag containing p images, i.e., b = {x_1, x_2, · · · , x_j, · · · , x_p}; y = {y_1, y_2, · · · , y_j, · · · , y_p} are the image-level labels, while l denotes the bag-level label. In a fully supervised Re-ID problem, the image-level labels y are known. The goal of fully supervised learning is to learn the model by minimizing the error between the category prediction and the image-level label for each person image.

On the contrary, in a weakly supervised Re-ID problem, although the bag-level label l is provided, the image-level labels y are unknown. One possible solution is to estimate a pseudo image-level label ŷ for each person image. Intuitively, we can first obtain an image-level label in the form of a probabilistic vector (denoted as Y) for each image from the bag-level label. Suppose l contains n categories of persons, and in total there are m person categories in the training set. Then the preliminary image-level label for each person image x_j can be deduced as follows:

Y_j = [Y_j^1, . . . , Y_j^k, . . . , Y_j^m]^T, where Y_j^k = { 1/n, if k ∈ l; 0, otherwise,   (1)

Eqn. 1 reveals the restricting role of a bag-level label. Therefore, in the following we refer to Eqn. 1 as the bag constraint for simplicity. By fully exploiting the bag constraint and the dependencies among the images in a bag, we can further deduce the final pseudo image-level labels ŷ from the preliminary image-level labels Y. Then, ŷ are leveraged to supervise the learning of the model in the same manner as in fully supervised learning.

A Re-ID problem is different from an image classification problem because the training set and the test set in a Re-ID problem do not share the same person categories.
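As a concrete illustration of the bag constraint in Eqn. (1), the following numpy sketch builds the preliminary probabilistic vector Y_j for one image in a bag; the function name and arguments are hypothetical, not from the paper's code:

```python
import numpy as np

def preliminary_label(bag_label, num_categories):
    """Eqn. (1): Y_j^k = 1/n if category k is in the bag-level label l
    (with n = |l|), and 0 otherwise; num_categories is m."""
    Y = np.zeros(num_categories)
    n = len(bag_label)
    Y[sorted(bag_label)] = 1.0 / n
    return Y

# A bag labeled with identities {2, 5, 7} out of m = 10 training categories
Y_j = preliminary_label({2, 5, 7}, 10)
```

The resulting vector sums to one and is zero outside the bag's identities, which is exactly the restricting role of the bag-level label described above.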
As a result, the similarity between the probe images and the gallery images must be measured. Let x_i be a probe image and x_j be a gallery image. The similarity of x_i and x_j is measured by calculating the Euclidean distance between the features of x_i and x_j learned by the DNNs.

B. Weakly Supervised Re-ID with Differentiable Graphical Learning
In this section, we discuss the mechanism and formulation of using differentiable graphical learning to generate pseudo image-level labels for the person images.
Graphically Modeling Re-ID. Our graph is an undirected graph in which each node represents a person image x_i in a bag, and each edge represents the relation between person images, as illustrated in Fig. 6. Assigning a label y_i to each node x_i incurs a cost. For example, imposing the labels 'Person 1', 'Person 2', and 'Person 3' on x_1, x_2, and x_3 leads to an energy cost of E(y_1 = 1; y_2 = 2; y_3 = 3 | x_1; x_2; x_3), which is abbreviated as E(y_1; y_2; y_3 | x_1; x_2; x_3) or E(y | x) for notational simplicity. Let i denote an image index with respect to a bag. Formally, the energy function of our graph is defined as

E(y | x) = Σ_{∀i ∈ U} Φ(y_i | x_i) + Σ_{∀i,j ∈ V} Ψ(y_i, y_j | x_i; x_j),   (2)

where the first sum is the unary term, the second sum is the pairwise term, and U and V denote the set of nodes and the set of edges, respectively. Φ(y_i | x_i) is the unary term measuring the cost of assigning label y_i to a person image x_i. For instance, if an image belongs to the first category rather than the second one, we should have Φ(y_i = 1 | x_i) < Φ(y_i = 2 | x_i). Moreover, Ψ(y_i, y_j | x_i; x_j) is the pairwise term that measures the penalty of assigning labels to a pair of person images (x_i, x_j). Mathematically, graphical modeling is employed to smooth noisy (uncertain) person ID predictions. The unary term in Eqn. 2 performs the prediction based on individual nodes, while the pairwise term in Eqn. 2 couples different nodes, favoring same-label assignments for nodes that are proximal in bag index and similar in appearance. In summary, Eqn. 2 serves to clean up the spurious predictions of classifiers learned in a weakly supervised manner.

Fig. 6: Graphical model for the generation of pseudo image-level labels in a bag of person images. The unary terms are estimated by the deep neural networks, while the pairwise terms are obtained by considering the similarity of features, the raw image appearance, and the bag-level label.

Unary Term.
Intuitively, the unary terms represent per-image classifications. The unary term in Eqn. 2 is typically defined as

Φ(y_i | x_i) = −log(Ỹ_i[y_i]), where Ỹ_i = Y_i ⊙ P_i,   (3)

where P_i (the DNN output) is the label-assignment probability for the person image x_i as computed by a DNN, and Y_i (the bag constraint) is the preliminary image-level label defined in Eqn. 1, indicating that the estimation is subject to the bag-level label, as illustrated in Fig. 6. Here, ⊙ denotes the element-wise product, and [·] denotes vector indexing.

The maximum a posteriori (MAP) labeling is good enough to be a candidate pseudo label thanks to the capacity of the DNNs. However, as the output of the unary classifier for each image is produced independently from the outputs of the classifiers for the other images, the unary term alone is generally noisy and inconsistent. Interactions through pairwise terms are required.

Pairwise Term.
The pairwise terms represent a set of smoothness constraints. As in [61], we use the following expression:

Ψ(y_i, y_j | x_i; x_j) = ζ(y_i, y_j) · Y_i[y_i] Y_j[y_j] · exp(−‖I_i − I_j‖² / σ²),   (4)

where ζ(y_i, y_j) is the label compatibility, Y_i[y_i] Y_j[y_j] is the bag constraint, and the Gaussian kernel on the RGB values I_i and I_j measures the appearance similarity. The hyperparameter σ controls the scale of the Gaussian kernel. The kernel forces person images with similar colors and deep features to have the same labels. Similar to the unary term, the pairwise terms are also bounded by the bag-level annotations Y_i and Y_j, enabling more reliable estimations. The pairwise terms are widely known to improve accuracy, indicating that they can provide nontrivial knowledge (e.g., structural context dependencies) that is not captured by the unary term. A simple label compatibility function ζ(y_i, y_j) ∈ {0, 1} in Eqn. 4 is given by the Potts model, i.e.,

ζ(y_i, y_j) = { 0, if y_i = y_j; 1, otherwise,   (5)

which introduces a penalty for similar images that are assigned different labels. While the simple model in Eqn. 2 works well in practice, it is non-differentiable and thus incompatible with DNNs. We can instead learn a differentiable version of Eqn. 2 that takes the deep model into account, as described in the following.

Bag Constraint.
As mentioned above, both the unary and pairwise terms are constrained by the bag-level annotations Y_i and Y_j. In fact, the bag-level annotation contains extra knowledge that helps to improve the estimation. For example, if the estimator mismatches a person image to a category that is not in the bag-level annotation, the estimation is undoubtedly incorrect. The estimation will then be corrected by matching the person image to the category in the bag-level annotation with the largest prediction score. Furthermore, if some person categories in the weak annotation are absent from the prediction, the proposed method will encourage a portion of the person images to be assigned to such categories to improve the performance. In this way, knowledge of the weakly labeled data can be fully exploited. Specifically, given a bag of images and their bag-level label, we refine the DNN predictions by element-wise multiplication of P_i by the bag-level weak annotation Y_i, as shown in the unary term in Eqn. 3. Similarly, we also impose Y_i and Y_j in the pairwise term in Eqn. 4.

One may argue that it is difficult to achieve perfect performance using bag-level labels because the mapping from input vectors to output vectors is ambiguous. However, there is a natural smoothness assumption in videos that should not be ignored: person IDs in bags change slowly within a short time, e.g., an image-level label y_i in bag b_T may also be in bag b_{T+1}. A large number of bags with overlapping IDs naturally exist in a video and thus partially disclose the underlying mapping from input vectors to output vectors, which sheds light on the competitive performance of weakly supervised Re-ID. As a simple example, if b_T contains {y_i, y_j} and b_{T+1} contains {y_j, y_k}, then the two bags share {y_j}. In this case, our method can easily infer which image in b_T belongs to {y_i} and which image in b_{T+1} belongs to {y_k}.

Deduction of Pseudo Image-level Labels.
By minimizing the Gibbs energy of Eqn. 2, we can obtain the pseudo image-level label for each person image, i.e.,

ŷ_i = arg min_{y_i ∈ {1, 2, 3, ···, m}} E(y_i | x_i), (6)

where {1, 2, 3, ···, m} denotes all the person categories in the training set. Here ŷ_i is the final pseudo image-level label generated by our approach. Once such labels are generated, they can be used to update the network parameters as if they were authentic ground-truth labels.

Fig. 7: Differentiable graphical modeling with deep neural networks, where x, Y, P, and ŷ denote the input images, the bag-level label, the preliminary categorization, and the refined categorization, respectively. (a) is the stepwise graphical modeling for the weakly supervised Re-ID model, while (b) is our proposed end-to-end differentiable graphical model. The implementation of our differentiable graphical model consists of two losses, i.e., an unsupervised loss for pseudo-label estimation and a loss supervised by the pseudo labels. Here black lines denote forward-propagation, while blue lines denote back-propagation.

Differentiablizing the Graphical Learning.
The above weakly supervised Re-ID model is not end-to-end, because we must first use an external graphical learning solver to obtain the pseudo labels and then use another solver to train the DNNs under the supervision of the pseudo labels (see Fig. 7 (a)). To enable an end-to-end optimization, we propose to make our graphical learning differentiable and compatible with DNNs (see Fig. 7 (b)). We first investigate the mechanism of a non-differentiable graphical model. As illustrated in Fig. 7 (a), a non-differentiable graphical model consists of three steps.
First, the preliminary categorization score ỹ is obtained through a DNN. Second, the Gibbs energy in Eqn. 2 is minimized by appropriately (optimally) re-assigning labels to the images, subject to the appearance similarity, the preliminary categorization scores, and the bag constraint.
Third, the re-assigned labels are considered as the pseudo labels and are used to supervise the learning of the Re-ID model.

Assigning labels in the second step listed above is non-differentiable, which makes the graphical model incompatible with the DNN. To fill this gap, a relaxed form of Eqn. 2 is needed. Specifically, Eqn. 2 is rewritten as:

L_graph(x) = Σ_{∀i∈U} Φ̂(x_i) + Σ_{∀i,j∈V} Ψ̂(x_i, x_j), (7)

where the first sum is the unary term and the second the pairwise term, and where we use continuous versions Φ̂ and Ψ̂ to approximate the discrete Φ and Ψ. Formally, Φ̂ and Ψ̂ are defined as:

Φ̂(x_i) = −log( max_{k ∈ {1, 2, ···, m}} Y_i[k] ⊙ P_i[k] ), (8)

Ψ̂(x_i, x_j) = −exp( −‖I_i − I_j‖ / σ ) (Y_i ⊙ P_i)^T log(Y_j ⊙ P_j). (9)
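As a concrete illustration, the relaxed loss of Eqns. 7–9 can be sketched in plain Python on probability lists rather than tensors (a minimal sketch; the function names, the restriction to unordered pairs, and the small epsilon that keeps the logarithm finite when a category is masked out by the bag annotation are our own assumptions):

```python
import math

def unary_hat(P_i, Y_i):
    # Eqn. 8: negative log of the largest bag-constrained prediction score.
    return -math.log(max(p * y for p, y in zip(P_i, Y_i)))

def pairwise_hat(P_i, Y_i, P_j, Y_j, I_i, I_j, sigma=1.0, eps=1e-12):
    # Eqn. 9: RGB appearance kernel times a cross entropy between the
    # bag-constrained predictions of the two images.
    kernel = math.exp(-math.dist(I_i, I_j) / sigma)
    return kernel * -sum(pi * yi * math.log(pj * yj + eps)
                         for pi, yi, pj, yj in zip(P_i, Y_i, P_j, Y_j))

def graph_loss(P, Y, I, sigma=1.0):
    # Eqn. 7: unary terms over all images plus pairwise terms over image pairs.
    n = len(P)
    loss = sum(unary_hat(P[i], Y[i]) for i in range(n))
    loss += sum(pairwise_hat(P[i], Y[i], P[j], Y[j], I[i], I[j], sigma)
                for i in range(n) for j in range(n) if i < j)
    return loss
```

In the actual model these operations run on the DNN outputs P, so the gradient of the graphical loss flows back through the network, which is precisely what makes the module end-to-end trainable.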
Fig. 8: Diagram of our approach, which consists of three main stages, i.e., feature extraction, rough Re-ID, and refined Re-ID by the differentiable graphical module. The solid black flow denotes the testing stage, while the black dotted flow denotes the training stage. For simplification, the back-propagation flow is omitted. The loss function is marked with a red arrow. (N: batch size; C: channel; class num: number of categories.)

The differences between Eqn. 3 and Eqn. 8 are summarized as follows. The y_i in Φ(y_i) is replaced with the x_i in Φ̂(x_i). In the non-differentiable graphical model, all possible values of y are fed into the energy function, and the y that leads to the lowest energy is considered the optimal solution. Differently, in a differentiable graphical model, we feed the images x into the DNN and obtain the prediction y. We use an arg max function to obtain the prediction, which is consistent with the nature of DNNs, i.e., during the testing phase, we directly obtain the prediction from the output of the DNN without the graphical losses. Besides, there is one more difference between Eqn. 4 and Eqn. 9: we use a cross-entropy term −(Y_i ⊙ P_i)^T log(Y_j ⊙ P_j) to approximate the non-differentiable term ζ(y_i, y_j) Y_i Y_j in Eqn. 4.

C. Overall Neural Network Architecture
The network architectures for training and testing are shown in Fig. 8, where the black dotted lines denote the training flow, and the solid black lines denote the inference flow. It is noteworthy that we only perform graphical modeling in the training stage, for two reasons.
First, the graphical module is introduced to generate pseudo labels to supervise the training, which requires a bag-level label as a constraint. However, there is no bag-level label in the testing stage.
Second, due to the specificity of the Re-ID problem, the images in the inference stage are not organized in the form of a bag. For example, only a query image and a set of gallery images are provided in inference, requiring the Re-ID system to calculate the similarity between them. As a result, there is no bag-level dependency among the testing images to exploit. Thus, performing graphical modeling is infeasible in the inference stage. The implementation of our weakly supervised Re-ID model consists of three main modules, including (a) a feature embedding module built upon a ResNet-50 network followed by two fully connected layers, (b) a rough Re-ID module using a fully connected layer as the classifier, and (c) a refined Re-ID module that considers both the rough results and the bag-level weak annotation to perform graphical modeling. These modules are shown in Fig. 8.
Feature Embedding Module.
Many of the current best-performing Re-ID models use multi-scale features as feature embeddings [72], which guarantees a robust feature representation and thus boosts the performance. However, in this work, our focus is the mechanism of the weakly supervised Re-ID model alone, rather than other tricks. Therefore, we simply take ResNet-50 [73] as the backbone without any feature pyramid [72]. Our feature embedding is similar to [74]. Specifically, the last layer of the original ResNet-50 is discarded, and two new fully connected layers are added; the first is followed by batch normalization [75], a Leaky ReLU [76], and dropout [77]. This module is shown in Fig. 8 (a).
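The data flow through this module can be traced shape-by-shape (a sketch only; the 512-dimensional embedding width follows the n×512 annotation in Fig. 8 and should be treated as an assumption, since the exact width is not restated in the text):

```python
def embedding_shapes(n, class_num, embed_dim=512):
    """Trace the feature shapes of the modules in Fig. 8 (a)-(b):
    ResNet-50 without its last layer, then two new fully connected
    layers (the first followed by BN, Leaky ReLU, and dropout)."""
    shapes = {}
    shapes["backbone"] = (n, 2048)                 # pooled ResNet-50 feature
    shapes["fc1+bn+lrelu+dropout"] = (n, embed_dim)  # feature embedding
    shapes["fc2 (classifier)"] = (n, class_num)    # rough Re-ID logits
    return shapes
```

For example, a minibatch of 90 CUHK03 images with 767 training identities would flow as 90×2048 → 90×512 → 90×767.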
Rough Re-ID Module.
To investigate the behavior of the weakly supervised Re-ID alone, we use the standard softmax classifier rather than the triplet similarity [74] for rough Re-ID. Specifically, our model has another fully connected layer on top of the feature embedding module, with as many units as there are person categories (denoted as ‘class num’ in Fig. 8). Then, a softmax cross-entropy loss is employed. The person categorization score (e.g., y in Fig. 8) is considered as the rough Re-ID estimation, indicating the possibility of a person ID being present in a bag b. This module is shown in Fig. 8 (b).

Refined Re-ID Module.
Here, we aim to estimate a pseudo image-level label for each person image by refining the previous estimation results. The refinement benefits from three aspects:

1) Rough Re-ID score. As mentioned above, the rough Re-ID module returns a preliminary categorization result, which can be considered as a baseline for further improvement.

2) Appearance. Although the rough Re-ID score is taken into consideration, it is a high-level abstraction of the images that lacks details. As compensation, we propose to include more low-level information (i.e., RGB appearance) in our refinement.

3) Bag constraint. Finally, we consider the bag-level labels. Intuitively, we eliminate any possibility of assigning a person image a person category that is absent from the bag-level annotation. By contrast, we encourage a person image to be assigned a person category that is present in the bag-level annotation.

The above three aspects are fed into our graphical modeling, as shown in Eqns. 8 and 9. Once such labels are generated, they can be used to update the network parameters as if they were authentic ground-truth labels. The diagram of our approach is presented in Fig. 8.
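Putting the three aspects together, a bag-level refinement step might look like the following sketch (a greedy, non-differentiable illustration of our own: each candidate label is scored by the bag-masked rough score plus an appearance-weighted agreement term with the other images in the bag; all names and the epsilon guard are assumptions):

```python
import math

def refine_bag(P, Y, I, sigma=1.0, eps=1e-12):
    """Assign a pseudo label to each image in a bag using
    (1) rough Re-ID scores P, (2) RGB appearance I, (3) bag annotation Y."""
    n, m = len(P), len(P[0])
    labels = []
    for i in range(n):
        best, best_cost = 0, float("inf")
        for k in range(m):
            if Y[i][k] == 0:          # bag constraint: absent category is forbidden
                continue
            cost = -math.log(P[i][k] + eps)   # unary: rough Re-ID score
            for j in range(n):                # pairwise: appearance smoothness
                if j == i:
                    continue
                sim = math.exp(-math.dist(I[i], I[j]) / sigma)
                # similar-looking images are pushed toward agreeing on k
                cost -= sim * math.log(P[j][k] + eps)
            if cost < best_cost:
                best, best_cost = k, cost
        labels.append(best)
    return labels
```

For instance, with two dissimilar images whose rough scores favor different identities, the sketch returns one label per image, and a category absent from the bag annotation can never be selected.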
D. Optimization and Implementation Details
The optimization of our model is a joint process of estimating pseudo labels and solving the DNN model supervised by the pseudo labels. Once the pseudo labels are obtained, the weakly supervised Re-ID problem becomes a fully supervised one. Specifically, given the pseudo person IDs, we can compute the gradient of the overall losses with respect to the DNN parameters. With the back-propagation algorithm, the gradients from the loss propagate backward through all layers of the DCNN. Thus, all parameters of our weakly supervised model can be learned in an end-to-end manner.
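As a compact numerical illustration of the objective detailed in Eqns. 10–12 below (a sketch; the max-shift inside the softmax is our addition for numerical stability, and the function names are ours):

```python
import math

def softmax(z):
    # Eqn. 11: normalize logits z into a probability vector P.
    mx = max(z)                          # max-shift for numerical stability
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cls_loss(logits, pseudo_label):
    # Eqn. 10 for a single image: cross entropy against the one-hot pseudo label,
    # which reduces to the negative log-probability of that label.
    return -math.log(softmax(logits)[pseudo_label])

def total_loss(l_cls, l_graph, w_cls=1.0, w_graph=0.5):
    # Eqn. 12 with the 1 : 0.5 weights reported in the implementation details.
    return w_cls * l_cls + w_graph * l_graph
```

Summing `cls_loss` over the n images of a bag gives Eqn. 10, and `total_loss` combines it with the graphical loss of Eqn. 7.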
Loss Function.
The optimization objective of our approach consists of two loss functions, including a graphical modeling loss L_graph and a classification/re-identification loss L_cls, as illustrated in Fig. 7 (b), where the back-propagation is represented with blue lines. L_cls can be a simple softmax cross-entropy loss with the pseudo label ŷ as supervision:

L_cls = − Σ_{i=1}^{n} (g_onehot(ŷ_i))^T log(P_i), (10)

where g_onehot(ŷ_i) denotes a function that transforms ŷ_i into a one-hot vector, and n denotes the number of images in a bag. Here P_i denotes the preliminary categorization probability, which is the logits normalized by the softmax function, i.e.,

P_i[k] = exp(z_k) / Σ_{k'=1}^{m} exp(z_{k'}), (11)

where m denotes the number of person categories in the training set and z is the output logit. The combination of these two loss functions is a simple linear combination with predefined loss weights. In our implementation, the loss weights are set as 1 : 0.5. The total loss L is:

L = w_cls L_cls + w_graph L_graph, (12)

where w_cls and w_graph denote the two loss weights.

Bag organization.
In our implementation, an image batch b contains images of n person categories, and each person category has k images. With the image bags in each batch, we can perform graphical modeling to capture the image dependencies within a bag, thus enabling weakly supervised learning.

Other Implementation Details.
As mentioned in Section IV-C, our approach employs ResNet-50 as the backbone, whose parameters are initialized by pretraining on the one-thousand-class ImageNet classification task. The other parameters are initialized by sampling from a normal distribution. For SGD, we use a minibatch of 90 images and an initial learning rate of 0.01 (0.1 for the fully connected layers), multiplying the learning rate by 0.1 after a fixed number of iterations. We use a momentum of 0.9 and a weight decay of 0.0005. Training on SYSU-30k takes approximately ten days on a single GPU (i.e., an NVIDIA TITAN X). During training, all of the images are resized and then center-cropped with a small random perturbation. Random mirroring is also adopted in our experiments.

E. Computational Complexity
We provide a further discussion of the computational complexity of our weakly supervised Re-ID model. The extra time cost introduced by our method is negligible for two reasons. In the training phase, the extra time cost relates only to the estimation of pseudo labels, i.e., the graphical learning module. Conventionally, graphical learning needs many iterations to find a solution, and the process is thus time-consuming. However, our approach formulates the differentiable graphical learning as an inherent part of the DNN. Therefore, in each training step of the DNN, there is only one iteration of inference in our graphical module, which is consistent with the back-propagation algorithm. This makes our graphical module very efficient; in the experimental section, we will show that our training brings only a negligible additional time cost. In the testing phase, there is no extra time cost because the pseudo-label estimation component is disabled.
F. Relationship to Previous Works
In the following, we compare our weakly supervised Re-ID with previous works on Re-ID with uncertain labels, including unsupervised Re-ID and semi-supervised Re-ID. In general, our weakly supervised Re-ID not only requires cheap annotation effort but also achieves high identification accuracy. The details are presented below.
Unsupervised Re-ID.
To get rid of the prohibitively high cost of manual labeling, unsupervised Re-ID methods [36]–[41] use either local saliency matching models [40], [41] or clustering models [36]. However, without the help of labeled data, it is difficult to model the dramatic variances across camera views, e.g., in representation learning and metric learning. Therefore, it is difficult for these pipelines to obtain high accuracies [42]–[46]. In contrast, our weakly supervised Re-ID problem admits a good solution. Note that compared to unsupervised Re-ID, the annotation effort of our weakly supervised Re-ID is also very inexpensive.
Semi-supervised Re-ID.
One-shot/one-example methods [58], [59] propose to reduce the annotation effort by annotating only one example for each person ID. The main differences between their methods and ours are two-fold. First, one-shot Re-ID requires at least one accurate label for each person category, while our weakly supervised Re-ID needs no accurate label at all. Second, our method has a bag-level label as a constraint when estimating the pseudo labels, ensuring that our generated pseudo labels are more reliable than those generated by one-shot Re-ID.

We would also like to acknowledge the contribution of previous work [60] that matches a target person image with a bag-level gallery video using multiple-instance multiple-label learning. However, similar to [59], at least one accurate label (of the target person) for each person category is still required to form the probe set in [60]. Hence, mathematically, [60] still belongs to semi-supervised Re-ID but NOT weakly supervised Re-ID. Experimentally, Sections V-B2 and V-B3 will compare the performance of our weakly supervised Re-ID with previous works.

Fig. 9: The effectiveness of our differentiable graphical learning module. Here we show the errors between the rough predictions and the weak annotations in the form of a confusion matrix. Each grid indicates a bag of 10 categories, with a total sum of 760 categories, which is approximately equivalent to the number of person categories in the full training set (i.e., 767 categories). (a) Epoch = 10; (b) Epoch = 30; (c) Epoch = 50; (d) Epoch = 70.

V. EXPERIMENTS
We evaluate the weakly supervised Re-ID approach in two aspects. Section V-A conducts an extensive ablation study, including the effectiveness of the differentiable graphical learning module, the scalability of our method, the impact of bag diversity, and the compatibility with fully supervised learning tricks. Section V-B compares the Re-ID accuracy with state-of-the-art methods and analyzes the computational complexity.
Two Simulated Datasets
In addition to the proposed SYSU-30k dataset, two further simulated datasets are introduced to evaluate the effectiveness of our method by modifying existing datasets. Specifically, we replace the strong annotations on the training sets of the PRID2011 [68] and CUHK03 [1] datasets with weak annotations, while their test sets are kept unchanged. For a fair comparison (e.g., using the same images for both fully and weakly supervised learning), we generate the weak annotations from the strong annotations. This includes two steps. First, each bag is simulated by randomly selecting several images and packaging them together. Then, the weak labels are easily obtained by summarizing the strong annotations, e.g., four image-level labels { Alice, Bob, Alice, Carol } are summarized as a bag-level label { Alice, Bob, Carol }. We denote by n categories/bag the setting in which a bag contains n person categories. Note that unless otherwise stated, our weakly supervised learning setting is two categories/bag.

Originally, PRID2011 [68] contains 200 person categories appearing in at least two camera views and is further randomly divided into training/test sets following the general settings [31], i.e., both having 100 categories. The CUHK03 dataset [1] is one of the largest databases for Re-ID. This dataset contains 14,096 images of 1,467 pedestrians collected from 5 different pairs of camera views. Each person is observed by two disjoint camera views, with an average of 4.8 images per view. We follow the new standard setting [78] of using CUHK03 [1], and a training set (including 767 persons) is obtained without overlap. For the training sets of both the PRID2011 and CUHK03 benchmarks, person categories are further packed into bags, and bag-level labels are extracted from the image-level labels.
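The two-step simulation above amounts to only a few lines of code (a sketch of our own; bag-level labels are represented as sorted identity sets, following the {Alice, Bob, Alice, Carol} → {Alice, Bob, Carol} example):

```python
import random

def simulate_weak_annotation(image_labels, bag_size, seed=0):
    """Turn strongly annotated images into (bag, bag-level label) pairs.
    image_labels: list of per-image person IDs (the strong annotation)."""
    rng = random.Random(seed)
    indices = list(range(len(image_labels)))
    rng.shuffle(indices)                       # step 1: randomly group images into bags
    bags = [indices[i:i + bag_size] for i in range(0, len(indices), bag_size)]
    # step 2: summarize the image-level labels into a bag-level label set
    return [(bag, sorted({image_labels[i] for i in bag})) for bag in bags]
```

Running it on the four images of the example yields a single bag whose bag-level label is the three-identity set, so no image-level label survives into the weak annotation.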
This enables us to examine the proposed weakly supervised Re-ID problem. Note that the test sets of the two datasets are the same as the original ones, as the definition states that during testing, there is no difference between the fully and weakly supervised Re-ID problems (Fig. 1 (c)).

Market-1501.
In addition to PRID2011 and CUHK03, we also compare our method with existing state-of-the-art methods on the Market-1501 dataset [2]. Market-1501 is another widely used large-scale Re-ID benchmark, which contains 32,668 images of 1,501 pedestrians captured from 6 different cameras. The dataset is split into two parts: 12,936 images with 751 identities for training and 19,732 images with 750 identities for testing. In testing, 3,368 hand-drawn images with 750 identities are used as the probe set to identify the true identities in the testing set.
Evaluation Metric
For PRID2011, CUHK03, and Market-1501, the test set is further divided into a gallery set of images and a probe set. We use the cumulative matching characteristic (CMC) [79] as the evaluation metric. For SYSU-30k, the evaluation metric is described in Section III.

A. Ablation Study
First, we present ablation studies to reveal the benefits of each main component of our method.

TABLE II: Ablation studies of the proposed weakly supervised Re-ID method. random: each bag contains a random number of person categories, which reflects the real-world state. reranking: see [78], one of the effective tricks frequently used in fully supervised Re-ID problems. ⋆ fully supervised: when each bag contains only one category, the weakly supervised Re-ID problem degrades into a fully supervised Re-ID problem. ∗ full training set: the overall training set of CUHK03 contains 767 person categories.

(a) Impact of Bag Diversity on PRID2011
categories/bag | Rank-1 | Rank-5 | Rank-10
1 (⋆ fully supervised) | 71.8 | 91.2 | 95.9
2 | 68.0 | 87.5 | 94.8
3 | 66.1 | 86.4 | 92.3
10 | 49.5 | 73.9 | 82.2

(b) Impact of Bag Diversity on CUHK03
categories/bag | Rank-1 | Rank-5 | Rank-10
1 (⋆ fully supervised) | 67.5 | 88.2 | 91.8
2 | 61.0 | 82.0 | 87.0
3 | 59.4 | 80.7 | 86.7
10 | 55.2 | 79.3 | 84.5
1) Effectiveness of the Graphical Learning Module:
First, we investigate the effectiveness of the refinement operation. As discussed in Section IV-C, the graphical learning module plays the role of refining the person categorization results by correcting the errors between the rough Re-ID predictions and the weak annotations, which forms the basis of generating pseudo image-level labels. During training, we visualize the person categorization errors between the rough predictions and the weak annotations in Fig. 9. This experiment is conducted on CUHK03 using the setting of 10 categories/bag. Fig. 9 displays the errors between the rough predictions and the weak annotations in the form of a confusion matrix. Each grid indicates a bag of 10 categories, summing up to 760 categories in total, which approximates the number of person categories in the full training set (i.e., 767 categories). We have two major observations from Fig. 9. First, there is a significant gap between the rough predictions and the weak annotations (see Fig. 9 (a) or (b)), indicating that the rough Re-ID results alone are not competent for generating pseudo image-level labels. More importantly, the gap between the rough predictions and the bag-level annotations is non-negligible. This result indicates that it is necessary to refine the person categorization results by correcting the errors between the rough predictions and the bag-level weak annotations.
Second, the gap between the rough predictions and the weak annotations becomes smaller as the training iteration increases (from 10 epochs in Fig. 9 (a) to 70 epochs in Fig. 9 (d)). Specifically, when the training finishes, the gap to the ground truth becomes significantly small. This result indicates that the problem is adequately addressed by the differentiable graphical learning module, which provides extra knowledge for the learning of the Re-ID model.
2) Scalability of Our Approach:
We have shown that a Re-ID model can be learned with weakly labeled data. Next, we investigate whether increasing the amount of weakly labeled data will improve the performance of weakly supervised learning. The entire CUHK03 training set is randomly partitioned into three subsets containing 67, 300, and 300 person categories, respectively. We evaluate the scalability of our approach by gradually adding one subset in training. The rank-1 accuracy is reported in Table II (e). For example, the first model is trained with the first 67 person categories, and the number of person categories is increased to 367 in the second model. The third model is trained with the full CUHK03 training set (i.e., 767 categories). Table II (e) shows that the accuracies increase when we increase the scale of the training data in CUHK03. For instance, our approach trained with the full training data achieves the best performance and outperforms the other two models by 44.7% and 17.4%, respectively.
3) Impact of Bag Diversity:
Intuitively, if a bag contains more person categories, it is more challenging to learn a weakly supervised Re-ID model because of the increase in entropy. Next, we investigate the performance with respect to such bag-internal diversity. We conduct experiments on PRID2011 and CUHK03.
PRID2011.
In Table II (a) and Fig. 10 (a), we compare five options, i.e., each bag containing 1, 2, 3, 10, or a random number of person categories, respectively. In particular, when each bag has only one person category, the weakly supervised Re-ID problem degrades into a fully supervised one. We have three major observations from Table II (a) and Fig. 10 (a).
First, the model that is trained with weakly labeled samples achieves comparable accuracies to those trained with strongly labeled data. For example, in Table II (a), the rank-1 accuracies of the fully and weakly supervised learning are 71.8% and 68.0%, respectively. This result is very significant, as a weak annotation costs tens of times less money and time than a strong annotation. More importantly, the rank-10 accuracies are almost the same, i.e., 95.9% vs. 94.8%.

Fig. 10: Analysis on different bag diversities. Cat/bag: the number of person categories in each bag. Random Cat/bag: each bag contains a random number of person categories, which reflects the real-world state. Fully supervised: each bag contains only one person category. In this case, the weakly supervised problem degrades into a fully supervised one. (a) PRID2011; (b) CUHK03.

Second, the accuracy of the weakly supervised methods gradually decreases as the number of categories in each bag increases. In particular, the rank-1 accuracy of our approach drops by 18.5% when increasing the number of categories per bag from 2 to 10. We argue that the increase in uncertainty causes this optimization difficulty. When the category number within a bag increases, the uncertainty in the label assignment also increases. This means that the probability of adequately assigning an image-level label to each person image decreases, making the problem more challenging.
Third, it is noteworthy that the random version has appealing performance (69.3%).

CUHK03.
A similar phenomenon can also be observed on the CUHK03 benchmark. In Table II (b) and Fig. 10 (b), we compare the five settings consistent with those for the PRID2011 dataset. Table II (b) and Fig. 10 (b) show the behaviors of the weakly supervised methods. First, the model trained with weakly annotated data achieves comparable accuracy to those trained with fully annotated data (61.0% vs. 67.5%). Second, our approach suffers from an increased number of categories per bag, suggesting that such an increase in uncertainty is a fundamental problem.
4) Compatibility with Fully Supervised Learning Tricks:
Intuitively, a weakly supervised Re-ID problem is likely to be upper-bounded by fully supervised learning with all annotations. Next, we investigate the performance of our approach with respect to models with different fully supervised learning capacities.
PRID2011.
We first evaluate two different fully supervised learning baseline models. Both share the same architectures, as described in Section IV-C, except that the first one is a naked CNN framework, while the second one employs a reranking post-process (denoted as ‘+reranking’ in Table II (c)). Table II (c) shows the top-1, top-5, and top-10 accuracies of the fully supervised learning results, which form the baseline of this section.

Fig. 11: Comparison with state-of-the-art methods on (a) the standard PRID2011 test set and (b) SYSU-30k. Weakly Sup.: the proposed weakly supervised Re-ID approach. *Fully Sup.: each bag contains only one person category. In this case, the weakly supervised problem degrades into a fully supervised one. We thus consider the latter as the baseline of our weakly supervised Re-ID approach.

Next, we evaluate the weakly supervised learning scenario. The setting is similar to the above fully supervised setting, except that all of the image-level annotations are replaced with bag-level annotations in the training set. In this scenario, we present a horizontal comparison and a vertical comparison. In the horizontal comparison, we focus on the performance gap between fully and weakly supervised learning. Once again, we observe that the rank-1 accuracy of using weak annotation approaches that of using strong annotation in both options. In the vertical comparison, we compare the two weakly supervised learnings built on different baselines. The results are summarized in Table II (c). A finding of this experiment can be observed: weakly supervised learning with a stronger baseline (‘weakly supervised +reranking’) yields better performance. For example, in the weak annotation setting, ‘weakly supervised +reranking’ yields 68.0%, compared to 39.9% obtained by ‘weakly supervised’, a relative improvement of 70.4%. This comparison verifies the compatibility of our method with existing frameworks; i.e., the existing tricks (e.g., reranking) used to improve fully supervised learning can also be applied to the weakly supervised Re-ID.
CUHK03.
Similar observations can also be obtained on CUHK03 in Table II (d). The approach with reranking [78] achieves better accuracies than that without reranking in both fully supervised learning (67.5% vs. 52.1%) and weakly supervised learning (61.0% vs. 44.0%), once again proving that the existing tricks used to improve fully supervised learning can also be applied to the weakly supervised Re-ID.
B. Comparison with the State-of-the-Art
In this section, we compare our weakly supervised approach with the best-performing fully supervised, semi-supervised, and unsupervised methods.
1) Accuracy on PRID2011:
In Table III (a) and Fig. 11 (a), we compare the results of our model with the current best models. Note that although our method was trained in the weakly supervised scenario, we still evaluate it in the same setting as conventional methods do. This leaves our approach at a disadvantage. Five representative image-to-image Re-ID models are used as the competing methods: the KISSME distance learning method [15], MAHAL, L2, XQDA [8], and P2SNet [31]. For KISSME, MAHAL, L2, and XQDA, deep features [80] are utilized to represent a person image. For P2SNet, we train the model in the image-to-video setting but sample one frame from each video to formulate the image-to-image setting. The above settings are all consistent with the traditional settings, e.g., [31]. Our method achieves excellent performance, even surpassing state-of-the-art fully supervised methods. Specifically, it achieves a rank-1 accuracy of 68.0%. We also observe that this result surpasses all of the above competitive methods, such as KISSME, MAHAL, L2, and XQDA, even though they are trained with all available strong annotations.
2) Accuracy on CUHK03:
Our weakly supervised Re-ID is compared with state-of-the-art methods in two groups, including the traditional fully supervised Re-ID and the unsupervised Re-ID.
Fully supervised Re-ID.
In Table III (b), we compare our method with the current best models. Eleven representative state-of-the-art methods are used as competing methods, including BOW+XQDA [2], PUL [81], LOMO+XQDA [8], IDE(R) [82], IDE+DaF [83], IDE+XQ+reranking [78], PAN, DPFL [84], and newly proposed methods such as SVDNet [85] and TriNet [86]. All settings of the above methods are consistent with the common training setting. Our approach achieves very competitive accuracy. For example, our approach achieves a rank-1 accuracy of 61.0%. We also highlight that this result surpasses many of the current competitive methods, such as BOW+XQDA [2], PUL [81], LOMO+XQDA [8], IDE [82], IDE+DaF [83], IDE+XQ+reranking [78], PAN, DPFL [84], and SVDNet [85], which are trained with all available strong annotations. This result once again verifies the effectiveness of our method. To validate the superiority of our weakly supervised Re-ID over previous annotation-saving Re-ID works, we further compare with unsupervised Re-ID methods.
Unsupervised Re-ID.
In Table III (b), we compare our method with the unsupervised Re-ID models. Three representative state-of-the-art methods are used as competing methods, including CAMEL [36], PatchNet [87], and PAUL [87]. The results in Table III (b) show that our weakly supervised Re-ID obtains a significant gain over unsupervised Re-ID methods. For instance, our method outperforms the best-performing model, PAUL [87], by a large margin (i.e., 8.7%). Note that compared to unsupervised Re-ID, the annotation effort of our weakly supervised Re-ID is also very inexpensive. These results verify the effectiveness of our method again.
3) Accuracy on Market-1501:
Our weakly supervised Re-ID is compared with state-of-the-art methods in three groups, including the traditional fully supervised Re-ID, the unsupervised Re-ID, and the semi-supervised Re-ID.
Fully supervised Re-ID.
In Table III (c), we compare our method with the fully supervised Re-ID models. Twelve representative state-of-the-art methods are used as competing methods, including MSCAN [88], DF [89], SSM [90], SVDNet [85], GAN [65], PDF [91], TriNet [74], TriNet+Era.+reranking [86], PCB [72], VPM [92], JDGL [93], and AANet [94]. All settings of the above methods are consistent with the common training setting. Our approach achieves very competitive accuracy. For example, our approach achieves a rank-1 accuracy of 88.6%. We also highlight that this result surpasses many of the current competitive methods, such as MSCAN [88], DF [89], SSM [90], SVDNet [85], GAN [65], PDF [91], TriNet [74], and TriNet+Era.+reranking [86], which are trained with all available strong annotations. This result once again verifies the effectiveness of our method. To validate the superiority of our weakly supervised Re-ID over previous annotation-saving Re-ID works, we further compare with unsupervised and semi-supervised Re-ID.

TABLE III: Comparison with state-of-the-art methods. Weakly supervised: the proposed weakly supervised Re-ID approach. *Fully supervised: each bag contains only one person category. In this case, the weakly supervised problem degrades into a fully supervised one. We thus consider the latter as the baseline of our weakly supervised Re-ID approach. ‡CUHK03: pretrained on CUHK03. ‡Market-1501: pretrained on Market-1501. reranking: see [78].

(a) Comparison on the standard PRID2011 test set
Supervision | Method | Rank-1 | Rank-5 | Rank-10
Fully | KISSME [15] | 18.2 | 33.2 | 44.5
Fully | MAHAL | 16.0 | 32.5 | 43.6
Fully | L2 | 25.0 | 46.6 | 52.8
Fully | XQDA [8] | 39.0 | 66.6 | 77.8
Fully | P2SNet [31] | 60.5 | 88.9 | 97.5
Fully | *Fully supervised | 71.8 | 91.2 | 95.9
Weakly | Weakly supervised | 68.0 | 87.5 | 94.8

(b) Comparison on the standard CUHK03 test set
Supervision | Method | Rank-1
Fully | BOW+XQDA [2] | 6.4
Fully | PUL [81] | 9.1
Fully | LOMO+XQDA [8] | 12.8
Fully | IDE(R) [82] | 21.3
Fully | IDE+DaF [83] | 26.4
Fully | IDE+XQ.+reranking [78] | 34.7
Fully | PAN | 36.3
Fully | DPFL [84] | 40.7
Fully | SVDNet [85] | 41.5
Fully | TriNet+Era. [86] | 55.5
Fully | TriNet+Era.+reranking [86] | 64.4
Fully | *Fully supervised | 67.5
Weakly | Weakly supervised | 61.0

(c) Comparison on the standard Market-1501 test set
Supervision | Method | Rank-1
Fully | MSCAN [88] | 80.31
Fully | DF [89] | 81.0
Fully | SSM [90] | 82.21
Fully | SVDNet [85] | 82.3
Fully | GAN [65] | 83.97
Fully | PDF [91] | 84.14
Fully | TriNet [74] | 84.92
Fully | TriNet+Era.+reranking [86] | 85.45
Fully | PCB [72] | 93.4
Fully | VPM [92] | 93.0
Fully | JDGL [93] | 94.8
Fully | AANet [94] | 92.4
Fully | *Fully supervised | -
Weakly | Weakly supervised | 88.6

(d) Comparison on SYSU-30k
Supervision | Method | Rank-1
Fully | DARI [29], ‡CUHK03 | 11.2
Fully | DF [13], ‡CUHK03 | 10.3
Fully | TriNet [86], ‡CUHK03 | 20.1
Weakly | Weakly supervised | -

Unsupervised Re-ID.
In Table III (c), we compare our method with the unsupervised Re-ID models. Eleven representative state-of-the-art methods are used as competing methods, including CAMEL [36], TAUDL [95], UTAL [45], UDA [42], MAR [43], DECAMEL [96], ECN [97], PAUL [87], HHL [98], Distilled [99], and PUL [44]. The results in Table III (c) show that our weakly supervised Re-ID method obtains a significant gain over the unsupervised Re-ID methods. For instance, our method outperforms the best-performing model, UDA [42], by a large margin (i.e., 12.8%). Note that, compared to unsupervised Re-ID, the annotation effort of our weakly supervised Re-ID is also very inexpensive. These results verify the effectiveness of our method again.
Semi-supervised Re-ID.
In Table III (c), we compare our method with the semi-supervised Re-ID models. Five representative state-of-the-art methods are used as competing methods, including SPACO [100], HHL [98], Distilled [99], One Example [58], and Many Examples [58]. The results show that our weakly supervised Re-ID method obtains a significant gain over the semi-supervised Re-ID methods. For instance, our method outperforms the best-performing model, "Many Examples" [58], by a large margin (i.e., 6.1%). Note that, compared to semi-supervised Re-ID, the annotation effort of our weakly supervised Re-ID is also very inexpensive. These results verify the effectiveness of our method again.
4) Accuracy on SYSU-30k: As SYSU-30k is the only weakly supervised Re-ID dataset and our method is the only weakly supervised Re-ID method, we propose to compare the conventional fully supervised Re-ID models with our weakly supervised method by using transfer learning. Specifically, three representative fully supervised Re-ID models, including DARI [29], DF [13], and TriNet [86], are first trained on CUHK03. Then, they are used to perform cross-dataset evaluation on the test set of SYSU-30k. In contrast, our weakly supervised Re-ID model is trained on the training set of SYSU-30k with weak annotations and then tested on the test set of SYSU-30k.

Table III (d) and Fig. 11 (b) show the results of the comparisons. We can observe that our approach achieves state-of-the-art performance (26.9% vs. 20.1%), even though our method is trained in a weakly supervised manner while the competitors are trained with full supervision. The success may be attributed to two reasons. First, our model is quite effective due to the graphical modeling that generates reliable pseudo labels as compensation for the absence of strong labels. Second, the large-scale SYSU-30k dataset provides rich knowledge that improves the capacity of our model, even though SYSU-30k is annotated weakly.

In summary, the comparisons provide a promising conclusion, i.e., learning a Re-ID model using less annotation effort is possible.

TABLE IV: Computational complexity of weakly and fully supervised Re-ID. secs / 100 images: the time of forward-passing 100 images in the testing stage, or of one forward-backward cycle in the training stage, when the batch size is 100.

           weakly (secs / 100 images)   fully (secs / 100 images)
Testing    0.0559                       0.0559
Training   0.2453                       0.2448
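Throughput figures of the secs / 100 images kind in Table IV can be gathered with a harness along the following lines. This is a generic sketch, not the paper's measurement code: the helper name and warm-up count are our assumptions, and timing a GPU model additionally requires synchronizing the device before reading the clock.

```python
import time

def secs_per_batch(forward_fn, batch, n_batches=10, warmup=2):
    """Average wall-clock seconds to process one batch (e.g., 100 images).

    forward_fn: callable taking one batch (e.g., a network forward pass).
    A few warm-up runs are discarded so one-off setup costs (allocation,
    JIT compilation, cache warming) do not distort the average.
    """
    for _ in range(warmup):
        forward_fn(batch)
    start = time.perf_counter()
    for _ in range(n_batches):
        forward_fn(batch)
    return (time.perf_counter() - start) / n_batches

# Toy stand-in for a forward pass over a batch of 100 images.
timing = secs_per_batch(lambda b: sum(b), list(range(100)))
```

With a real network, `forward_fn` would wrap the model's inference (or a forward-backward step for the training-time figure), and on GPU one would call the framework's device-synchronization primitive before each `perf_counter()` read.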
5) Computational Complexity:
Table IV compares the computational time of Re-ID in the context of weak supervision to that in the context of full supervision, in terms of time cost per 100 images. For a fair comparison, both methods are individually trained on the same desktop with one Titan X GPU. As shown in the table, the weakly and fully supervised Re-ID methods have similar computational costs. Specifically, in the testing phase, both methods share the same computational cost. Even in the training phase, our method runs only about 1.002× slower than the fully supervised Re-ID (0.2453 vs. 0.2448 seconds per 100 images on a Titan X).

VI. CONCLUSION
We have considered a new and more realistic Re-ID challenge: the weakly supervised Re-ID problem. To address this new problem, we proposed a graphical model, specifically designed for the weakly supervised setting, that captures the dependencies among the images in each weakly annotated bag. We further proposed a weakly annotated Re-ID dataset (i.e., SYSU-30k) to facilitate future research, which is currently the largest Re-ID benchmark. Future work will include building more effective weakly supervised Re-ID models.

REFERENCES

[1] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152–159.
[2] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[3] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2360–2367.
[4] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in European Conference on Computer Vision. Springer, 2008, pp. 262–275.
[5] I. Kviatkovsky, A. Adam, and E. Rivlin, “Color invariants for person reidentification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1622–1634, 2013.
[6] R. Zhao, W. Ouyang, and X. Wang, “Person re-identification by salience matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2528–2535.
[7] ——, “Unsupervised salience learning for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3586–3593.
[8] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[9] W.-S. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative distance comparison,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 649–656.
[10] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, “Salient color names for person re-identification,” in European Conference on Computer Vision. Springer, 2014, pp. 536–551.
[11] B. Ma, Y. Su, and F. Jurie, “Covariance descriptor based on bio-inspired features for person re-identification and face verification,” Image and Vision Computing, vol. 32, no. 6, pp. 379–390, 2014.
[12] G. Wang, J. Lai, Z. Xie, and X. Xie, “Discovering underlying person structure pattern with relative local distance for person re-identification,” arXiv preprint arXiv:1901.10100, 2019.
[13] S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
[14] G. Wang, X. Xie, J. Lai, and J. Zhuo, “Deep growing learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2812–2820.
[15] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2288–2295.
[16] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, “Learning locally-adaptive decision functions for person verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3610–3617.
[17] B. Ma, Y. Su, and F. Jurie, “Local descriptors encoded by fisher vectors for person re-identification,” in Computer Vision–ECCV 2012. Workshops and Demonstrations. Springer, 2012, pp. 413–422.
[18] A. Mignon and F. Jurie, “Pcca: A new approach for distance learning from sparse pairwise constraints,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2666–2672.
[19] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, “Local fisher discriminant analysis for pedestrian re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3318–3325.
[20] R. Zhao, W. Ouyang, and X. Wang, “Learning mid-level filters for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 144–151.
[21] G. Wang, P. Luo, X. Wang, L. Lin et al., “Kalman normalization: Normalizing internal representations across network layers,” in Advances in Neural Information Processing Systems, 2018, pp. 21–31.
[22] Y. Li, G. Wang, L. Lin, and H. Chang, “A deep joint learning approach for age invariant face verification,” in CCF Chinese Conference on Computer Vision. Springer, 2015, pp. 296–305.
[23] W. Liang, G. Wang, J. Lai, and J. Zhu, “M2m-gan: Many-to-many generative adversarial transfer learning for person re-identification,” arXiv preprint arXiv:1811.03768, 2018.
[24] G. Wang, K. Wang, and L. Lin, “Adaptively connected neural networks,” 2019.
[25] B. J. Prosser, W.-S. Zheng, S. Gong, T. Xiang, and Q. Mary, “Person re-identification by support vector ranking,” in
BMVC, vol. 2, no. 5, 2010, p. 6.
[26] Y. Li, G. Wang, L. Nie, Q. Wang, and W. Tan, “Distance metric optimization driven convolutional neural network for age invariant face recognition,” Pattern Recognition, vol. 75, pp. 51–62, 2018.
[27] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang, “Cross-domain visual matching via generalized similarity measure and feature learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1089–1102, 2017.
[28] T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representations with domain guided dropout for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1249–1258.
[29] G. Wang, L. Lin, S. Ding, Y. Li, and Q. Wang, “Dari: Distance metric and representation integration for person verification,” in AAAI, 2016, pp. 3611–3617.
[30] J. You, A. Wu, X. Li, and W.-S. Zheng, “Top-push video-based person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1345–1353.
[31] G. Wang, J. Lai, and X. Xie, “P2snet: Can an image match a video for person re-identification in an end-to-end way?” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[32] G. Wang, J. Lai, P. Huang, and X. Xie, “Spatial-temporal person re-identification,” 2019.
[33] W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong, “Partial person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4678–4686.
[34] J. Zhuo, Z. Chen, J. Lai, and G. Wang, “Occluded person re-identification,” arXiv preprint arXiv:1804.02792, 2018.
[35] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, “Person search with natural language description,” arXiv preprint arXiv:1702.05729, 2017.
[36] H.-X. Yu, A. Wu, and W.-S. Zheng, “Cross-view asymmetric metric learning for unsupervised person re-identification,” in IEEE International Conference on Computer Vision, 2017.
[37] E. Kodirov, T. Xiang, and S. Gong, “Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification,” in BMVC, vol. 3, 2015, p. 8.
[38] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo, “Person re-identification by iterative re-weighted sparse ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1629–1642, 2015.
[39] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, “Unsupervised cross-dataset transfer learning for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1306–1315.
[40] H. Wang, S. Gong, and T. Xiang, “Unsupervised learning of generative topic saliency for person re-identification,” 2014.
[41] R. Zhao, W. Oyang, and X. Wang, “Person re-identification by saliency learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 356–370, 2017.
[42] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang, “Unsupervised domain adaptive re-identification: Theory and practice,” arXiv: Computer Vision and Pattern Recognition, 2018.
[43] H. Yu, W. Zheng, A. Wu, X. Guo, S. Gong, and J. Lai, “Unsupervised person re-identification by soft multilabel learning,” CVPR, 2019.
[44] H. Fan, L. Zheng, C. Yan, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, p. 83, 2018.
[45] M. Li, X. Zhu, and S. Gong, “Unsupervised tracklet person re-identification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019.
[46] Y. Chen, X. Zhu, and S. Gong, “Deep association learning for unsupervised video person re-identification,” BMVC, p. 48, 2018.
[47] D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. V. Der Maaten, “Exploring the limits of weakly supervised pretraining,” European Conference on Computer Vision, pp. 185–201, 2018.
[48] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Feifei, “The unreasonable effectiveness of noisy data for fine-grained recognition,” European Conference on Computer Vision, pp. 301–320, 2016.
[49] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a dcnn for semantic image segmentation,” arXiv preprint arXiv:1502.02734, 2015.
[50] P. Luo, G. Wang, L. Lin, and X. Wang, “Deep dual learning for semantic image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2718–2726.
[51] G. Wang, P. Luo, L. Lin, and X. Wang, “Learning object interactions and descriptions for semantic image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5859–5867.
[52] Z. Yan, J. Liang, W. Pan, J. Li, and C. Zhang, “Weakly- and semi-supervised object detection with expectation-maximization algorithm,” arXiv preprint arXiv:1702.08740, 2017.
[53] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.
[54] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in
Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.
[55] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3159–3167.
[56] R. Hu, M. Rohrbach, and T. Darrell, “Segmentation from natural language expressions,” in European Conference on Computer Vision. Springer, 2016, pp. 108–124.
[57] L. Lin, G. Wang, R. Zhang, R. Zhang, X. Liang, and W. Zuo, “Deep structured scene parsing by learning with image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2276–2284.
[58] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Bian, and Y. Yang, “Progressive learning for person re-identification with one example,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2872–2881, 2019.
[59] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning,” pp. 5177–5186, 2018.
[60] J. Meng, S. Wu, and W. Zheng, “Weakly supervised person re-identification,” arXiv: Computer Vision and Pattern Recognition, 2019.
[61] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
[62] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[63] Q. Huang, M. Han, B. Wu, and S. Ioffe, “A hierarchical conditional random field model for labeling and segmenting images of street scenes,” pp. 1953–1960, 2011.
[64] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang, “Person re-identification with deep similarity-guided graph neural network,” in European Conference on Computer Vision. Springer, 2018, pp. 508–526.
[65] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” arXiv preprint arXiv:1701.07717, vol. 3, 2017.
[66] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer gan to bridge domain gap for person re-identification,” in Proc. CVPR, 2018, pp. 79–88.
[67] W. Li, R. Zhao, and X. Wang, “Human reidentification with transferred metric learning,” in Asian Conference on Computer Vision. Springer, 2012, pp. 31–44.
[68] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and discriminative classification,” in Scandinavian Conference on Image Analysis. Springer, 2011, pp. 91–102.
[69] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, “Custom pictorial structures for re-identification,” in BMVC, vol. 1, no. 2. Citeseer, 2011, p. 6.
[70] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “End-to-end deep learning for person search,” arXiv preprint arXiv:1604.01850, 2016.
[71] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” Computer Vision and Pattern Recognition, pp. 6517–6525, 2017.
[72] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 480–496.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[74] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
[75] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
[76] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[77] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[78] Z. Zhong, L. Zheng, D. Cao, and S. Li, “Re-ranking person re-identification with k-reciprocal encoding,” 2017, pp. 3652–3661.
[79] D. Gray, S. Brennan, and H. Tao, “Evaluating appearance models for recognition, reacquisition, and tracking,” in Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3, no. 5. Citeseer, 2007, pp. 1–7.
[80] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in European Conference on Computer Vision. Springer, 2016, pp. 868–884.
[81] H. Fan, L. Zheng, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” arXiv preprint arXiv:1705.10444, 2017.
[82] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification: Past, present and future,” arXiv preprint arXiv:1610.02984, 2016.
[83] R. Yu, Z. Zhou, S. Bai, and X. Bai, “Divide and fuse: A re-ranking approach for person re-identification,” arXiv preprint arXiv:1708.04169, 2017.
[84] Y. Chen, X. Zhu, and S. Gong, “Person re-identification by deep learning multi-scale representations,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2590–2600.
[85] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” arXiv preprint arXiv:1703.05693, 2017.
[86] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.
[87] Q. Yang, H.-X. Yu, A. Wu, and W.-S. Zheng, “Patch-based discriminative feature learning for unsupervised person re-identification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[88] D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in CVPR, 2017, pp. 384–393.
[89] L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in ICCV, 2017, pp. 3239–3248.
[90] S. Bai, X. Bai, and Q. Tian, “Scalable person re-identification on supervised smoothed manifold,” in CVPR, 2017.
[91] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in ICCV. IEEE, 2017, pp. 3980–3989.
[92] Y. Sun, Q. Xu, Y. Li, C. Zhang, Y. Li, S. Wang, and J. Sun, “Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification,” CVPR, 2019.
[93] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz, “Joint discriminative and generative learning for person re-identification,” CVPR, 2019.
[94] C.-P. Tay, S. Roy, and K.-H. Yap, “Aanet: Attribute attention network for person re-identifications,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[95] M. Li, X. Zhu, and S. Gong, “Unsupervised person re-identification by deep learning tracklet association,” ECCV, pp. 772–788, 2018.
[96] H. Yu, A. Wu, and W. Zheng, “Unsupervised person re-identification by deep asymmetric metric embedding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019.
[97] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, “Invariance matters: Exemplar memory for domain adaptive person re-identification,” CVPR, 2019.
[98] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval model hetero- and homogeneously,” in ECCV, 2018.
[99] A. Wu, W.-S. Zheng, X. Guo, and J.-H. Lai, “Distilled person re-identification: Towards a more scalable system,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[100] F. Ma, D. Meng, Q. Xie, Z. Li, and X. Dong, “Self-paced co-training,” pp. 2275–2284, 2017.
Guangrun Wang is currently a Ph.D. candidate in the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. He received the B.E. degree from Sun Yat-sen University in 2014. From 2015 to 2017, he was a visiting scholar with the Department of Information Engineering, The Chinese University of Hong Kong. His research interests include machine learning and computer vision. He is the recipient of the 2018 Pattern Recognition Best Paper Award and one ESI Highly Cited Paper.

Guangcong Wang is pursuing a Ph.D. degree in the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. He received the B.E. degree in communication engineering from Jilin University (JLU), Changchun, China, in 2015. His research interests are computer vision and machine learning. He has published several works on person re-identification, weakly supervised learning, semi-supervised learning, and deep learning.
Xujie Zhang is currently an undergraduate student in the School of Data and Computer Science, Sun Yat-sen University (SYSU), Guangzhou, China. He majors in computer science. His research interests are computer vision and machine learning. Currently, he aims at developing algorithms for person re-identification, especially weakly supervised person re-identification.
Jianhuang Lai received the Ph.D. degree in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor, where he is currently a Professor in the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, and their applications. He has published over 200 scientific papers in academic journals and conferences on image processing and pattern recognition.