In Defense of the Triplet Loss Again: Learning Robust Person Re-Identification with Fast Approximated Triplet Loss and Label Distillation
Ye Yuan, Wuyang Chen, Yang Yang and Zhangyang Wang
{ye.yuan, wuyang.chen, atlaswang}@tamu.edu, [email protected]
Texas A&M University, Walmart Technology
https://github.com/TAMU-VITA/FAT
Abstract
The comparative losses (typically, triplet loss) are appealing choices for learning person re-identification (ReID) features. However, the triplet loss is computationally much more expensive than the (practically more popular) classification loss, limiting its wider usage in massive datasets. Moreover, the abundance of label noise and outliers in ReID datasets may also put the margin-based loss in jeopardy. This work addresses the above two shortcomings of triplet loss, extending its effectiveness to large-scale ReID datasets with potentially noisy labels. We propose a fast-approximated triplet (FAT) loss, which provably converts the point-wise triplet loss into its upper-bound form, consisting of a point-to-set loss term plus a cluster compactness regularization. It preserves the effectiveness of triplet loss, while leading to linear complexity in the training set size. A label distillation strategy is further designed to learn refined soft labels in place of the potentially noisy labels, from only an identified subset of confident examples, through teacher-student networks. We conduct extensive experiments on the three most popular ReID benchmarks (Market-1501, DukeMTMC-reID, and MSMT17), and demonstrate that FAT loss with distilled labels leads to ReID features with remarkable accuracy, efficiency, robustness, and direct transferability to unseen datasets.
1. Introduction
Person re-identification (ReID) has attracted tremendous attention owing to its vast applications in video surveillance, public safety, and so on. Given a person image spotted by one camera, ReID aims to accurately match that probe image against a large amount of gallery images, taken by other cameras at other timestamps. The dramatic visual appearance variations of the same person, as caused by different poses, view angles, illuminations, and backgrounds, constitute serious challenges for learning robust identity representations. Most existing ReID algorithms use a classification loss to train their feature learning backbones [48, 41, 42, 19, 3, 45].
Figure 1: Illustrative comparison of standard triplet loss (a) and FAT loss (b). The former compares point-to-point distances, while the latter compares point-to-set distances while regularizing all cluster sets to be compact. The solid arrows depict the "push and pull" effect of triplet loss and the point-to-set term of FAT loss. The dashed arrows represent the compactness regularization of FAT loss. See details in Section 3.
However, ReID is essentially an "open-ended" retrieval problem rather than closed-set classification, e.g., the training and testing sets usually have no overlapping identity classes. The learned feature extractor should be able to generalize to matching unseen identities. The testing performance is evaluated by the precision and recall of the matching instances, rather than classification accuracy. Therefore, classification-driven learning could be misaligned with the end goal. Instead, the comparative losses [31, 7, 25, 49], which compare the distances between two sample pairs, are naturally better choices, as empirically validated by a handful of works [21, 19, 4, 40, 5]. Among many, the triplet loss [13], which maximizes the margin between the intra-class distance and the inter-class distance, has been mostly used in ReID, in order to explicitly embed the relative orders between right and wrong matches (i.e., the correct matches should always be closer to the query than the wrong ones).

However, an important downside of triplet loss lies in its computational expensiveness, which prohibits its wide usage in large-scale ReID applications. A naive triplet loss that compares every possible pair of training samples will incur cubic complexity w.r.t. the training set size [13]. Also, triplet loss relatively quickly learns to correctly map most trivial triplets, rendering a large fraction of all triplets uninformative. Applying triplet loss with randomly selected triplets can accelerate training but quickly stagnates, or becomes difficult to converge. Hard sample mining [43, 46] has recently become the standard practice in using triplet loss, to select only "informative" (a.k.a. hard) pairs rather than all pairs to enforce the loss. However, it runs the risk of causing sample bias [43], and often appears fragile to outliers.
The vanilla triplet loss needs to evaluate overall $PK(K-1)(PK-K)$ possible triplets, where $K$ denotes the average number of images per identity and $P$ the total number of identities [13]. The time complexity can be reduced to $PK(PK-1) + PK$ when hard sample mining is used.

In this paper, we propose a new fast-approximated triplet (FAT) loss to trim down the computational cost of triplet loss without hampering its effectiveness. Viewing all images belonging to the same identity class as a cluster, the proposed FAT loss re-defines a triplet to include an anchor, its corresponding cluster centroid, and the centroid of another cluster. The main idea of FAT loss is to replace point-to-point distances with point-to-cluster distances, through an upper-bound relaxation of the triplet form. Such a relaxation simultaneously requires the query to be closest to its ground-truth-cluster centroid, and enforces each cluster to have a compact radius. The FAT loss thus has linear complexity w.r.t. the training set size.

Another downside of triplet loss, as well as many other margin-based losses, lies in their fragility to label noise. Unfortunately, ReID datasets are notorious for having many noisy labels and outliers, such as label flipping, mislabeling, and multi-person coexistence, due to the tedious manual annotation process. The proposed FAT loss can alleviate label noise to some extent, by averaging all samples within the same cluster. To provide further improved robustness, we consider a distillation network to first generate soft pseudo labels for each sample, associated with its confidence. Then we use those soft labels in place of the original labels to feed into the FAT loss, where each individual sample's contribution to the model update is re-weighted by its label confidence.

In sum, we strive to make triplet loss a more effective, efficient, and robust choice for ReID, via multi-fold efforts:

• We propose a fast-approximated triplet (FAT) loss to remarkably improve the efficiency over the standard triplet loss, with linear complexity in the training set size. It is derived by relaxing triplet loss to its upper-bound form, and operates without hard sample mining.

• We are the first to demonstrate that explicitly considering and handling label noise can further boost ReID performance. A distillation network is presented to assign soft labels for samples in place of the original (potentially noisy) hard labels. Combined with FAT loss, a more robust ReID feature can be learned.

• We conduct extensive experiments on the three most popular ReID benchmarks, and demonstrate that FAT loss with learned soft labels leads to comparable or superior ReID performance compared with triplet loss and other state-of-the-art baselines, with remarkably higher efficiency than triplet loss. We also observe improved robustness and direct transferability to unseen data.
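To make the complexity gap concrete, the counts above can be plugged in with illustrative numbers (the P and K below roughly match the Market1501 training split; they are for illustration only, not figures from our experiments):

```python
# Triplet-count comparison for the complexity formulas in the text.
# P, K are illustrative (roughly Market1501: 751 identities, ~17 images each).
P, K = 751, 17
N = P * K                                  # training set size

vanilla = P * K * (K - 1) * (P * K - K)    # all valid triplets: cubic in N
hard_mining = P * K * (P * K - 1) + P * K  # pairwise comparisons: quadratic in N
fat = P * K                                # one point-to-centroid term per anchor: linear in N

print(f"N = {N}: {vanilla:.1e} (vanilla) vs {hard_mining:.1e} (hard mining) vs {fat:.1e} (FAT)")
```

Even at this modest dataset scale, the vanilla count is in the billions while the FAT count stays at the number of training images.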
2. Related Work
Triplet Loss and Hard Sample Mining
The triplet loss was first introduced in FaceNet [31] by Google to train face embeddings for the recognition task, where softmax cross entropy loss failed to handle a variable number of classes. The goal of triplet loss is to maximize the inter-class variation while minimizing the intra-class variation. Triplet loss is formulated as (1) below, where the triplet is defined as an anchor sample $a$, a positive sample $p$ from the same class, and a negative sample $n$ from a different class ($y_a, y_p, y_n$ denote the class labels of $a, p, n$, respectively):

$$\mathcal{L}_{tri} = \sum_{\substack{a,p,n \\ y_p = y_a,\; y_n \neq y_a}} \max\{d(a,p) + m - d(a,n),\; 0\} \quad (1)$$

FaceNet picked a random negative for every pair of anchor and positive, which was very time-consuming. Later on, [13] improved the efficiency of triplet loss for the ReID task by proposing two triplet selection strategies: batch all and batch hard. The batch all strategy selects all valid triplets and averages the loss. The batch hard strategy selects the hardest positive and negative samples within the batch when forming the triplets. The authors suggested that the batch hard strategy with a soft margin yields better performance. [43] found that selecting the hardest triplets often led to bad local minima. They argued that the bias in the triplet selection degraded the performance of learning with triplet loss, and proposed a new variant of triplet loss that adaptively corrects the distribution shift on the selected triplets.

Besides, there are many other successful practices in applying triplet loss to the ReID task. [6] proposed a multi-channel convolutional neural network to learn global-local parts features and improved the triplet loss by requiring the intra-class feature distances to be less than a predefined threshold. [4] extended the triplet loss to a quadruplet form and required the intra-class variations to be smaller than any inter-class variations.
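For reference, a minimal PyTorch sketch of (1) with the batch hard strategy of [13] could look like the following (the function and variable names are ours, not the authors' released code):

```python
import torch

def batch_hard_triplet_loss(feats, labels, margin=1.0):
    """Batch-hard triplet loss: for each anchor, take the hardest positive
    (farthest same-label sample) and the hardest negative (closest
    different-label sample) within the batch, then apply the margin."""
    dist = torch.cdist(feats, feats)                      # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-identity mask
    d_ap = (dist * same.float()).max(dim=1).values        # hardest positive distance
    d_an = dist.masked_fill(same, float("inf")).min(dim=1).values  # hardest negative
    return torch.clamp(d_ap + margin - d_an, min=0).mean()
```

Note that `torch.cdist` already makes this quadratic in the batch size, which illustrates the cost that motivates the FAT relaxation in Section 3.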
[44] generalized the point-to-point (P2P) triplet loss to the point-to-set (P2S) form by assuming a positive set (to which the anchor belongs) and a negative set (including all other clusters) for each anchor. It then penalizes the difference between the distance from the anchor to the positive set centroid and the anchor-to-negative-centroid distance. The model was also trained in a soft hard-mining scheme with greater weights on harder samples.

Being related to previous works [9, 13, 44], FAT loss differs substantially in the following ways:

• FAT loss has linear time complexity w.r.t. training dataset size: $O(PK)$ or $O(P^2K)$ (depending on the choice of negative set), where $K$ denotes the average image number per identity and $P$ the number of identities. Previous triplet losses have either cubic (vanilla) or quadratic (with hard sample mining) time complexity w.r.t. training dataset size.

• FAT loss is analytically derived from the upper bound of the standard triplet loss. It consists of a P2S loss term and an intra-class compactness regularization. To the best of our knowledge, all previous approximations or accelerations of triplet loss, e.g., [6, 44], are only empirical.

• We studied different choices of the negative cluster/centroid, and compared their impacts. Note that FAT loss chooses the negative on the "cluster" level, and does not refer to any individual sample mining.
Learning from Noisy Labels
The growing scale of training datasets embraces the potential of more powerful models, but introduces sample outliers and label noise during data collection and annotation. [36] observed that a face recognition model trained with only a 30% subset of manually cleaned samples can achieve comparable performance with models trained on the full dataset. To overcome the negative effect of noisy labels, [29] proposed a bootstrap technique to modify the labels on-the-fly by augmenting the prediction objective with a notion of consistency. [22] extended [26] and proposed a re-weighting method that can be combined with any surrogate loss function for classification, to handle class-conditional random label flipping. [33] introduced an extra noise layer to absorb the label noise by adapting the network outputs to the noisy label distribution. [11] further augmented the correction architecture by adding a softmax layer on top to explicitly connect the correct labels to noisy ones. [27] provided a forward-and-backward loss correction method given a class-conditional label flipping probability. [35] proposed a generic conditional random field (CRF) model as a robust loss to be plugged into any existing network for label space smoothness and therefore noise resistance. [37] designed a Siamese network to distinguish clean labels from noisy labels and to simultaneously give clean labels more emphasis.

Interestingly, various kinds of label noise, such as class-conditional or sample-conditional label flipping, mislabeling, and multi-person co-existence, are extensively found in ReID datasets. Yet to the best of our knowledge, few previous works have formally studied how to handle them, and how that may improve ReID performance.
Network Distillation
Network distillation was first developed in [14] to transfer the knowledge in an ensemble of models to a single model, using a soft target distribution produced by the former models. [2] used distillation to train a more efficient and accurate predictor. [23] unified distillation and privileged information into one generalized distillation framework to learn better representations. [28] further extended data distillation to omni-supervised learning, ensembling predictions from multiple transformations of unlabeled data to generate new training annotations using a single network. [24, 10] applied data distillation to multi-modal training, where the testing sets might have noisy or missing modalities. As a relevant work, [20] argued that noisy labels contain useful "side information" and shall not be discarded. The authors proposed a distillation approach to learn from noisy data guided by a knowledge graph.

Our proposed distillation algorithm for learning from noisy labels differs from previous ones in the following respects:

• We are free from the assumption of the existence of a manually-cleaned set. Instead, we train the teacher network with the entire noisy dataset, but only use the most confident samples within a batch to update the parameters. We observed that the model updated based on a subset of confident samples can achieve similar or better performance, compared to the model trained with all noisy-labeled samples.

• We investigate different loss functions for distillation: the teacher network is trained with cross entropy loss with the purpose of providing pseudo soft labels associated with a confidence; the student network is trained with FAT loss using the soft pseudo labels generated by the teacher network. Hence, instead of mimicking a similar softmax classifier as the teacher network, the student network has the capability to "innovate" on a different task with the help of FAT loss, and eventually outperforms the teacher network.
3. Method
Given an anchor image $a$ with the identity label $y_a$, the triplet loss attempts to find a positive sample $p$ with the same identity label $y_p = y_a$ and a negative sample $n$ with a different label $y_n \neq y_a$, and then maximizes the difference between the positive-pair distance $d(a,p)$ and the negative-pair distance $d(a,n)$ by a margin $m$. We typically use the Euclidean distance (or cosine similarity) between learned ReID features $f_E(a), f_E(p), f_E(n)$ as the distance metric. However, computing triplet loss exhaustively over all possible pairs is too expensive to be practical.

Algorithm 1: Derivation of FAT loss as an upper bound for triplet loss (1).

$\mathcal{L}_{tri} = \max\{0,\; d(a,p) + m - d(a,n)\}$
$\leq \max\{0,\; d(a,c_a) + d(c_a,p) + m - \max\{0,\; d(a,c_n) - d(c_n,n)\}\}$  ▷ apply both inequalities in (2)
$= \max\{0,\; d(a,c_a) + d(c_a,p) + m - d(a,c_n) + \min\{d(c_n,a),\; d(c_n,n)\}\}$  ▷ move $d(a,c_n)$ out of the inner max, then reverse the sign
$= \max\{0,\; d(a,c_a) + m - d(a,c_n) + d(c_a,p) + \min\{d(c_n,a),\; d(c_n,n)\}\}$
$= \max\{0,\; d(a,c_a) + m - d(a,c_n)\} + d(c_a,p) + \min\{d(c_n,a),\; d(c_n,n)\}$  ▷ move the non-negative sums out of the max
$\leq \max\{0,\; d(a,c_a) + m - d(a,c_n)\} + d(c_p,p) + d(c_n,n)$  ▷ $c_a = c_p$; $\min\{d(c_n,a), d(c_n,n)\} \leq d(c_n,n)$
$\leq \underbrace{\max\{0,\; d(a,c_a) + m - d(a,c_n)\}}_{\text{anchor-dependent point-to-set loss}} + \underbrace{R(a) + R(n)}_{\text{cluster compactness}}$  ▷ $R(\cdot)$ denotes the radius of the cluster and can be pre-computed
We propose a relaxation of the triplet loss (1) into its upper-bound form. We first have the following two triangle inequalities:

$\max\{0,\; d(a,c_a) - d(c_a,p)\} \leq d(a,p) \leq d(a,c_a) + d(c_a,p)$
$\max\{0,\; d(a,c_n) - d(c_n,n)\} \leq d(a,n) \leq d(a,c_n) + d(c_n,n)$   (2)

where $c_a, c_n$ are defined as the centroids (averages) of the clusters that $a$ and $n$ belong to, respectively. Their proofs are self-evident, given that $d(\cdot)$ is a well-defined distance function in some metric space. Notice that although we use the Euclidean distance for $d(\cdot)$ by default, our derivations are applicable to other distances too.

We next expand our derivation as outlined in Algorithm 1. Interestingly, the upper bound consists of two terms: a point-to-set (P2S) term which depends on the anchor point, plus a penalty term on the cluster compactness, defined as the largest cluster "radius" among all clusters, whose value is decided by the entire dataset and is agnostic to the anchor. We minimize this upper bound instead, and name it the fast approximated triplet (FAT) loss:

$$\mathcal{L}_{FAT} = \sum_{\substack{a,n \\ y_n \neq y_a}} \max\{0,\; d(a,c_a) + m - d(a,c_n)\} + R(a) + R(n). \quad (3)$$

As the name suggests, the new loss gives rise to similarly competitive ReID performance compared to the full triplet loss, but with tremendously better efficiency. We now analyze FAT loss w.r.t. the triplet loss from two aspects.

As can be seen directly from its form, FAT loss reduces the cubic/quadratic time complexity of computing triplet loss to linear complexity w.r.t. the training set size. We also examine how tightly it approximates the original triplet loss. Observing Algorithm 1, three relaxations take place in the second, sixth, and seventh lines. For the first one, the equality in (2) is met when $a, c_a, p$ are co-linear with $a, p$ on the same side of $c_a$, while $a, c_n, n$ are also co-linear with $a, n$ on different sides of $c_n$.
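As a concrete sanity check on (3), here is a minimal PyTorch sketch (our own illustration, not the released code; it picks the hardest negative centroid per anchor, one of the negative-set choices compared later, and assumes the radii R(·) are precomputed per class):

```python
import torch

def fat_loss(feats, labels, centroids, radius, margin=1.0):
    """FAT loss (3): point-to-set margin term plus cluster compactness.
    centroids: (C, D) per-identity centroids; radius: (C,) precomputed R(.)"""
    d_pos = torch.norm(feats - centroids[labels], dim=1)          # d(a, c_a)
    d_all = torch.cdist(feats, centroids)                         # d(a, c) for every class c
    d_all = d_all.scatter(1, labels.unsqueeze(1), float("inf"))   # exclude the anchor's own class
    d_neg, neg_idx = d_all.min(dim=1)                             # nearest negative centroid
    p2s = torch.clamp(d_pos + margin - d_neg, min=0)              # anchor-dependent P2S term
    return (p2s + radius[labels] + radius[neg_idx]).mean()        # + R(a) + R(n)
```

The cost is one distance computation per anchor-centroid pair, so the loss scales linearly with the number of training images for a fixed number of identities.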
The second relaxation becomes tight if and only if $d(a,c_n) \geq d(n,c_n)$, which implies that $a$ is sufficiently far away from the cluster of $c_n$. For the last one, the exact equality is attained only in a very special case, when every cluster has the same radius and every sample in a cluster lies on a circle. In sum, when clusters are well-separated and balanced in size, FAT loss provides a relatively tight approximation to triplet loss. In general, it is always reasonable to expect that minimizing this upper bound also suppresses the original triplet loss value.

Normalized FAT Loss
As a margin loss, FAT loss, like triplet loss, is sensitive to input scales. Given that ReID features are also scale-sensitive (neighboring features in the normalized space can be far away from each other in the original feature space), the learned features are often normalized before feeding into the evaluation metrics. That can be reflected in a normalized FAT loss:

$$\mathcal{L}_{FATnorm} = \max\Big\{0,\; d\Big(\frac{a}{\|a\|}, c'_a\Big) + m - d\Big(\frac{a}{\|a\|}, c'_n\Big)\Big\} + R'(a) + R'(n), \quad (4)$$

where $R'$ is similarly defined as the radius of the normalized sample set. In practice, we empirically find that adding a cross entropy (CE) loss term $\mathcal{L}_{CE}$ helps notably stabilize training with the FAT or normalized FAT loss. That leads to minimizing a hybrid loss (the $\mathcal{L}_{FAT}$ term can be replaced by $\mathcal{L}_{FATnorm}$; $\lambda$ is a scalar):

$$\mathcal{L}_{CE\text{-}FAT} = \mathcal{L}_{FAT} + \lambda \cdot \mathcal{L}_{CE} \quad (5)$$

Choices of Centroids
The choice of cluster centroids is also found to be critical to the effectiveness of FAT loss. Four options of cluster centroids are available: i) mean of cluster features; ii) mean of normalized cluster features; iii) normalized mean of cluster features; and iv) normalized mean of normalized cluster features. Mathematically:

$$C_i^{(1)} = \frac{1}{N_i} \sum_{y_k = i} f_E(X_k), \qquad C_i^{(2)} = \frac{1}{N_i} \sum_{y_k = i} \frac{f_E(X_k)}{\|f_E(X_k)\|},$$
$$C_i^{(3)} = \frac{\sum_{y_k = i} f_E(X_k)}{\big\|\sum_{y_k = i} f_E(X_k)\big\|}, \qquad C_i^{(4)} = \frac{\sum_{y_k = i} \frac{f_E(X_k)}{\|f_E(X_k)\|}}{\Big\|\sum_{y_k = i} \frac{f_E(X_k)}{\|f_E(X_k)\|}\Big\|} \quad (6)$$
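Written out in code, the four options in (6) read as follows (a PyTorch sketch with our own naming; `feats` holds the features f_E(X_k) of all samples of one identity):

```python
import torch

def centroid(feats, option):
    """Four centroid choices of Eq. (6) for one identity's features (N_i, D)."""
    unit = feats / feats.norm(dim=1, keepdim=True)  # per-sample normalized features
    if option == 1:                                 # i) mean of cluster features
        return feats.mean(dim=0)
    if option == 2:                                 # ii) mean of normalized features
        return unit.mean(dim=0)
    if option == 3:                                 # iii) normalized mean of features
        m = feats.mean(dim=0)
        return m / m.norm()
    if option == 4:                                 # iv) normalized mean of normalized features
        m = unit.mean(dim=0)
        return m / m.norm()
    raise ValueError("option must be 1..4")
```

Options 3 and 4 always produce unit-norm centroids, which is why only they (and option 2) are candidates for the normalized FAT loss.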
Norm F2 Norm
C1C2 C3C4
Figure 2:
Example of four different centroid options.
A visual comparison of the four options is shown in Figure 2. Since the original FAT loss (3) is calculated on un-normalized features, only the first centroid option (the plain mean) makes sense for it. The remaining three options can all be utilized for the normalized FAT loss (4). Our experiments indicate that the normalized mean of normalized cluster features works best with the normalized FAT loss.

Typically, there are three common types of label noise in ReID datasets: i) label flip, i.e., an image is assigned to a wrong identity class; ii) mislabeling, i.e., an image does not belong to any known identity class; iii) multiple identities co-existing in one image. Similar to other margin-based losses, triplet loss is highly sensitive to label noise. Since the proposed FAT loss has a P2S term in which all samples within the same cluster are averaged, it already alleviates noisy labels to some extent. We hereby propose a label distillation approach based on a teacher-student model to further improve the robustness of FAT loss to label noise, using "soft labels" predicted from a teacher model trained with a loss that is less sensitive to label noise, e.g., cross entropy. The pipeline is plotted in the supplementary, with details explained below.

We first use a self-bootstrapping approach to learn the teacher model robustly. The teacher net is first trained with cross entropy loss on classifying all samples (including noisy labels) for 5 epochs. It was previously observed that a network is more inclined to learn "easy samples" with high confidence within the early stage of training [17, 16]. Those confident, easy samples are hypothesized to have labels that are semantically consistent and correct, less confusing and ambiguous, and therefore more reliable. We identify the most confidently predicted samples based on the entropy of their currently predicted softmax vectors.
We then resume training for another 5 epochs; but now in each epoch, we keep using those identified confident samples, while not using or only partially using the others that are more likely to contain label noise or outliers. We periodically repeat the above process, and each time we may gradually enlarge the pool of confident examples as the training continues. More details are presented in Section 4.1.

After the teacher model is trained, its predictions are treated as soft labels to replace the original labels for training the student model with FAT loss. Only the "confident" labels eventually selected by the teacher net participate in the averaging that estimates the cluster centroids. If we use the hybrid FAT loss (5), then the soft labels are also the prediction targets for the cross entropy (softmax) loss.
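The entropy-based selection of confident samples described above can be sketched as follows (an illustrative fragment; the helper name and keep-fraction parameter are our own, corresponding to the "percentage" selection modes detailed in Section 4.1):

```python
import torch
import torch.nn.functional as F

def select_confident(logits, keep_frac=0.5):
    """Rank samples by the entropy of their predicted softmax vectors and
    keep the most confident fraction as the trusted training subset."""
    prob = F.softmax(logits, dim=1)
    entropy = -(prob * prob.clamp_min(1e-12).log()).sum(dim=1)  # low entropy = confident
    k = max(1, int(keep_frac * len(entropy)))
    return entropy.topk(k, largest=False).indices               # indices of trusted samples
```

During the teacher's later epochs, the loss would then be computed only on `logits[idx]` and the corresponding labels, where `idx` is the returned index set.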
4. Experiment
We evaluate the proposed method on the three most popular large-scale benchmarks: Market-1501 [47], DukeMTMC-reID [30, 50], and MSMT17 [38].
Market-1501 comprises 32,668 labeled images of 1,501 identities captured by six cameras. Following [47], 12,936 images of 751 identities are used for training, while the rest are used for testing. Among the testing data, the test probe set is fixed with 3,368 images of 750 identities. The test gallery set also includes 2,793 additional distractors.
DukeMTMC-reID is a subset of the DukeMTMC dataset [30] for person ReID. This dataset contains 36,411 images of 1,812 identities, cropped from the videos every 120 frames. These images are captured by eight cameras, among which 1,404 identities appear in more than two cameras and 408 identities (distractors) appear in only one camera. The 1,404 identities are randomly divided, with 702 identities for training and the others for testing. In the testing set, one query image for each ID in each camera is chosen for the probe set, while all remaining images, including distractors, are in the gallery.
MSMT17 is the currently largest publicly available ReID dataset. It has 126,441 images of 4,101 identities captured by a 15-camera network (12 outdoor, 3 indoor). We follow the training-testing split of [38]. The video is collected under different weather conditions at three time slots (morning, noon, afternoon). All annotations, including camera IDs, weather, and time slots, are available. MSMT17 is significantly more challenging than the other two, due to its massive scale, more complex and dynamic scenes, and severe label noise (see examples in the supplementary).
Implementation of FAT Loss
We implement our FAT loss in the PyTorch deep learning framework. In the training phase, all images are resized to 144 × 432 and then randomly cropped into 128 × 384 sub-images. Standard horizontal flipping is adopted for data augmentation. In the test phase, all images are resized to 128 × 384 and no data augmentation is applied. All images have the training set mean subtracted and are then normalized by the training set standard deviation before feeding into the network.

Following a standard ReID protocol, we use a ResNet [12] or DenseNet [15] backbone as the feature extractor $f_E$ towards learning a pedestrian representation directly supervised by the FAT loss. The cluster centroids are computed at the beginning of each epoch, using the un-normalized mean centroid for FAT loss and the normalized mean of normalized features for normalized FAT loss, per Equation (6). Besides, we also compare four different options for choosing the negative cluster $c_n$ when computing FAT loss: i) ctrdAll: consider all identity classes different from the one $a$ belongs to; ii) ctrdAvg: treat all other classes, except the one that $a$ belongs to, as one cluster and obtain one negative centroid by computing the average of all negative centroids, which is similar to [44] but differs in the way of calculating all negative samples' mean; iii) ctrdHM: find a hard negative cluster (in terms of the closest centroid to the one that $a$ belongs to) from all classes of the whole dataset; iv) batchHM: find a hard negative cluster on the "batch level", e.g., from all classes that are sampled by the current batch.

Table 1: Evaluation results on Market1501 and transfer results from Market1501 to DukeMTMC-reID. We use ResNet50 as our default backbone, trained on Market1501, with only one exception indicated by * using a DenseNet161 backbone.

loss                          negative  margin | Market1501: top1 top5 top10 mAP | Duke transfer: top1 top5 top10 mAP
Histogram Loss [34]           NA        NA     | 59.5  80.7  86.9  -    | -    -    -    -
Multi-loss class [19]         NA        NA     | 83.9  -     -     64.4 | -    -    -    -
Point to Set Similarity [52]  NA        NA     | 70.7  -     -     44.3 | -    -    -    -
Triplet loss [13]             NA        1      | 84.9  94.2  -     69.1 | -    -    -    -
Support Neighbor Loss [18]    NA        NA     | 88.3  -     -     73.4 | -    -    -    -
CycleGAN [8]                  NA        NA     | -     -     -     -    | 38.5 54.6 60.8 19.9
CE-FAT                        ctrdAll   1      | 89.1  95.0  96.7  71.6 | 34.4 51.5 57.6 18.9
CE-FAT                        ctrdAvg   1      | 89.2  95.3  97.0  72.4 | 35.1 51.2 57.6 19.2
CE-FAT                        ctrdHM    1      | 87.1  94.7  96.3  69.9 | 34.3 50.8 56.9 18.0
CE-FAT                        batchNeg  1      | 89.4  95.6  97.1  73.1 | 37.3 52.3 58.4 20.3
CE-P2S                        ctrdAll   1      | 87.4  95.0  96.7  68.9 | 27.6 42.9 50.0 14.1
CE-P2S                        batchNeg  1      | 87.2  94.6  96.7  67.0 | 28.1 42.6 49.2 14.3
CE-P2Snorm                    batchNeg  0.1    | 87.5  95.3  96.8  68.1 | 27.8 41.7 48.7 13.6
CE-FATnorm                    batchNeg  0.1    | 88.6  95.1  96.7  69.7 | 35.0 50.6 57.4 18.9
CE-FAT* (DenseNet161)         batchNeg  1      |

Table 2: Evaluation results on DukeMTMC-reID and transfer results from DukeMTMC-reID to Market1501. We use ResNet50 as our backbone, trained on DukeMTMC-reID, with only one exception indicated by * using a DenseNet161 backbone.

loss                          negative  margin | Duke: top1 top5 top10 mAP | Market1501 transfer: top1 top5 top10 mAP
Deep-Person [1]               NA        NA     | -    -    -    -    | -    -    -    -
CycleGAN [8]                  NA        NA     | -    -    -    -    | 48.1 66.2 72.7 20.7
CE-P2Snorm                    batchNeg  0.1    | 76.5 87.3 90.6 57.3 | 46.5 63.9 71.0 19.9
CE-FATnorm                    batchNeg  0.1    | 77.9 87.8 91.4 58.3 | 49.8 65.8 73.2 21.2
CE-P2S                        batchNeg  1      | 78.2 88.5 91.8 59.5 | 47.0 64.6 71.4 19.7
CE-FAT                        batchNeg  1      | 78.8 88.7 91.5 60.8 | 49.1 67.1 73.9 21.8
CE-FAT* (DenseNet161)         batchNeg  1      | 80.8

Table 3: Evaluation results on MSMT17, DukeMTMC-reID, and Market1501. We use ResNet50 as our backbone, trained on MSMT17 with different negative sets.

loss       negative set | MSMT17: top1 top5 top10 mAP | Duke transfer: top1 top5 top10 mAP | Market1501 transfer: top1 top5 top10 mAP
PDC [32]   NA           | 58.0 73.6 79.4 29.7 | -    -    -    -    | -    -    -    -
GLAD [39]  NA           | 61.4 76.8 81.6 34.0 | -    -    -    -    | -    -    -    -
HHL [51]   NA           | -    -    -    -    | 45.0 59.4 64.4 23.0 |
CE-FAT     ctrdAll      | 68.8 81.4 85.4 39.1 | 50.9 65.0 70.2

Table 4: Evaluation results of the Teacher Net on MSMT17, DukeMTMC-reID, and Market1501. We use ResNet50 as our backbone, trained on MSMT17.

Method           | MSMT17: top1 top5 top10 mAP | Duke transfer: top1 top5 top10 mAP | Market1501 transfer: top1 top5 top10 mAP
whole set  soft percentage  62.9 76.1 80.9 32.6

Table 5: Evaluation results of the Student Net on MSMT17, DukeMTMC-reID, and Market1501. We use ResNet50 as our backbone, trained on MSMT17.

loss       negative set | MSMT17: top1 top5 top10 mAP | Duke transfer: top1 top5 top10 mAP | Market1501 transfer: top1 top5 top10 mAP
HHL [51]   NA           | -    -    -    -    | 45.0 59.4 64.4 23.0 |
CE-FAT     batchNeg     |

Implementation of Label Distillation
The heavy label noise on MSMT17 further motivates us to conduct label distillation experiments on it. Following the basic routine described in Section 3.2, we further study four different modes of identifying confident samples: i) hard threshold: select all samples whose softmax entropy values are below a pre-set threshold t as the trusted training subset, and discard all unselected samples; ii) soft threshold: select all samples whose softmax entropy values are below a pre-set threshold t/2, and then randomly select 50% of the remaining (unselected) samples to add into the trusted training subset; iii) hard percentage: always select the 50% of samples with the lowest softmax entropy values as the trusted training subset; iv) soft percentage: always select the 25% of samples with the lowest softmax entropy values first, and then randomly select another 1/3 of the remaining 75% (unselected) samples to add into the trusted training subset.

The important difference between the "threshold" and "percentage" methods lies in whether we keep a constant or dynamic size of the trusted training subset for the teacher model. For the first two threshold-based methods, even sticking to the same t throughout one training run, the portion of samples selected into the trusted set will be dynamic, as more samples may become more confident as training continues. Figure 3 visualizes this trend: given a small enough t, the final training stage will always have considered all training samples as trusted, while a larger t may lead to a more "conservative" selection. We choose t = 0.1 as the empirical default value found in experiments for i) and ii). Also, for the two "soft" strategies ii) and iv), our hope is to utilize a larger set of samples while letting the stochastic selection "smooth out" the impact of noisy labels.

Figure 3: The number of samples actually used as the trusted training subset, when training the ResNet50 teacher model with different soft threshold t values, on the Market1501 dataset.

We first present a comprehensive ablation study on the effectiveness of FAT loss in Table 1, using the Market1501 dataset. By default, we use the CE-FAT loss defined in (5), with λ = 1, as it consistently improves over either FAT or CE loss alone. The margin m is chosen as 1 for FAT loss and 0.1 for normalized FAT loss, as validated to be effective in experiments. We study the four choices of the negative cluster (only ctrdAvg was previously explored in a similar form [44]), as well as the FAT loss hyperparameter (margin m). We also compare CE-FAT with CE-P2S, the latter defined by removing the cluster compactness term in FAT loss, as well as the normalized versions of both, denoted as CE-FATnorm and CE-P2Snorm, respectively.

We evaluate different methods in terms of their top-1/top-5/top-10 accuracy and mean average precision (mAP) values obtained on the Market1501 testing set. Moreover, we use the direct transfer performance of the Market1501-trained feature extractor on the DukeMTMC-reID dataset as an additional performance criterion, to avoid overfitting small ReID datasets. A few popular ReID loss options proposed in previous works [34, 19, 52, 13] are also included in the comparison, as is a CycleGAN [8] baseline for transfer evaluation.
Note that CycleGAN is a domain adaptation method that demands re-training on the target domain, while direct transfer needs no extra re-training. First, comparing CE-FAT with ctrdAll, ctrdAvg, ctrdHM, and batchNeg, it is clear that batchNeg outperforms the other three. Second, comparing CE-P2S with CE-FAT in fair settings, we show the necessity of cluster compactness regularization in addition to the P2S loss; for example, without the compactness term, we see 1.8% (ctrdAll) and 2.2% (batchNeg) top-1 accuracy drops on the Market1501 test case, and 7.5% (ctrdAll) and 9.2% (batchNeg) top-1 accuracy drops on the transfer case to DukeMTMC-reID. The performance gaps clearly differentiate FAT loss from previous empirical P2S losses, thanks to our more rigorous upper-bound derivation. Third, no performance gain was observed on Market1501 when using normalized features for FAT/P2S. Finally, CE-FAT outperforms all state-of-the-art losses trained with the same ResNet50 on the Market1501 testing set. Furthermore, after we replace the backbone with DenseNet161, CE-FAT achieves not only further boosted Market1501 testing results, but also impressive direct transfer performance to DukeMTMC-reID, even surpassing CycleGAN domain adaptation [8] that is re-trained with the target domain data.

Tables 2 and 3 report similar experiments using the DukeMTMC-reID and MSMT17 datasets, respectively. While most observations align with the Market1501 cases, we find the training behavior on MSMT17 to slightly differ from the other two (much) smaller datasets. In particular, while batchNeg remains effective for its own testing set, ctrdAll becomes the best option when it comes to the feature transferability evaluation. That might be attributed to the heavier label noise on MSMT17, which likely benefits from averaging the triplet effects between the current cluster and all others. Also, we observe CE-FATnorm to outperform CE-FAT when transferring from MSMT17 to the other two datasets.
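As a rough sketch of the structure being ablated here: FAT replaces point-to-point triplet distances with a point-to-set hinge against a negative cluster, plus a cluster-compactness term, giving linear cost in the number of samples rather than enumerating triplets. This is not the paper's exact Eq. (5); the batchNeg (hardest in-batch negative sample) and ctrdHM (hardest negative centroid) variants below are our approximations, and the ctrdAll/ctrdAvg definitions, the λ-weighted combination with CE, and feature normalization are omitted.

```python
import numpy as np

def fat_loss(feats, labels, margin=1.0, neg="batchNeg"):
    """Point-to-set hinge + cluster-compactness regularization (illustrative sketch).

    feats:  (n, d) feature batch; labels: (n,) identity labels.
    """
    classes = np.unique(labels)
    # Class centroids, computed once per batch -> linear complexity.
    centroids = {c: feats[labels == c].mean(axis=0) for c in classes}
    p2s, compact = 0.0, 0.0
    for f, y in zip(feats, labels):
        d_pos = np.linalg.norm(f - centroids[y])      # distance to own centroid
        if neg == "batchNeg":                         # hardest negative sample in the batch
            d_neg = np.linalg.norm(feats[labels != y] - f, axis=1).min()
        else:                                         # "ctrdHM": hardest negative centroid
            d_neg = min(np.linalg.norm(f - centroids[c]) for c in classes if c != y)
        p2s += max(0.0, margin + d_pos - d_neg)       # point-to-set triplet surrogate
        compact += d_pos                              # pulls each cluster tight
    n = len(feats)
    return p2s / n + compact / n
```

Dropping the `compact` term recovers a plain P2S loss, which is the CE-P2S ablation compared against above.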
That implies that normalization may become essential to overcome feature scale variances on large datasets. Finally, training ResNet50 with the CE-FAT loss and batchNeg has surpassed the state-of-the-art performance [38] ever reported on MSMT17.

To overcome the noisy label issue on MSMT17, we next investigate label distillation to further unleash the power of FAT loss. Both teacher and student nets adopt the same ResNet50 backbone for simplicity. As shown in Table 4, for the training of the teacher net, the soft threshold/percentage methods appear to outperform their hard counterparts, as they can learn from a wider variety of samples (while hard methods may tend to select too many similar easy samples), meanwhile smoothing out the negative impacts of potentially noisy samples through stochastic sampling/averaging effects. In comparison, soft threshold seems to produce superior results on the MSMT17 testing set itself, whereas soft percentage leads to better feature transferability. This implies that soft percentage suffers from less overfitting, because of its curriculum-style learning (as Figure 3 shows) that progressively takes the entire dataset into account. To our surprise, our teacher net trained with only the trusted subsets selected by soft threshold/percentage yields competitive or even superior performance compared to the one trained with the whole dataset, in particular on transfer cases. That demonstrates that the teacher net learns effectively without being misled by noisy labels.

We then pick the teacher net trained with soft percentage, due to its best transfer performance, to provide soft pseudo labels for training the student net. The training of the student net is supervised by the CE-FAT loss with the batchNeg strategy, using the soft pseudo labels in place of the original one-hot labels for both the CE and FAT terms. The new model in Table 5, dubbed CE-FAT-distillation, does not lead to better test results on MSMT17 than our best result (CE-FAT with batchNeg) in Section 4.2.
However, it produces state-of-the-art direct transfer performance from MSMT17 to DukeMTMC-reID. Its transfer performance to Market1501 largely surpasses that of CE-FAT without distillation, and is competitive with the state-of-the-art HHL domain adaptation [51]. To re-iterate, direct transfer does not re-train on target domain data, as domain adaptation has to.
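The distillation recipe above, a teacher trained on a trusted subset providing soft pseudo labels, and a student whose loss consumes those soft labels in place of one-hot targets, can be sketched as follows. This is illustrative only: the temperature parameter `T` is our assumption (the paper does not specify one in this section), and in the actual model the soft labels also enter the FAT term, not just the CE term shown here.

```python
import numpy as np

def teacher_soft_labels(teacher_logits, T=1.0):
    """Softmax (optionally temperature-scaled) of the teacher's logits,
    used as soft pseudo labels for the student."""
    z = teacher_logits / T
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student against the teacher's soft distribution;
    reduces to standard CE when teacher_probs is one-hot."""
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(teacher_probs * log_p).sum(axis=1).mean()
```

In training, `teacher_soft_labels` would be computed once from the frozen teacher, and `soft_cross_entropy` would replace the hard-label CE term of the student's CE-FAT objective.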
5. Conclusion
This work proposes the fast-approximated triplet (FAT) loss, which remarkably improves efficiency over the standard triplet loss in ReID models. Instead of point-to-point distances, the FAT loss uses point-to-set distances with cluster compactness regularization, derived rigorously as an upper bound of the standard triplet loss, with linear complexity in the training set size. A distillation network is also designed to assign soft labels to samples in place of potentially noisy hard labels. Extensive experiments demonstrate the high effectiveness and promise of the proposed FAT loss along with label distillation.

References

[1] X. Bai, M. Yang, T. Huang, Z. Dou, R. Yu, and Y. Xu. Deep-person: Learning discriminative deep features for person re-identification. arXiv preprint arXiv:1711.10658, 2017.
[2] S. R. Bulò, L. Porzi, and P. Kontschieder. Dropout distillation. In
Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 99–107, 2016.
[3] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang. Abd-net: Attentive but diverse person re-identification. ICCV, 2019.
[4] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[6] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[7] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPRW), Faces in-the-wild Workshop/Challenge, volume 4, 2017.
[8] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[9] T.-T. Do, T. Tran, I. Reid, V. Kumar, T. Hoang, and G. Carneiro. A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10404–10413, 2019.
[10] N. C. Garcia, P. Morerio, and V. Murino. Modality distillation with multiple stream networks for action recognition. In The European Conference on Computer Vision (ECCV), September 2018.
[11] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[14] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[16] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.
[17] H. Li and M. Gong. Self-paced convolutional neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 2110–2116, 2017.
[18] K. Li, Z. Ding, K. Li, Y. Zhang, and Y. Fu. Support neighbor loss for person re-identification. In , pages 1492–1500. ACM, 2018.
[19] W. Li, X. Zhu, and S. Gong. Person re-identification by deep joint learning of multi-loss classification. arXiv preprint arXiv:1705.04724, 2017.
[20] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li. Learning from noisy labels with distillation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 26(7):3492–3506, 2017.
[22] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, March 2016.
[23] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
[24] Z. Luo, J.-T. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei. Graph distillation for action detection with privileged modalities. In The European Conference on Computer Vision (ECCV), September 2018.
[25] Z. Ming, J. Chazalon, M. M. Luqman, M. Visani, and J.-C. Burie. Simple triplet loss based on intra/inter-class metric learning for face verification. In Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on, pages 1656–1664. IEEE, 2017.
[26] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1196–1204. Curran Associates, Inc., 2013.
[27] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[28] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[29] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
[30] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In The European Conference on Computer Vision (ECCV), September 2016.
[31] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[32] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[33] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080.
[34] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pages 4170–4178, 2016.
[35] A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5596–5605. Curran Associates, Inc., 2017.
[36] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. Change Loy. The devil of face recognition is in the noise. In The European Conference on Computer Vision (ECCV), September 2018.
[37] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia. Iterative learning with open-set noisy labels. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[38] L. Wei, S. Zhang, W. Gao, and Q. Tian. Person transfer GAN to bridge domain gap for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[39] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. Glad: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pages 420–428, New York, NY, USA, 2017. ACM.
[40] Q. Xiao, H. Luo, and C. Zhang. Margin sample mining loss: A deep learning based method for person re-identification. arXiv preprint arXiv:1710.00478, 2017.
[41] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1249–1258, 2016.
[42] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
[43] B. Yu, T. Liu, M. Gong, C. Ding, and D. Tao. Correcting the triplet selection bias for triplet loss. In The European Conference on Computer Vision (ECCV), September 2018.
[44] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, and X. Bai. Hard-aware point-to-set deep metric for person re-identification. In The European Conference on Computer Vision (ECCV), September 2018.
[45] Y. Yuan, W. Chen, T. Chen, Y. Yang, Z. Ren, Z. Wang, and G. Hua. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In WACV, 2019.
[46] Y. Zhao, Z. Jin, G.-J. Qi, H. Lu, and X.-S. Hua. An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 501–517, 2018.
[47] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[48] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[49] X. Zheng, R. Ji, X. Sun, Y. Wu, F. Huang, and Y. Yang. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pages 1226–1233, 2018.
[50] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[51] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero- and homogeneously. In The European Conference on Computer Vision (ECCV), September 2018.
[52] S. Zhou, J. Wang, J. Wang, Y. Gong, and N. Zheng. Point to set similarity based deep feature learning for person re-identification. In