Deep Self-Learning From Noisy Labels
Jiangfan Han    Ping Luo    Xiaogang Wang
CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong
The University of Hong Kong
{jiangfanhan@link., xgwang@ee.}cuhk.edu.hk    [email protected]

Abstract
ConvNets achieve good results when trained on clean data, but learning from noisy labels significantly degrades performance and remains challenging. Unlike previous works constrained by many conditions, making them infeasible for real noisy cases, this work presents a novel deep self-learning framework to train a robust network on real noisy datasets without extra supervision. The proposed approach has several appealing benefits. (1) Different from most existing work, it does not rely on any assumption on the distribution of the noisy labels, making it robust to real noises. (2) It does not need extra clean supervision or an accessorial network to help training. (3) A self-learning framework is proposed to train the network in an iterative end-to-end manner, which is effective and efficient. Extensive experiments on challenging benchmarks such as Clothing1M and Food101-N show that our approach outperforms its counterparts in all empirical settings.
1. Introduction
Deep Neural Networks (DNNs) achieve impressive results on many computer vision tasks such as image recognition [13, 33, 34], semantic segmentation [22, 40, 24], object detection [5, 30, 27, 18], and cross-modality tasks [20, 21, 41]. However, many of these tasks require large-scale datasets with reliable and clean annotations to train DNNs, such as ImageNet [2] and MS-COCO [19]. Collecting large-scale datasets with precise annotations is expensive and time-consuming, preventing DNNs from being employed in real-world noisy scenarios. Moreover, most of the "ground truth annotations" are from human labelers, who also make mistakes and increase the biases of the data.

An alternative solution is to collect data from the Internet by using different image-level tags as queries. These tags can be regarded as labels of the collected images. This solution is cheaper and more time-efficient than human annotation, but the collected labels may contain noises. A lot of previous work has shown that noisy labels lead to an obvious decrease in the performance of DNNs [38, 23, 26]. Therefore, attention has been focused on how to improve the robustness of DNNs against noisy labels.
Figure 1. An example of solving a two-class classification problem using different numbers of prototypes. Left: the original data distribution; data points with the same color belong to the same class. Upper right: the decision boundary obtained by using a single prototype for each class. Lower right: the decision boundary obtained by using two prototypes for each class. Two prototypes for each class lead to a better decision boundary.

Previous approaches tried to correct the noisy labels by introducing a transition matrix [25, 9] into their loss functions, or by adding additional layers to estimate the noises [6, 32]. Most of these methods followed a simple assumption to simplify the problem: there is a single transition probability between the noisy label and the ground-truth label, and this probability is independent of individual samples. But in real cases, the appearance of each sample has much influence on whether it can be misclassified. Due to this assumption, although these methods worked well on hand-crafted noisy datasets such as CIFAR10 [12] with manually flipped noisy labels, their performance was limited on real noisy datasets such as Clothing1M [38] and Food101-N [15]. Noise-tolerant loss functions [35, 39] have also been developed to fight against label noises, but they share a similar assumption with the above noise-correction approaches, so they are likewise infeasible for real-world noisy datasets. Furthermore, many approaches [15, 17, 37] solved this problem by using additional supervision. For instance, some of them manually selected a part of the samples and asked human labelers to clean these noisy labels. By using extra supervision, these methods could improve the robustness of deep networks against noises. The main drawback of these approaches is that they require extra clean samples, making them expensive to apply in large-scale real-world scenarios.

Among all the above work, CleanNet [15] achieved the existing state-of-the-art performance on real-world datasets such as Clothing1M [38]. CleanNet used a "class prototype" (i.e., a representative sample) to represent each class category and decided whether the label of a sample is correct or not by comparing with the prototype. However, CleanNet also needed additional information or supervision to train.

To address the above issues, we propose a novel framework of Self-Learning with Multi-Prototypes (SMP), which aims to train a robust network on the real noisy dataset without extra supervision. By observing the characteristics of samples in the same noisy category, we conjecture that these samples have a widely spread distribution; a single class prototype can hardly represent all characteristics of a category, and more prototypes should be used to obtain a better representation. Figure 1 illustrates this case, and further exploration is conducted in the experiments. Furthermore, extra information (supervision) is not necessarily available in practice.

The proposed SMP trains in an iterative manner that contains two phases: the first phase trains a network with the original noisy labels and the corrected labels generated in the second phase; the second phase uses the network trained in the first phase to select several prototypes, which are used to generate the corrected labels for the first phase. This framework does not rely on any assumption on the distribution of noises, which makes it feasible for real-world noises. It also uses neither accessorial neural networks nor additional supervision, providing an effective and efficient training scheme.

The contributions of this work are summarized as follows.
(1) We propose an iterative learning framework, SMP, to relabel the noisy samples and train a ConvNet on the real noisy dataset without using extra clean supervision. Both the relabeling and training phases contain only a single ConvNet that can be shared across different stages, making SMP effective and efficient to train. (2) SMP results in interesting findings for learning from noisy data. For example, unlike previous work [15], we show that a single prototype may not be sufficient to represent a noisy class. By extracting multiple prototypes for a category, we demonstrate that more prototypes yield a better representation of a class and better label-correction results. (3) Extensive experiments validate the effectiveness of SMP on different real-world noisy datasets. We demonstrate new state-of-the-art performance on all these datasets.
2. Related Work
Learning on noisy data.
ConvNets achieved great successes when trained with clean data. However, the performance of ConvNets degrades inevitably when training on data with noisy labels [23, 26]. The annotations provided by human labelers on websites such as Amazon Mechanical Turk [10] also introduce biases and incorrect labels. As annotating a large-scale, clean, and unbiased dataset is expensive and time-consuming, many efforts have been made to improve the robustness of ConvNets trained on noisy datasets. They can generally be summarized into the three directions below.

First, the transition matrix was widely used to capture the transition probability between the noisy label and the true label, i.e., a sample with true label y has a certain probability of being mislabeled as a noisy label ỹ. Sukhbaatar et al. [32] added an extra linear layer to model the transition relationships between true and corrupted labels. Patrini et al. [25] provided a loss correction method that estimates the transition matrix by using a deep network trained on the noisy dataset. The transition matrix was estimated by using a subset of cleanly labeled data in [9]. These methods followed the assumption that the transition probability is identical between classes and is irrelevant to individual images. Therefore, they worked well on noisy datasets created intentionally by flipping labels, such as the noisy version of CIFAR10 [12]. However, when applied to real-world datasets such as Clothing1M [38], their performance was limited since the assumption above is no longer valid.

Second, another direction explored robust loss functions against label noises. [4] explored the tolerance of different loss functions under uniform label noise. Zhang and Sabuncu [39] found that the mean absolute error loss is more robust than the cross-entropy loss, but it has other drawbacks; they then proposed a new loss function that combines the benefits of both. However, these robust loss functions have certain constraints, so they did not perform well on real-world noisy datasets.

Third, CleanNet [15] designed an additional network to decide whether a label is noisy or not. The weight of each sample during network training is produced by CleanNet to reduce the influence of noisy labels in optimization. Ren et al. [29] and Li et al. [16] tried to solve noisy-label training by meta-learning. Some methods [7, 11] based on curriculum learning were also developed to train against label noises. A CNN-CRF model was proposed by Vahdat [36] to represent the relationship between noisy and clean labels. However, most of these approaches either required extra clean samples as additional information or adopted a complicated training procedure. In contrast, SMP not only corrects noisy labels without using additional clean supervision but also trains the network in an efficient end-to-end manner, achieving state-of-the-art performance on both the Clothing1M [38] and Food101-N [15] benchmarks. When equipped with a little additional information, SMP further boosts the accuracy on these datasets.

Self-learning by pseudo-labels.
Pseudo-labeling [3, 35, 14] belongs to the self-learning scenario and is often used in semi-supervised learning, where the dataset has a few labeled data and most of the data are unlabeled. In this case, pseudo-labels are given to the unlabeled data by using the predictions of a model pretrained on the labeled data. In contrast, when learning from noisy datasets, all data have labels, but they may be incorrect. Reed et al. [28] proposed to jointly train with noisy and pseudo-labels. However, the method proposed in [28] over-simplifies the assumption on the noise distribution, leading to sub-optimal results. Joint Optimization [35] completely replaced all labels with pseudo-labels; however, [35] discarded the useful information in the original noisy labels. In this work, we predict the pseudo-labels by using SMP and train the deep network with both the original labels and the pseudo-labels in a self-learning scheme.
3. Our Approach
Overview.
Let D be a noisily-labeled dataset, D = {X, Y} = {(x_1, y_1), ..., (x_N, y_N)}, which contains N samples, where y_i ∈ {1, 2, ..., K} is the noisy label corresponding to the image x_i and K is the number of classes in the dataset. Since the labels are noisy, some of them are incorrect, impeding model training. To this end, a neural network F(θ) with parameters θ is defined to transform an image x to a label probability distribution F(θ, x). When training on a cleanly-labeled dataset, the optimization problem is defined as

θ* = argmin_θ L(Y, F(θ, X)),   (1)

where L represents the empirical risk. However, when Y contains noises, the solution of the above equation would be sub-optimal. When label noises are present, all previous work that improves model robustness can be treated as adjusting the terms in Eqn. (1). In this work, we propose to attain the corrected labels Ŷ(X, X_s) in a self-training manner, where X_s indicates a set of class prototypes that represent the distribution of the classes. Our optimization objective is formulated as

θ* = argmin_θ L(Y, Ŷ(X, X_s), F(θ, X)).   (2)

Although the corrected labels Ŷ(X, X_s) are more precise than the original labels Y, we believe that they are still likely to misclassify hard samples as noises. So we keep the original noisy labels Y as a part of the supervision in the above objective function. The corrected label ŷ_i(x_i, X_s) ∈ Ŷ(X, X_s) of image x_i is given by a similarity metric between the image x_i and the set of prototypes X_s. Since the data distribution of each category is complicated, a single prototype can hardly represent the distribution of the entire class. We claim that using multiple prototypes gives a better representation of the distribution, leading to better label correction.

In the following sections, we introduce the iterative self-learning framework in detail, where a deep network first learns from the original noisy dataset and is then used to correct the noisy labels of images. The corrected labels supervise the training process iteratively.

Pipeline. The overall framework is illustrated in Figure 2. It contains two phases, the training phase and the label-correction phase. In the training phase, a neural network F with parameters θ is trained, taking an image x as input and producing the corresponding label prediction F(θ, x). The supervision signal is composed of two branches: (1) the original noisy label y corresponding to the image x, and (2) the corrected label ŷ generated by the second phase of label correction.

In the label-correction phase, we extract the deep features of the images in the training set by using the network G trained in the first stage. Then we explore a selection scheme to pick several class prototypes for each class. Afterward, we correct the label of each sample according to the similarity between its deep features and those of the prototypes. The corrected labels are then used as a part of the supervision in the first training phase. The first and second phases proceed iteratively until training converges.

The pipeline of the training phase is illustrated in Figure 2 (a). This phase aims to optimize the parameters θ of the deep network F. In general, the objective function is the empirical risk of the cross-entropy loss, formulated as

L(F(θ, x), y) = −(1/n) Σ_{i=1}^{n} log(F(θ, x_i)_{y_i}),   (3)

where n is the mini-batch size and y_i is the label corresponding to the image x_i.
When learning on a noisy dataset, the original label y_i may be incorrect, so we introduce a corrected label as complementary supervision. The corrected label is produced by a self-training scheme in the label-correction phase. With the corrected signal, the objective loss function is

L_total = (1 − α) L(F(θ, x), y) + α L(F(θ, x), ŷ),   (4)

where L is the cross-entropy loss shown in Eqn. (3), y is the original noisy label, and ŷ is the corrected label produced by the second phase.
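To make the two-branch supervision concrete, the following is a minimal PyTorch-style sketch of the loss in Eqn. (4), assuming the network outputs raw logits; the function name `smp_loss` and the tensor layout are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F


def smp_loss(logits, noisy_labels, corrected_labels, alpha):
    """Combined objective of Eqn. (4): a weighted sum of cross-entropy
    on the original noisy labels y and on the corrected labels y_hat.

    logits:           (n, K) network outputs before softmax
    noisy_labels:     (n,)   original labels y from the noisy dataset
    corrected_labels: (n,)   labels y_hat from the label-correction phase
    alpha:            scalar in [0, 1]; set to 0 before the first correction
    """
    loss_noisy = F.cross_entropy(logits, noisy_labels)          # Eqn. (3) on y
    loss_corrected = F.cross_entropy(logits, corrected_labels)  # Eqn. (3) on y_hat
    return (1.0 - alpha) * loss_noisy + alpha * loss_corrected
```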
Figure 2. Illustration of the pipeline of the iterative self-learning framework on the noisy dataset. (a) shows the training phase and (b) shows the label-correction phase, where these two phases proceed iteratively. The deep network G can be shared, such that only a single model needs to be evaluated in testing.

The weight factor α ∈ [0, 1] controls the relative importance of the two terms. Since the proposed approach does not require extra information (typically produced by using another deep network or additional clean supervision), at the very beginning of training we set α to 0 and train the network F by using only the original noisy label y. After a preliminary network is trained, we step into the second phase and obtain the corrected label ŷ. At this point, α is set to a positive value and the network is trained jointly by y and ŷ with the objective shown in Eqn. (4).

In the label-correction phase, we aim to obtain a corrected label for each image in the training set. These corrected labels are used in turn to guide the training procedure of the first phase.

For label correction, the first step is to select several class prototypes for each category. Inspired by the clustering method [31], we propose the following method to pick these prototypes. (1) We use the preliminary network trained in the first phase to extract deep features of the images in the training set. In experiments, we employ the ResNet [8] architecture, where the output before the fully-connected layer is regarded as the deep feature, denoted as G(x). Therefore, the relationship between F(θ, x) and G(x) is F(θ, x) = f(G(x)), where f is the operation of the fully-connected layer of ResNet. (2) To select the class prototypes for the c-th class, we extract a set of deep features, {G(x_i)}_{i=1}^{n}, corresponding to a set of images {x_i}_{i=1}^{n} in the dataset with the same noisy label c. Then we calculate the cosine similarity between the deep features and construct a similarity matrix S ∈ R^{n×n}, where n is the number of images with noisy label c and each element S_ij of S is

S_ij = G(x_i)^T G(x_j) / (||G(x_i)|| ||G(x_j)||).   (5)

Here S_ij is a measurement of the similarity between two images x_i and x_j; a larger S_ij indicates that the two images are more similar. Both [31] and [7] used the Euclidean distance as the similarity measurement, but we find that the cosine similarity is a better choice for correcting the labels. Comparisons between the Euclidean distance and the cosine similarity are provided in the experiments.

An issue is that the number of images n in a single category can be huge (e.g., about 70k images per class in Clothing1M), which makes constructing the full matrix S time-consuming. Furthermore, later calculation using such a huge matrix is also expensive. So we randomly sample m images (m < n) from the same class to calculate a similarity matrix S ∈ R^{m×m}, reducing the computational cost. To select prototypes, we define a density ρ_i for each image x_i,

ρ_i = Σ_{j=1}^{m} sign(S_ij − S_c),   (6)

where sign(x) is the sign function (sign(x) = 1 if x > 0, 0 if x = 0, and −1 otherwise). The value of S_c is a constant given by the value of the element ranked at the top 40% of S, where the elements of S are ranked in ascending order from small to large. We find that the concrete choice of S_c does not influence the final result, because we only need the relative density of the images.
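The similarity and density computations of Eqns. (5) and (6) can be sketched as follows for the m sampled features of one class; the NumPy function below is an illustrative assumption, and taking S_c as the 60th percentile is only one reading of the top-40% rule described above.

```python
import numpy as np


def class_similarity_and_density(features):
    """features: (m, d) deep features G(x) for m images randomly sampled
    from one noisy class. Returns the cosine-similarity matrix S of
    Eqn. (5) and the density rho of Eqn. (6) for every sampled image."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = normed @ normed.T  # S_ij = cosine similarity between images i and j

    # S_c: a fixed value taken from the ranked entries of S; here the
    # 60th percentile, i.e. the boundary of the top 40% in ascending order.
    s_c = np.quantile(S, 0.6)

    # rho_i = sum_j sign(S_ij - S_c), with sign in {-1, 0, +1}
    rho = np.sign(S - s_c).sum(axis=1)
    return S, rho
```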
Discussions. From the above definition of the density ρ, an image with larger ρ has more similar images around it. Images with correct labels should be close to each other, while images with noisy labels are usually isolated from the others. The probability density of ρ for images with correct labels and images with wrong labels is shown in Figure 3 (a). We can see that images with correct labels are more likely to have large ρ values, while images with wrong labels appear in the region with low ρ. In other words, the images with larger density ρ have a higher probability of having a correct label in the noisy dataset and can be treated as prototypes to represent this class. If we need p prototypes for a class, we can take the images with the top-p highest density values as the class prototypes.

Nevertheless, the above strategy for choosing prototypes has a weakness: if the chosen p prototypes of a class are very close to each other, their representative ability is equivalent to that of a single prototype. To avoid such a case, we further define a similarity measurement η_i for each image x_i:

η_i = max_{j: ρ_j > ρ_i} S_ij,   if ρ_i < ρ_max
η_i = min_j S_ij,               if ρ_i = ρ_max   (7)

where ρ_max = max{ρ_1, ..., ρ_m}. From the definition of η, the image x_i whose density equals ρ_max (ρ_i = ρ_max) has the smallest similarity measure η_i. Otherwise, for images x_i with ρ_i < ρ_max, the similarity η_i is defined as the maximum cosine similarity between the features G(x_i) of image i and the features G(x_j) of any other image j whose density is higher (ρ_j > ρ_i).

From the above definitions, a small similarity value η_i indicates that the features of image i are not too close to those of other images with density ρ larger than its own. So a sample with a high density value ρ (probably a clean label) and a low similarity value η (a clean label, but moderately far away from other clean samples) fulfills our selection criterion for class prototypes. In experiments, we find that the samples ranked at the top by density ρ often have relatively small similarity values η.

Figure 3. (a) The probability density of the density ρ for samples with correct labels (blue line) and samples with wrong labels (green line), computed for 1280 images sampled from the same noisy class in the Clothing1M dataset. (b) The distribution of similarity η versus density ρ for the same samples as in (a). Red dots are the samples with the top-8 highest ρ values.

As shown in Figure 3 (b), red dots are the samples with density ρ ranked at the top. Most of these red dots have relatively small η values and lie far away from each other. This also supports our claim that the samples in the same class tend to gather in several clusters, so a single prototype can hardly represent an entire class and more prototypes are necessary. In experiments, we select the prototypes that rank at the top in density and have small η.
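Continuing the sketch above, the η measure of Eqn. (7) and the resulting prototype choice could look like the following; the `eta_cutoff` value is a placeholder for the unstated threshold, while the default of 8 prototypes matches the experimental setup described later.

```python
import numpy as np


def select_prototypes(S, rho, features, p=8, eta_cutoff=0.9):
    """Pick p prototypes for one class: high density rho (likely clean)
    and small similarity eta (not redundant with denser samples), Eqn. (7).
    eta_cutoff is an illustrative placeholder for the paper's threshold."""
    m = len(rho)
    i_max = int(np.argmax(rho))
    eta = np.empty(m)
    for i in range(m):
        denser = rho > rho[i]
        if i == i_max or not denser.any():
            eta[i] = S[i].min()          # densest sample: minimum similarity
        else:
            eta[i] = S[i][denser].max()  # max similarity to any denser sample

    order = np.argsort(-rho)             # rank images by density, highest first
    keep = [i for i in order if eta[i] < eta_cutoff][:p]
    return features[keep]                # prototype features for this class
```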
After the selection of prototypes for each class, we have a prototype set {G(X^1), ..., G(X^c), ..., G(X^K)} (represented by deep features), where X^c = {x^c_1, ..., x^c_p} is the set of selected images for the c-th class, p is the number of prototypes for each class, and K is the number of classes in the dataset. Given an image x, we calculate the cosine similarity between its extracted features G(x) and the different sets of prototypes G(X^c). The similarity score σ_c for the c-th class is calculated as

σ_c = (1/p) Σ_{l=1}^{p} cos(G(x), G(x^c_l)),   c = 1, ..., K,   (8)

where G(x^c_l) is the l-th prototype of the c-th class. Here we use the average similarity over the p prototypes, instead of the maximum similarity, because we find that combining (voting over) all the prototypes can prevent misclassifying some hard samples that have almost the same high similarity to different classes. Then we obtain the corrected label ŷ ∈ {1, ..., K} by

ŷ = argmax_c σ_c,   c = 1, ..., K.   (9)

After getting the corrected label ŷ, we treat it as a complementary supervision signal to train the neural network F in the training phase.

As shown in Algorithm 1, the training phase and the label-correction phase proceed iteratively. The training phase first trains an initial network using images x with noisy labels y, as no corrected labels ŷ are provided yet. Then we proceed to the label-correction phase. The feature extractor in this phase shares the same network parameters as the network F in the training phase. We randomly sample m images from the noisy dataset for each class and extract features by F, and the prototype selection procedure then selects p prototypes for each class. A corrected label ŷ is assigned to every image x by calculating the similarity between its features G(x) and the prototypes. This corrected label ŷ is then used to train the network F in the next epoch. The above procedure proceeds iteratively until convergence.

Algorithm 1: Iterative Learning
  Initialize network parameters θ
  for M = 1 : num_epochs do
    if M < start_epoch then
      Sample (X, Y) from the training set
      θ^(t+1) ← θ^(t) − ξ ∇ L(F(θ^(t), X), Y)
    else
      Sample {x^c_1, ..., x^c_m} for each class label c
      Extract the features and calculate the similarity matrix S
      Calculate the density ρ and select the class prototypes G(X^c) for each class c
      Get the corrected label ŷ_i for each sample x_i
      Sample (X, Y, Ŷ) from the training set
      θ^(t+1) ← θ^(t) − ξ ∇ ((1 − α) L(F(θ^(t), X), Y) + α L(F(θ^(t), X), Ŷ))
    end if
  end for
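The label-correction step of Eqns. (8) and (9) then reduces to an average cosine similarity against each class's prototypes followed by an argmax; a minimal sketch under the same assumptions as the snippets above:

```python
import numpy as np


def correct_labels(features, prototypes):
    """features:   (N, d) deep features G(x) of the images to relabel.
    prototypes: (K, p, d) selected prototype features, p per class.
    Returns corrected labels y_hat in {0, ..., K-1} via Eqns. (8)-(9)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    g = prototypes / np.linalg.norm(prototypes, axis=2, keepdims=True)

    # sigma[i, c]: average cosine similarity of image i to the p prototypes
    # of class c (Eqn. 8); the corrected label is the argmax over c (Eqn. 9).
    sigma = np.einsum('nd,kpd->nkp', f, g).mean(axis=2)
    return sigma.argmax(axis=1)
```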
4. Experiments
Datasets.
We employ two challenging real-world noisy datasets to evaluate our approach, Clothing1M [38] and Food101-N [15]. (1) Clothing1M [38] contains 1 million images of clothes, which are classified into 14 categories. The labels are generated from the surrounding text of the images on the Web, so they contain many noises; the accuracy of the noisy labels is 61.54%. Clothing1M is partitioned into training, validation, and testing sets, containing 50k, 14k, and 10k images respectively. Human annotators were asked to clean a set of 25k labels as a clean set; in our approach, these clean labels are not required for training. (2) Food101-N [15] is a dataset for food classification. It contains 101 classes with 310k images searched from the Web. The accuracy of the noisy labels is 80%. It also provides 55k verification labels (cleaned by humans) for the training set.
Experimental Setup.
For the Clothing1M dataset, we use ResNet50 pretrained on ImageNet. The data preprocessing procedure includes resizing the image with a short edge of 256 and randomly cropping a 224×224 patch from the resized image. We use the SGD optimizer with a momentum of 0.9 and weight decay, and the batch size is 128. The initial learning rate is 0.002 and is decreased by a factor of 10 every 5 epochs; the total training process contains 15 epochs. In the label-correction phase, we randomly sample 1280 images for each class in the noisy training set, and 8 class prototypes are picked out for each class. For Food-101N, the learning rate decreases by a factor of 10 every 10 epochs, and there are 30 epochs in total. The other settings are the same as those for Clothing1M.
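For reference, a hedged sketch of this training setup in PyTorch is given below; the weight-decay value, the data-loading wrapper, and the classifier-head replacement are placeholders, not the authors' exact configuration.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Preprocessing described for Clothing1M: short edge resized to 256,
# then a random 224x224 crop.
train_transform = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),
    T.ToTensor(),
])

model = resnet50(pretrained=True)                      # ImageNet-pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 14)   # 14 Clothing1M categories

# SGD with momentum 0.9, batch size 128 (set in the DataLoader), initial
# learning rate 0.002 decayed by 10 every 5 epochs; weight decay is a placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=0.002,
                            momentum=0.9, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```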
Method                 Data                     Accuracy
CleanNet w_hard [15]   1M noisy + 25k verify    74.15
CleanNet w_soft [15]   1M noisy + 25k verify    74.69
Ours                   1M noisy + 25k verify
Cross Entropy          1M noisy + 50k clean     80.27
Forward [25]           1M noisy + 50k clean     80.38
CleanNet w_soft [15]   1M noisy + 50k clean     79.90
Ours                   1M noisy + 50k clean
Table 1. The classification accuracy (%) on Clothing1M compared with other methods.

We adopt the following three settings by following previous work. First, only the noisy dataset is used for training, without any extra clean supervision in the training process. Second, verification labels are provided, but they are not used to train the network directly; e.g., they are used to train the accessorial network as in [15] or to help select prototypes in our method. Third, both the noisy dataset and the 50k clean labels are available for training.

We compare the results in Table 1. We see that in the first case, the proposed method outperforms the others by a large margin, e.g., improving the accuracy from 69.54% to 74.45%, better than Joint Optimization [35] and other methods, showing that our method is effective and suitable for broad situations.

Figure 4. (a) The label accuracy (%) of the labels in the original dataset (Original), of the labels corrected by the label-correction phase in the first iterative cycle (Correct Initial), and of the labels corrected by the model at the end of training (Correct Final) for each class in Clothing1M. (b) Testing accuracy (%) with the number of prototypes p ranging from 1 to 10 for each class. The solid lines denote the accuracy obtained by the model at the end of training (Final); the dotted lines denote the correction accuracy of the model when it first steps into the label-correction phase (Initial). "Noisy" is the result of training on the noisy dataset only; "noisy+verify" indicates that additional verification information is used. (c) Testing accuracy (%) with the weight factor α ranging from 0.0 to 1.0. "Noisy" and "noisy+verify" have the same meaning as in (b).

           Original   Correct Initial   Correct Final
Accuracy   61.74      74.38             77.36
Table 2. Overall label accuracy (%) of the labels in the original noisy dataset (Original), accuracy of the corrected labels generated by the label-correction phase in the first iterative cycle (Correct Initial), and accuracy of the corrected labels generated by the final model when training ends (Correct Final).
Label Correction Accuracy.
We explore the classification accuracy in the label-correction phase. Table 2 lists the overall accuracy on the original noisy set, the accuracy of the corrected labels in the initial iterative cycle (i.e., the first time we step into the label-correction phase after training the preliminary model), and the accuracy of the corrected labels produced by the final model at the end of training (Final). We see that the accuracy after the initial cycle already reaches 74.38%, improving the original accuracy by 12.64% (61.74% vs. 74.38%). The accuracy is further improved to 77.36% at the end of training.

We further explore the classification accuracy for different classes, as shown in Figure 4 (a). We find that for most classes with original accuracies lower than 50%, our method improves the accuracy to higher than 60%. Even for the 5th class ("Sweater"), with about 30% original accuracy, our method still improves the accuracy by 10%. Some of the noisy samples successfully corrected by our approach are shown in Figure 5.

Figure 5. Samples corrected by our method. Left: the original noisy label. Right: the label corrected by our method. The first row is from Clothing1M and the second row from Food101-N.

Number of Class Prototypes p. The number of class prototypes is key to the representation ability for a class. When p = 1, the case is similar to CleanNet [15]; in our method, we use p > 1. Another difference is that CleanNet attained the prototype by training an additional network, whereas we simply select images as prototypes by their density and similarity according to the data distribution.

Figure 4 (b) shows the effect of changing the number of prototypes for each class. We select five p values and evaluate the final test accuracy when training either with only the 1M noisy data or with the additional 25k verification information, as shown by the solid lines. To better observe the influence of the p value, we also evaluate the label-correction accuracy of the model when it first steps into the label-correction phase, similar to the correction accuracy discussed in the last experiment, but this time evaluated on the testing set. This metric is easy to evaluate, so we explore 10 p values from 1 to 10, as shown by the dotted lines. Comparing these two settings, they follow the same trend.

From the results, we find that when p = 1, i.e., one prototype for each class, the accuracies are sub-optimal compared to the others. When using more prototypes, the performance improves considerably, e.g., the accuracy using two prototypes outperforms that of a single one by 2.04%. This also supports our claim that a single prototype is not enough to represent the distribution of a class; multiple prototypes provide a more comprehensive representation.

Weight Factor α. The weight factor α plays an important role in the training procedure, as it decides whether the network concentrates on the original noisy labels Y or on the corrected labels Ŷ.
m (samples per class)              320     640     1280    2560
1M noisy (Final)                   74.37   74.07
1M noisy + 25k verify (Initial)    74.09   73.97   74.17
Table 3. The classification accuracy (%) on Clothing1M with different numbers of samples used to select prototypes for each class. "Final" denotes the accuracy obtained by the model at the end of training; "Initial" denotes the correction accuracy of the model when it first steps into the label-correction phase.

If α = 0, the network is trained using only the noisy labels without correction. The other extreme case is α = 1, where the training procedure discards the original noisy labels and depends only on the corrected labels. We study the influence of α ranging from 0.0 to 1.0; the test accuracy for different α is shown in Figure 4 (c).

From the results, we find that training using only the noisy labels Y (i.e., α = 0) leads to poor performance. Although the corrected labels are more precise, the model trained using only the corrected labels Ŷ is also sub-optimal. The model trained jointly with the original noisy labels Y and the corrected labels Ŷ achieves the best performance at an intermediate value of α (see Figure 4 (c)). The accuracy curve also supports our claim that label correction may misrecognize some hard samples as noises: directly replacing all noisy labels with the corrected ones would make the network focus on simple features and thus degrade its generalization ability.

Number of Samples m. To avoid the massive computation related to the similarity matrix S, we randomly select m images rather than using all of the images in the same class to compute the similarity matrix. We examine how many samples are enough to select class prototypes that represent the class distribution well, i.e., the influence of the number of images m for each class.

The results are listed in Table 3. The experiment setting is similar to the experiments above on the number of class prototypes. The models are trained on the noisy dataset as well as on the noisy dataset plus the extra verification labels, respectively. The results are the accuracy of the trained model on the test set. Besides evaluating the classification accuracy obtained by the final model, we also examine the correction accuracy of the model when it first steps into the label-correction phase, denoted by "Initial" in the table. By analyzing the results in the different cases, we see that the performance is not sensitive to the number of images m. Compared with the roughly 70k training images per class in Clothing1M, we merely sample 2% of them and still obtain class prototypes that represent the distribution of the class well.

Prototype Selection. To explore the influence of the method used to select prototypes, we also use two other clustering methods to obtain the prototypes. One is the density peak method with Euclidean distance [31], and the other is the widely used K-means algorithm, K-means++ [1].
Method                    Data                     Accuracy
K-means++ [1]             1M noisy                 74.08
Density peak Euc. [31]    1M noisy                 74.11
Ours                      1M noisy
K-means++ [1]             1M noisy + 25k verify    76.22
Density peak Euc. [31]    1M noisy + 25k verify    76.05
Ours                      1M noisy + 25k verify
Table 4. The classification accuracy (%) on Clothing1M with different clustering methods used to select the prototypes.

Method                    Accuracy
CleanNet w_hard [15]      83.47
CleanNet w_soft [15]      83.95
Ours                      85.11
Table 5. The classification accuracy (%) on Food-101N compared with other methods.

The prototypes obtained by each of these methods are used to produce the corrected labels for training. The results are listed in Table 4. We see that the method used to generate prototypes does not largely impact the accuracy, implying that our framework is not sensitive to the clustering method; nevertheless, the selection method proposed in this work still performs better than the others.
We also evaluate our method on the Food-101N [15] dataset. The results are shown in Table 5. We find that our method also achieves state-of-the-art performance on Food-101N, outperforming CleanNet [15] by 1.16%.
5. Conclusion
In this paper, we propose an iterative self-learning framework for learning on real noisy datasets. We show that a single prototype is insufficient to represent the distribution of a class and that multiple prototypes are necessary. We also verify our claim that the original noisy labels are helpful in the training procedure even though the corrected labels are more precise. By correcting labels using several class prototypes and training the network jointly and iteratively with the corrected and original noisy labels, this work provides an effective end-to-end training framework that needs neither an accessorial network nor extra supervision on the real noisy dataset. We evaluate the method on different real noisy datasets and obtain state-of-the-art performance.
Acknowledgement
This work is supported in part by SenseTime Group Limited, and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, and CUHK14213616.

References

[1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
[3] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. Pages 1215–1224. IEEE, 2018.
[4] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
[5] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[6] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016.
[7] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong Huang. Curriculumnet: Weakly supervised learning from large-scale web images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–150, 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[9] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pages 10477–10486, 2018.
[10] Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67. ACM, 2010.
[11] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pages 2309–2318, 2018.
[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[14] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
[15] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018.
[16] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5051–5059, 2019.
[17] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1910–1918, 2017.
[18] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[20] Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 338–354, 2018.
[21] Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1950–1959, 2019.
[22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[23] David F Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial intelligence review, 33(4):275–306, 2010.
[24] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
[25] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
[26] Mykola Pechenizkiy, Alexey Tsymbal, Seppo Puuronen, and Oleksandr Pechenizkiy. Class noise and supervised learning in medical domains: The effect of feature extraction. Pages 708–713. IEEE, 2006.
[27] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[28] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
[29] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[31] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
[32] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[34] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
[35] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.
[36] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pages 5596–5605, 2017.
[37] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 839–847, 2017.
[38] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
[39] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pages 8792–8802, 2018.
[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
[41] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In