A Survey of Label-noise Representation Learning: Past, Present and Future
Bo Han, Quanming Yao, Tongliang Liu, Gang Niu, Ivor W. Tsang, James T. Kwok, Fellow, IEEE, and Masashi Sugiyama

• B. Han is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR. E-mail: [email protected]
• Q. Yao is with 4Paradigm Inc. and the Department of Electronic Engineering, Tsinghua University, Beijing, China. E-mail: [email protected]
• T. Liu is with the School of Computer Science, The University of Sydney, Sydney, Australia. E-mail: [email protected]
• G. Niu is with the Imperfect Information Learning Team, RIKEN AIP, Tokyo, Japan. E-mail: [email protected]
• I. W. Tsang is with the Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia. E-mail: [email protected]
• J. T. Kwok is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong SAR. E-mail: [email protected]
• M. Sugiyama is with RIKEN AIP and the Department of Complexity Science and Engineering, The University of Tokyo, Tokyo, Japan. E-mail: [email protected]
Abstract—Classical machine learning implicitly assumes that labels of the training data are sampled from a clean distribution, which can be too restrictive for real-world scenarios. However, statistical-learning-based methods may not train deep learning models robustly with these noisy labels. Therefore, it is urgent to design Label-Noise Representation Learning (LNRL) methods for robustly training deep models with noisy labels. To fully understand LNRL, we conduct a survey study. We first clarify a formal definition for LNRL from the perspective of machine learning. Then, via the lens of learning theory and empirical study, we figure out why noisy labels affect deep models' performance. Based on the theoretical guidance, we categorize different LNRL methods into three directions. Under this unified taxonomy, we provide a thorough discussion of the pros and cons of different categories. More importantly, we summarize the essential components of robust LNRL, which can spark new directions. Lastly, we propose possible research directions within LNRL, such as new datasets, instance-dependent LNRL, and adversarial LNRL. Finally, we envision potential directions beyond LNRL, such as learning with feature-noise, preference-noise, domain-noise, similarity-noise, graph-noise, and demonstration-noise.
Index Terms—Label-noise Learning, Representation Learning
1 INTRODUCTION

"How can a learning algorithm cope with incorrect training examples?" This is the question raised in Dana Angluin's paper entitled "Learning From Noisy Examples" in 1988 [1]. She made the statement that, "when the teacher may make independent random errors in classifying the example data, the strategy of selecting the most consistent rule for the sample is sufficient, and usually requires a feasibly small number of examples, provided noise affects less than half the examples on average". In other words, she claimed that a learning algorithm can cope with incorrect training examples, once the noise rate is less than one half under the random noise model. Over the last 30 years, her seminal research opened a new door to machine learning, since standard machine learning assumes that the label information is fully clean and intact. More importantly, her research indeed echoed the real-world environment, as labels or annotations are often noisy and imperfect in real scenarios.

For example, the surge of deep learning came in 2012, when Geoffrey Hinton's team leveraged AlexNet (i.e., deep neural networks) [2] to win the ImageNet challenge [3] with an obvious margin. However, due to the huge quantity of data, the ImageNet-scale dataset was necessarily annotated by crowdsourced workers in Amazon Mechanical Turk. Due to their limited knowledge, crowdsourced workers cannot annotate specific tasks with 100% accuracy, which naturally brings noisy labels. Another vivid example lies in medical applications, where datasets are typically small. However, it requires domain expertise to label medical data, which often suffers from high inter- and intra-observer variability, leading to noisy labels. We should notice that noisy labels will cause wrong model predictions, which might further influence decisions that impact human health negatively. Lastly, noisy labels are ubiquitous in speech domains, e.g., Voice-over-Internet-Protocol (VoIP) calls [4]. In particular, due to unstable network conditions, VoIP calls are easily prone to various speech impairments, and user feedback should be involved to identify the cause. Such user feedback can be viewed as the cause labels, which are highly noisy, since most users lack the domain expertise to accurately articulate the impairment in the perceived speech.

All the above noisy cases stem from our daily life and cannot be avoided. Therefore, it is urgent to build a robust learning algorithm for handling noisy labels with theoretical guarantees.
In this survey paper, we term such a robust learning paradigm label-noise learning, where the noisy training data $(x, \bar{y})$ are sampled from a corrupted distribution $p(X, \bar{Y})$, and we assume that the features are intact but the labels are corrupted. Note that such an assumption can limit the scope of the study. As far as we know, label-noise learning spans two important ages in machine learning: statistical learning (i.e., shallow learning) and representation learning (i.e., deep learning). In the age of statistical learning, label-noise learning focused on designing noise-tolerant losses or unbiased risk estimators [5]. However, in the age of representation learning, label-noise learning has more options to combat noisy labels, such as designing biased risk estimators or leveraging memorization effects of deep networks [6], [7].
Label-noise representation learning has become very important for both academia and industry. There are two reasons behind this. First, from the essence of the learning paradigm, deep supervised learning requires a lot of well-labeled data, which may cost too much, especially for many start-ups. However, deep unsupervised learning (even self-supervised learning) is too immature to work very well in complex real-world scenarios. Therefore, as a form of deep weakly-supervised learning, label-noise representation learning has naturally attracted much attention and become a hot topic. Second, from the aspect of data, many real-world scenarios lack purely clean annotations, such as financial data, web data, and biomedical data. This has directly motivated researchers to explore label-noise representation learning.

As far as we know, there indeed exist three pioneer surveys related to label noise. Frenay and Verleysen [8] focused on discussing label-noise statistical learning, instead of label-noise representation learning. Although Algan et al. [9] and Karimi et al. [10] focused on deep learning with noisy labels, both of them only considered the image (or medical image) classification task. Moreover, their surveys were written from the applied perspective, instead of discussing methodology. To complement them and go beyond, we want to contribute to the label-noise representation learning area as follows.

• From the perspective of machine learning, we give the formal definition for label-noise representation learning (LNRL). The definition is not only general enough to include all existing LNRL, but also specific enough to clarify what the goal of LNRL is and how we can solve it.
• In contrast to [9], [10], via the lens of learning theory, we provide a deeper understanding of why noisy labels affect the performance of deep models. Meanwhile, we report the generalization of deep models under noisy labels, which coincides with our theoretical findings.
• We perform an extensive literature review from the age of representation learning, and categorize the works in a unified taxonomy in terms of data, objective and optimization. The pros and cons of different categories are analyzed. We also present a summary of insights for each category.
• Based on the above observations, we summarize and discuss the essential components of robust label-noise representation learning. These can help to spark new directions in label-noise representation learning.
• Beyond label-noise representation learning, we propose several promising future directions, such as learning with noisy features, preferences, domains, similarities, graphs, and demonstrations. We hope they can provide some insights.
2. An up-to-date list of papers related to label-noise representation learning is here: https://github.com/bhanML/label-noise-papers.
The position of this survey is explained as follows. Frenay and Verleysen [8] mainly summarized the methods of label-noise statistical learning (LNSL), which cannot be used for deep learning models directly. Note that although both LNSL and LNRL approaches address the same problem setting, they are fundamentally different. First, the underlying theories should be different due to different hypothesis spaces (e.g., Section 3.5.2); second, the potential solutions should be different due to different models (e.g., Section 6). Meanwhile, LNSL may fail to handle large-scale data with label noise, while LNRL is good at handling such data.

Although Algan et al. [9] and Karimi et al. [10] respectively summarized some methods of label-noise representation learning, both of them discussed from the perspective of applications, i.e., (medical) image analysis. Recently, Song et al. [11] summarized some methods of label-noise representation learning from the view of methodology. However, their categorization is totally different from ours in philosophy. In our survey, we introduce label-noise representation learning from three general views: input data, objective functions and optimization policies, with more theoretical understanding.
The remainder of this survey is organized as follows. Section 2 provides the related literature of label-noise learning, and the full version can be found in the Appendix. Section 3 provides an overview of the survey, including the formal definition of LNRL, core issues, and a taxonomy of existing works in terms of data, objectives and optimization. Section 4 is for methods that leverage the noise transition matrix to solve LNRL. Section 5 is for methods that modify the objective function to make LNRL feasible. Section 6 is for methods that leverage the characteristics of deep networks to address LNRL. In Section 7, we propose future directions for LNRL itself. Meanwhile, beyond LNRL, the survey discloses several promising future directions, with conclusions in Section 8.
2 RELATED LITERATURE
We divide the development of label-noise learning into threestages as follows.
Before delving into label-noise representation learning, we give a brief overview of some milestone works in label-noise statistical learning. In 1988, Angluin et al. [1] proved that a learning algorithm can handle incorrect training examples robustly, when the noise rate is less than one half under the random noise model. Lawrence and Schölkopf [12] constructed a kernel Fisher discriminant to formulate the label-noise problem as a probabilistic model. Bartlett et al. [13] justified that most loss functions are not completely robust to label noise. This means that classifiers based on label-noise learning algorithms are still affected by label noise.
During this period, a lot of works emerged and contributed to this area. For example, Crammer et al. [14] proposed the online Passive-Aggressive perceptron algorithm to cope with label noise. Natarajan et al. [5] formally formulated an unbiased risk estimator for binary classification with noisy labels. This work was very important to the area, since it is the first work to provide guarantees for risk minimization under random label noise. Meanwhile, Scott et al. [15] studied the classification problem under the class-conditional noise model, and proposed a way to handle asymmetric label noise. In contrast, van Rooyen et al. [16] proposed the unhinged loss to tackle symmetric label noise. Liu and Tao [17] proposed a method of anchor points to estimate the noise rate, and further leveraged importance reweighting to design surrogate loss functions for class-conditional label noise.

In 2015, the research focus of label-noise learning shifted from statistical learning to representation learning, since deep learning models had become mainstream due to their better empirical performance. Therefore, it became urgent to design label-noise representation learning methods for robustly training deep models with noisy labels.
There are three seminal works in label-noise representation learning with noisy labels from 2015. Sukhbaatar et al. [18] introduced an extra but constrained linear "noise" layer on top of the softmax layer, which adapts the network outputs to model the noisy label distribution. Reed et al. [19] augmented the prediction objective with the notion of consistency via soft and hard bootstrapping. Intuitively, this bootstrapping procedure allows the learner to disagree with an inconsistent training label, and to re-label the training data to improve its label quality. Azadi et al. [20] proposed an auxiliary image regularization technique, which exploits the mutual context information among training data, and encourages the model to select reliable labels.

Following these seminal works, Goldberger et al. [21] introduced a nonlinear "noise" adaptation layer on top of the softmax layer. Patrini et al. [22] proposed the forward and backward loss correction approaches simultaneously. Both Wang et al. [23] and Ren et al. [24] leveraged the same philosophy, namely data reweighting, to learn with label noise. Jiang et al. [6] were the first to leverage small-loss tricks to handle label noise. However, they trained only a single network iteratively, which inherits the accumulated error. To alleviate this, Han et al. [7] trained two deep neural networks, where each network back-propagates the data selected by its peer network and updates itself.

In the context of representation learning, classical methods, such as estimating the noise transition matrix, regularization and designing losses, are still prosperous for handling label noise. For instance, Hendrycks et al. [25] leveraged trusted examples to estimate the gold transition matrix, which approximates the true transition matrix well. Han et al. [26] proposed a "human-in-the-loop" idea to easily estimate the transition matrix. Zhang et al. [27] introduced an implicit regularization called mixup, which constructs virtual training data by linear interpolations of features and labels in training data. Zhang et al. [28] generalized both the categorical cross entropy (CCE) loss and mean absolute error (MAE) loss by the negative Box-Cox transformation. Ma et al. [29] developed a dimensionality-driven learning strategy, which can learn robust low-dimensional subspaces capturing the true data distribution.
Since 2019, label-noise representation learning has matured in the top conference venues. Arazo et al. [30] formulated clean and noisy samples as a two-component (clean-noisy) beta mixture model on the loss values. Hendrycks et al. [31] empirically demonstrated that pre-training can improve model robustness against label corruption for large-scale noisy datasets. Under the criterion of balanced error rate (BER) minimization, Charoenphakdee et al. [32] proposed the barrier hinge loss. In contrast to selecting samples via small-loss tricks, Thulasidasan et al. [33] introduced abstention-based training, which allows deep networks to abstain on confusing samples while learning on non-confusing samples. Following the re-weighting strategy, Shu et al. [34] parameterized the weighting function adaptively as a one-layer multilayer perceptron called Meta-Weight-Net.

Menon et al. [35] mitigated the effects of label noise from an optimization lens, which naturally introduces the partially Huberised loss. Nguyen et al. [36] proposed a self-ensemble label filtering method to progressively filter out the wrong labels during training. Li et al. [37] modeled the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples. Lyu et al. [38] proposed a provable curriculum loss, which can adaptively select samples for robust stagewise training. Han et al. [39] proposed a versatile approach called scaled stochastic integrated gradient underweighted ascent (SIGUA). SIGUA uses stochastic gradient descent on good data, while using scaled stochastic gradient ascent on bad data rather than dropping those data. Five years after the birth of Clothing1M, Jiang et al. [40] proposed a new but realistic type of noisy dataset called "web-label noise" (or red noise).

3 OVERVIEW
In this section, we first provide the notation used throughout the paper in Section 3.1. A formal definition of the LNRL problem is given in Section 3.2 with concrete examples. As the LNRL problem relates to many machine learning problems, we discuss their relatedness and differences in Section 3.3. In Section 3.4, we reveal the core issues that make the LNRL problem hard. Then, according to how existing works handle the core issues, we present a unified taxonomy in Section 3.5.
3.1 Notation

Let $x$ be features and $y$ be labels. Consider a supervised learning task $T$. LNRL deals with a data set $D = \{\bar{D}_{tr}, D_{te}\}$ consisting of a training set $\bar{D}_{tr} = \{(x^{(i)}, \bar{y}^{(i)})\}_{i=1}^{N}$ and a test set $D_{te} = \{x_{te}\}$, where the training set $\bar{D}_{tr}$ is i.i.d. drawn from a corrupted distribution $p(X, \bar{Y})$ ($\bar{Y}$ denotes label corruption). Note that $(X, Y)$ denotes the variable, while $(x, y)$ denotes its selected value. For the corrupted distribution, we assume that the features are intact but the labels are corrupted. Let $p(X, Y)$ be the ground-truth (i.e., non-corrupted) joint probability distribution of input $x$ and output $y$, and $f^*$ be the (Bayes) optimal hypothesis from $x$ to $y$. To approximate $f^*$, the objective determines a hypothesis space $\mathcal{H}$ of hypotheses $f_\theta(\cdot)$ parameterized by $\theta$. The algorithm contains the optimization policy to search through $\mathcal{H}$ in order to find the $\theta^*$ that parameterizes the optimal $f_{\theta^*} \in \mathcal{H}$ for $\bar{D}_{tr}$.

Intuitively, LNRL learns to discover $f_{\theta^*}$ by fitting $\bar{D}_{tr}$ robustly, which can assign correct labels for $D_{te}$. LNRL methods robustly train deep neural networks with noisy labels, where the hypotheses $f_\theta(\cdot)$ can be modeled by non-convex deep neural networks. Since the hypothesis space $\mathcal{H}$ is sufficiently complex in deep neural networks, $f_{\theta^*} \in \mathcal{H}$ can approximate the Bayes optimal $f^*$ well.

3.2 Problem Definition

As LNRL is naturally a sub-area of machine learning, before giving the definition of LNRL, let us recall how machine learning is defined in the literature. We borrow Tom Mitchell's definition here, which is shown in Definition 1.
Definition 1. (Machine Learning [41], [42]). A computer program is said to learn from experience E with respect to some classes of task T and performance measure P if its performance can improve with E on T measured by P.

The above definition is quite classical, and has been widely adopted in the machine learning community. It means that a machine learning problem is defined by three key components: E, T and P. For instance, consider the speech recognition task (T, e.g., Apple Siri): machine learning programs can improve their recognition accuracy (P) via offline training with a large-scale speech data set (E). Another example of T is a hot topic in the security area, called empirical defense [43]. At a high level, machine learning algorithms can make deep neural networks defend against malicious cases. Specifically, a stop sign crafted by malicious people may cause an accident for autonomous vehicles, which employ deep neural networks to recognize the sign. However, after adversarial training with adversarial examples (E), the robust generalization (P) of deep neural networks can improve a lot, which may avoid the above accident with large probability.

The above-mentioned classical applications of machine learning require a lot of "correctly" supervised information $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ for the given tasks. However, this may be difficult and even impossible. As far as we know, LNRL is a special case of machine learning, which belongs to weakly supervised learning [44]. Intuitively, LNRL exactly targets acquiring good learning performance with "incorrectly" (a.k.a., noisy) supervised information provided by the data set $\bar{D}$, i.i.d. drawn from a corrupted distribution $p(X, \bar{Y})$. The noisy supervised information refers to the training data set $\bar{D}_{tr}$, which consists of the intact input features $x^{(i)}$ but with corrupted labels $\bar{y}^{(i)}$. More importantly, LNRL focuses on training deep neural networks robustly, which have many special characteristics, such as memorization effects. Formally, we define LNRL in Definition 2.
3. https://en.wikipedia.org/wiki/Siri
Definition 2. (Label-Noise Representation Learning (LNRL)). LNRL is a special but common case of machine learning problems (specified by E, T and P), where E contains noisy supervised information for the target T. Meanwhile, deep neural networks will be leveraged to model the target T directly.

To understand this definition better, let us show three classical scenarios of LNRL (Table 1):

• Crowdsourcing: Large-scale image data (e.g., ImageNet [45]) was the key factor driving the second surge of deep learning from 2012. Note that it is impossible to annotate data at such a scale individually, which motivates us to leverage crowdsourcing techniques (e.g., Amazon Mechanical Turk). However, the quality of crowdsourced data is normally low, with a certain degree of label noise. Therefore, an important task (T) is to robustly train deep neural networks with crowdsourced data (E), and the trained deep models can be evaluated via test accuracy (P).
• Healthcare: Healthcare is highly related to each individual, whose data requires machine learning techniques to analyze deeply and intelligently. However, intelligent healthcare (T) requires domain expertise to label medical data first, which often suffers from high inter- and intra-observer variability, leading to noisy medical data (E). We should notice that noisy labels will cause a high error rate (P) of deep model predictions, which might further influence decisions that impact human health negatively.
• Speech: In speech tasks (e.g., Apple Siri), machine learning programs can improve their recognition accuracy via offline training with a large-scale speech data set. However, noisy labels are ubiquitous in speech domains, e.g., the task of rating service calls (T). Due to differences in personal mood and understanding, service calls are easily prone to various ratings (E) for the same service. Such ratings can be viewed as labels, which are highly noisy, since most users lack the domain expertise to accurately rate the speech service. Therefore, it is critical to robustly train deep neural networks with the user ratings (E), and evaluate the trained model via the quality rate of service calls (P).

TABLE 1
Illustrations of three LNRL examples based on Definition 2.

T                              | E = (x, y)                                  | P
web-scale image classification | (ImageNet, crowdsourced labels)             | test accuracy
intelligent healthcare         | (medical data, annotations by variability)  | error rate
VoIP speech analysis           | (perceived speech, user feedback)           | quality rate of voice

As noisy supervised information related to T is directly contained in E, it is quite natural that common deep supervised learning approaches will fail on LNRL problems. One of the recent findings (i.e., memorization effects [46]) in the deep learning area may explain this: due to the high model capacity, deep neural networks will eventually fit and memorize label noise. Therefore, when facing the noisy data E, LNRL methods make the learning of the target T feasible by leveraging the intrinsic characteristics of deep neural networks, e.g., memorization effects and non-linearity.

3.3 Related Learning Problems

In this section, we discuss the learning problems relevant to LNRL. Their relatedness to and differences from LNRL are clarified as follows.

• Semi-supervised Learning (SSL) [47] learns the hypothesis $f$ from experience E consisting of both labeled and unlabeled data, where unlabeled data will normally be annotated by pseudo labels. Since the labeling process may be not fully correct and noisy, SSL has some relatedness with LNRL. However, SSL assumes that labeled data are fully clean, which is different from LNRL, where labeled data are still noisy to some degree. To address SSL, there are several typical algorithms, such as [27], [48], [49], [50].
• Positive-Unlabeled Learning (PUL) [51] learns the hypothesis $f$ from experience E consisting of only positively labeled and unlabeled data. Similar to SSL, unlabeled data will normally be annotated by pseudo labels. However, PUL assumes that labeled data are fully clean and only positive. To address PUL, there are several typical algorithms, such as [52], [53], [54].
• Complementary Learning (CL) [55] specifies a class that a pattern does NOT belong to. Namely, CL learns the hypothesis $f$ from experience E consisting of only complementary data. Since the labeling process cannot fully exclude uncertainty, namely which categories a pattern belongs to, CL has some relatedness with LNRL. However, CL requires that all diagonal entries of the transition matrix are zero. Sometimes, the transition matrix may not be required to be invertible in practice. To address CL, there are several typical algorithms, such as [55], [56], [57], [58].
• Unlabeled-Unlabeled Learning (UUL) [59] is a recently proposed learning paradigm, which allows us to train a binary classifier only from two unlabeled datasets with different class priors. Different from SSL and PUL, there are two sets of unlabeled data in UUL instead of one. To address UUL, there are two typical algorithms, including [59], [60].
3.4 Core Issues

When machine learning operates in an ideal environment, the data come with clean supervision. Therefore, the $\ell$-risk under the clean distribution is as follows:

$$R_{\ell,D}(f_\theta) := \mathbb{E}_{(X,Y)\sim D}[\ell(f_\theta(X), Y)], \qquad (1)$$

where $(X, Y)$ is the clean example i.i.d. drawn from the clean distribution $D$, $f_\theta$ is a learning model (e.g., a deep neural network) parameterized by $\theta$, and $\ell$ is normally the cross-entropy loss. In this survey, we consider the classification problem.

However, when machine learning operates in the real-world environment, the data come with noisy supervision. Namely, the $\ell$-risk under the noisy distribution is $\mathbb{E}_{(X,\bar{Y})\sim\bar{D}}[\ell(f_\theta(X), \bar{Y})]$. Furthermore, under limited data, the empirical $\tilde{\ell}$-risk under the noisy distribution is as follows:

$$\widehat{R}_{\tilde{\ell},\bar{D}}(f_\theta) := \frac{1}{n}\sum_{i=1}^{n} \tilde{\ell}(f_\theta(X_i), \bar{Y}_i), \qquad (2)$$

where $(X_i, \bar{Y}_i)$ is the (observed) noisy example i.i.d. drawn from the noisy distribution $\bar{D}$ (with noise rate $\rho$). Note that $\tilde{\ell}$ is a suitably modified loss, which is noise-tolerant. Here, we empirically demonstrate the generalization difference between $\ell$ and $\tilde{\ell}$ under label noise (Figure 1).

[Fig. 1. We empirically demonstrate the generalization difference between the original $\ell$ and the corrected $\tilde{\ell}$. We choose MNIST with 35% of uniform noise as noisy data. There is an obvious gap between $\ell$ and $\tilde{\ell}$ on noisy MNIST.]

Generally, the aim of LNRL is to "construct" a noise-tolerant $\tilde{\ell}$ such that the $f_\theta$ learned in (2) approximates the optimal $f_{\theta^*}$ in (1) well. Specifically, via a suitably constructed $\tilde{\ell}$, we can learn a robust deep classifier $f_\theta$ from the noisy training examples that can assign clean labels for test instances. Before delving into constructing $\tilde{\ell}$, we first take a theoretical look at label-noise learning, which will help us build $\tilde{\ell}$ more effectively.

3.5 Taxonomy

In contrast to [9], [10], via the lens of learning theory, we provide a systematic way to understand LNRL. Our focus is to explore why noisy labels affect the performance of deep models. To figure this out, we should rethink the essence of learning with noisy labels. Normally, there are three key ingredients in label-noise learning problems: input data, objective function and optimization policy.

At a high level, there are three rules of thumb, which explain how to handle noisy labels effectively via deep models; a minimal sketch of the empirical risk in (2) follows this list.

• For data, the key is to discover the underlying noise transition pattern, which directly links the clean class posterior and the noisy class posterior. Based on this insight, it is critical to design an unbiased estimator to estimate the noise transition matrix $T$ accurately.
• For the objective function, the key is to design a noise-tolerant $\tilde{\ell}$, which enjoys statistical consistency guarantees. Based on this insight, it is critical to learn a robust classifier on noisy data, which can provably converge to the classifier learned on clean data.
• For the optimization policy, the key is to explore the dynamic process of optimization policies, which relates to memorization. Based on this insight, it is critical to trade off overfitting/underfitting in training deep networks, via, e.g., early stopping and small-loss tricks.
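To make Eqs. (1)-(2) concrete, below is a minimal numpy sketch (our illustration, not code from the survey) of the empirical risk in Eq. (2). As the noise-tolerant $\tilde{\ell}$, we plug in the mean absolute error, one classical robust choice mentioned above [28]; any of the corrections surveyed later could be substituted.

```python
import numpy as np

def ce_loss(probs, labels):
    """Original loss l: per-example cross-entropy; probs has shape (n, C)."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def mae_loss(probs, labels):
    """One noise-tolerant choice of l-tilde: mean absolute error between
    the one-hot label and the predicted distribution, ||e_y - f(x)||_1."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.abs(one_hot - probs).sum(axis=1)

def empirical_risk(loss_fn, probs, noisy_labels):
    """Eq. (2): the empirical risk (1/n) * sum_i loss(f(x_i), y_bar_i)."""
    return loss_fn(probs, noisy_labels).mean()

# usage: risks over a noisy sample under the original vs. modified loss
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
noisy_labels = np.array([0, 2])
print(empirical_risk(ce_loss, probs, noisy_labels),
      empirical_risk(mae_loss, probs, noisy_labels))
```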
[Fig. 2. A taxonomy of LNRL based on the focus of each method. The tree splits LNRL into three branches. Data (noise transition matrix): adaptation layer (linear case, nonlinear case), loss correction (backward/forward, gold correction, label smoothing), and prior knowledge (HITL estimation, fine-tuning revision). Objective: regularization (explicit, implicit), reweighting (importance, Bayesian), and redesigning (e.g., neural nets, mixture, loss, label ensemble). Optimization (memorization effects): self-training (e.g., MentorNet, learn to reweight), co-training (Co-teaching(+), beyond Co-teaching), and beyond memorization (e.g., Deep k-NN, pre-training). For each technique branch, a few representative works are listed.]
Early works on noisy labels studied random classification noise (RCN) for binary classification [1], [61]. In the RCN model, each instance has its label flipped with a fixed noise rate $\rho \in [0, \frac{1}{2})$. A natural extension of RCN is class-conditional noise (CCN) for multi-class classification. In the CCN model, each instance from class $i$ has a fixed probability $\rho_{i,j}$ of being assigned to class $j$. Thus, it is possible to encode some similarity information between classes. For example, we can expect that the image of a "dog" is more likely to be erroneously labelled as "cat" than "boat". Until 2020, most LNRL methods share an implicit assumption on the noise model, namely the class-conditional noise (CCN) model, which is formulated as $p(\bar{Y}\mid Y)$. Both RCN and CCN study instance-independent label noise.

Specifically, from the perspective of input data, the focus is to build up the noise transition matrix, which models the process of label corruption. In general, there are two types of label noise: instance-dependent label noise (e.g., $p(\bar{Y}\mid Y, X)$) [62] and instance-independent label noise (e.g., $p(\bar{Y}\mid Y)$) [22]. For instance-dependent label noise, the noise transition matrix can be represented as $T(X)$, which depends on features. However, it can be ill-posed to learn the transition matrix $T(X)$ by only exploiting noisy data, i.e., the transition matrix is unidentifiable [62], [63]. Therefore, we emphasize instance-independent label noise here, and the noise transition matrix can be represented as $T$, which is independent of features. In this case, the noise transition matrix $T$ approximately models the process of label corruption. An instance $x_i$ is an anchor point of the $i$-th clean class if $p(Y = e_i \mid x_i) = 1$, where $Y = e_i$ means $Y$ belongs to the $i$-th class. The transition matrix can be obtained via

$$p(\bar{Y} = e_j \mid x_i) = \sum_{k=1}^{C} p(\bar{Y} = e_j \mid Y = e_k, x_i)\, p(Y = e_k \mid x_i) = p(\bar{Y} = e_j \mid Y = e_i, x_i)\, p(Y = e_i \mid x_i) = p(\bar{Y} = e_j \mid Y = e_i, x_i) = T_{ij}.$$

Note that if anchor points are hard to identify, we can use $x_i = \arg\max_{x} \hat{p}(\bar{Y} = e_i \mid x)$ [17] (a minimal estimation sketch is given below). This transition matrix is very important, since it can bridge the noisy class posterior and the clean class posterior, i.e., $p(Y \mid x) = T^{-1} p(\bar{Y} \mid x)$. In practice, this transition matrix has been employed to build risk-consistent estimators via loss correction or classifier-consistent estimators via hypothesis correction. Besides, for inconsistent algorithms, the diagonal entries of this matrix are used to select reliable examples for further robust training.
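The anchor-point identity $T_{ij} = p(\bar{Y}=e_j \mid x_i)$ suggests a direct estimator. Below is a minimal sketch (ours, under the heuristic above): fit a model of the noisy posterior on the noisy data, then read each row of $T$ off at the instance that maximizes $\hat{p}(\bar{Y}=e_i \mid x)$.

```python
import numpy as np

def estimate_T(noisy_posteriors):
    """Anchor-point estimate of the noise transition matrix.

    noisy_posteriors: (n, C) array of p_hat(Y_bar | x) over training
    instances, e.g., softmax outputs of a network fit to the noisy labels.
    For class i, the instance maximizing p_hat(Y_bar = e_i | x) serves as
    an approximate anchor point x_i, and row i of T is p_hat(Y_bar | x_i).
    """
    n, C = noisy_posteriors.shape
    T = np.zeros((C, C))
    for i in range(C):
        anchor = np.argmax(noisy_posteriors[:, i])  # x_i = argmax_x p_hat(Y_bar=e_i | x)
        T[i] = noisy_posteriors[anchor]             # T_ij = p_hat(Y_bar=e_j | x_i)
    return T
```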
From the perspective of the objective function, the focus is to derive statistical consistency guarantees for a robust $\tilde{\ell}$ [5], [35], [64]. Assume that the Frobenius norms of the weight matrices $W_1, \cdots, W_d$ are at most $M_1, \cdots, M_d$. Let the activation functions be 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Let $x$ be upper bounded by $B$, i.e., $\|x\| \le B$ for any $x$. With probability at least $1 - \delta$, if $\ell$ is classification-calibrated, there exists a non-decreasing function $\xi_\ell$ with $\xi_\ell(0) = 0$ such that

$$R_{D}(\hat{f}) - R^* \le \xi_\ell\Big(\min_{f\in\mathcal{H}} R_{\ell,D}(f) - \min_{f} R_{\ell,D}(f) + 4 L_\rho \mathfrak{R}(\mathcal{H}) + 2\sqrt{\log(1/\delta)/n}\Big) \le \xi_\ell\Big(\min_{f\in\mathcal{H}} R_{\ell,D}(f) - \min_{f} R_{\ell,D}(f) + 4 L_\rho B\big(\sqrt{2d\log 2}+1\big)\prod_{i=1}^{d} M_i/\sqrt{n} + 2\sqrt{\log(1/\delta)/n}\Big),$$

where $R_{D}(\hat{f}) = \mathbb{E}_{(X,Y)\sim D}[\mathbb{1}\{\mathrm{sign}(\hat{f}(X)) \ne Y\}]$ denotes the risk of $\hat{f}$ w.r.t. the 0-1 loss; $R^* = R_{D}(f^*)$ denotes the Bayes risk for the Bayes optimal classifier $f^*$ under the clean distribution $D$ [42]; $\hat{f} = \arg\min_{f\in\mathcal{H}} \widehat{R}_{\tilde{\ell},\bar{D}}(f)$; and $L_\rho$ is the Lipschitz constant of $\tilde{\ell}$. When the function class $\mathcal{H}$ is parameterized by deep neural networks, due to [65], the Rademacher complexity $\mathfrak{R}(\mathcal{H})$ of the function class $\mathcal{H}$ is upper bounded by $B(\sqrt{2d\log 2}+1)\prod_{i=1}^{d} M_i/\sqrt{n}$.

Note that $\min_{f\in\mathcal{H}} R_{\ell,D}(f) - \min_{f} R_{\ell,D}(f)$ denotes the approximation error for employing the hypothesis class $\mathcal{H}$. According to the universal approximation theorem [66], if a certain deep network model is employed, $\mathcal{H}$ will be a universal hypothesis class and thus contains the Bayes classifier. Then, $\min_{f\in\mathcal{H}} R_{\ell,D}(f) - \min_{f} R_{\ell,D}(f) = 0$. This means that by employing a proper deep network, the upper bound will converge to zero as the training sample size $n$ increases. Since $R_{D}(\hat{f})$ is always at least $R^*$, $R_{D}(\hat{f})$ will converge to $R^*$. This further means that $\hat{f}$ learned from noisy data (i.i.d. drawn from $\bar{D}$) will converge to the Bayes optimal $f^*$ defined by the clean data.

From the perspective of the optimization policy, the focus is to explore the dynamic process of optimization policies. Take early stopping, a simple yet effective trick to avoid overfitting noisy labels, as an illustrative example [67]. Consider a dataset $(x_i, y_i)_{i=1}^{n}$. Assume an initial weight matrix $W_0$ with i.i.d. $\mathcal{N}(0, 1)$ entries, and let $W_\tau$ be updated via stochastic gradient descent with step size $\eta$, i.e., $W_{\tau+1} = W_\tau - \eta \nabla \mathcal{L}(W_\tau)$. If $\varepsilon \le \delta \lambda(C)/K$ and the noise rate $\rho$ is at most a constant fraction of $\delta$, then with high probability, after $I \propto \|C\|/\lambda(C)$ steps, there are two conclusions [67]. First, the model $W_I$ predicts the true label function $\tilde{y}(x)$ for all inputs $x$ that lie within an $\varepsilon$ neighborhood of a cluster center $\{c_k\}_{k=1}^{K}$. Namely, $\tilde{y}(x) = \arg\min_{l} |f(W_I, x) - \alpha_l|$, where $\{\alpha_l\}_{l=1}^{\bar{K}} \in [-1, 1]$ denotes the labels associated with each class (cf. Definition 1.1 in [67]). Second, for all training samples, the distance to initialization satisfies

$$\|W_\tau - W_0\|_F \lesssim \sqrt{K} + \tau \varepsilon K / \|C\|, \qquad (3)$$

where $0 \le \tau \le I$. The above conclusions demonstrate that gradient descent with early stopping (i.e., $I$ steps) can be robust when training deep neural networks. Moreover, the final network weights do not stray far from the initial weights, since the distance between the initial model and the final model grows with the square root of the number of clusters, $\sqrt{K}$. Intuitively, due to memorization effects, deep neural networks will eventually overfit noisy training data. Thus, it is a good strategy to stop training early, when deep neural networks fit the clean training data in the first few epochs; the robust weights are then not far away from the initial weights. A minimal early-stopping sketch is given below.
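The theory above motivates early stopping in practice. Below is a minimal PyTorch-style sketch (ours); since the ideal stopping step $I$ is unknown in practice, it stops on a held-out set instead, keeping the checkpoint taken before the network starts memorizing noisy labels. The helper `val_acc` is an assumption: any function returning accuracy on held-out data.

```python
import copy
import torch
import torch.nn.functional as F

def train_with_early_stopping(model, loader, val_acc, max_epochs=50, patience=5):
    """SGD training that keeps the best-validation checkpoint: clean
    patterns are fit in early epochs (memorization effects), so the best
    checkpoint tends to precede the overfitting of noisy labels."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    best_acc, best_state, bad_epochs = -1.0, None, 0
    for epoch in range(max_epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        acc = val_acc(model)                 # accuracy on held-out data
        if acc > best_acc:
            best_acc, best_state, bad_epochs = acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                        # stop before memorizing noise
    model.load_state_dict(best_state)
    return model
```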
Based on the above theoretical understanding, we categorize existing works into three general perspectives:

1) Data [68]: From the perspective of data, we can construct $\tilde{\ell}$ by leveraging the noise transition matrix, which captures the relationship between clean and noisy labels. For example, we first model and estimate the noise transition matrix between the latent $Y$ and the observed $\bar{Y}$. Then, via the estimated matrix, different techniques can generate $\tilde{\ell}$ from the original $\ell$. The key step here is to estimate the noise transition matrix. Mathematically, $\widehat{R}_{\tilde{\ell},\bar{D}}(f_\theta) := \frac{1}{n}\sum_{i=1}^{n} \tilde{\ell}(f_\theta(X_i), \bar{Y}_i)$, where $\ell \xrightarrow{T} \tilde{\ell}$, i.e., $\tilde{\ell}$ is a corrected loss obtained from $\ell$ via $T$.

2) Objective: From the perspective of the objective, we can construct $\tilde{\ell}$ by augmenting the objective function (i.e., the original $\ell$) via either explicit or implicit regularization. For instance, we may augment $\ell$ with an auxiliary example regularizer explicitly. Meanwhile, we may augment $\ell$ by designing implicit regularization algorithms, such as soft-/hard-bootstrapping and virtual adversarial training (VAT). Meanwhile, we can construct $\tilde{\ell}$ by reweighting the objective function $\ell$. Lastly, motivated by certain phenomena or criteria, we can also construct and design $\tilde{\ell}$ directly. Thus, $\tilde{\ell}$ has three options:
• $\tilde{\ell} = \ell + r$, where $r$ denotes a regularization term;
• $\tilde{\ell} = \sum_i w_i \ell_i$, where $\ell_i$ denotes the $i$-th sub-objective with coefficient $w_i$;
• $\tilde{\ell}$ has a special format $\ell'$ independent of $\ell$.

3) Optimization [46], [69]: From the perspective of optimization, we can construct $\tilde{\ell}$ by leveraging the memorization effects of deep models. For example, due to the memorization effects, deep models tend to fit the easy (clean) pattern first, and then gradually over-fit the complex (noisy) pattern. Based on this observation, we can back-propagate only the small-loss examples, which is equal to constructing the restricted $\tilde{\ell} = \mathrm{sort}(\ell, 1-\tau)$, namely, sorting $\ell$ from small to large and fetching the $1-\tau$ fraction of small-loss examples ($\tau$ is the noise rate); a minimal sketch of this trick follows.

Accordingly, existing works can be categorized into a unified taxonomy as shown in Figure 2. We will detail each category in the sequel.
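For concreteness, here is a minimal PyTorch sketch (ours) of the small-loss trick $\tilde{\ell} = \mathrm{sort}(\ell, 1-\tau)$: only the $(1-\tau)$ fraction of small-loss examples in a mini-batch contributes to the update, as in MentorNet/Co-teaching-style methods [6], [7].

```python
import torch
import torch.nn.functional as F

def small_loss(logits, noisy_labels, noise_rate):
    """Restricted loss: sort per-example losses, keep the (1 - noise_rate)
    fraction with the smallest loss (treated as likely clean), and average
    the loss over those examples only."""
    losses = F.cross_entropy(logits, noisy_labels, reduction="none")
    n_keep = max(1, int((1.0 - noise_rate) * losses.numel()))
    keep = torch.argsort(losses)[:n_keep]   # indices of small-loss examples
    return losses[keep].mean()
```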
4 DATA

Methods in this section solve the LNRL problem by estimating the noise transition matrix, which builds up the relationship between latent clean labels and observed noisy labels. The structure of this section is arranged as follows. First, we explain what the noise transition matrix is and why this matrix is important. Then, we introduce three common ways to leverage the noise transition matrix for combating label noise. The first common way is to leverage an adaptation layer in the end-to-end deep learning system to mimic the noise transition matrix, which bridges latent clean labels and observed noisy labels. The second common way is to estimate the noise transition matrix empirically, and further correct the cross-entropy loss by the estimated matrix. Lastly, the third common way is to leverage prior knowledge to ease the estimation burden.
4.1 Noise Transition Matrix

Before introducing the three common ways, we first define what the noise transition matrix is, and explain why it is important.
Definition 3. (Noise transition matrix [68]) Suppose that the observed label $\bar{y}$ is noisy, i.i.d. drawn from a corrupted distribution $p(\bar{Y} \mid X)$, where features are intact. Meanwhile, there exists a corruption process, transitioning from the latent clean label $y$ to the observed noisy label $\bar{y}$. Such a corruption process can be approximately modeled via a label transition matrix $T$, where $T_{ij} = p(\bar{y} = e_j \mid y = e_i)$. We term this label transition matrix the noise transition matrix.

To further understand the transition matrix, we present two representative structures of $T$: (1) Sym-flipping [16]; (2) Pair-flipping [26]. The definitions of the label transition matrix $T$ are as follows, where $\tau$ is the noise rate and $n$ is the number of classes.

Sym-flipping:
$$T = \begin{bmatrix} 1-\tau & \frac{\tau}{n-1} & \cdots & \frac{\tau}{n-1} \\ \frac{\tau}{n-1} & 1-\tau & \cdots & \frac{\tau}{n-1} \\ \vdots & & \ddots & \vdots \\ \frac{\tau}{n-1} & \cdots & \frac{\tau}{n-1} & 1-\tau \end{bmatrix};$$

Pair-flipping:
$$T = \begin{bmatrix} 1-\tau & \tau & 0 & \cdots & 0 \\ 0 & 1-\tau & \tau & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ 0 & & & 1-\tau & \tau \\ \tau & 0 & \cdots & 0 & 1-\tau \end{bmatrix}.$$

Specifically, the Sym-flipping structure models the most common classification task under label noise, where the class of a clean label can uniformly flip into the other classes. Meanwhile, the Pair-flipping structure models the fine-grained classification task, where the class (e.g., Norwich terrier) of a clean label can flip into an adjacent class (e.g., Norfolk terrier) instead of a far-away class (e.g., Australian terrier). In the area of label-noise learning, we normally leverage the above two structures to generate simulated noise (see the sketch at the end of this subsection), and explore the root cause of why the proposed algorithms can work on the simulated noise. Nonetheless, real-world scenarios are very complex, where the noise transition matrix may not have structural rules (i.e., it is irregular). For example, Clothing1M is a clothing dataset crawled from Taobao, where mislabeled clothing images often share similar visual patterns. The noise structure of Clothing1M is irregularly asymmetric, which is hard to estimate.

In mathematical modeling, the $ij$-th entry of the transition matrix, i.e., $[T(x)]_{ij} = p(\bar{Y} = e_j \mid Y = e_i, X = x)$, represents the probability that the instance $x$ with the clean label $Y = e_i$ will have a noisy label $\bar{Y} = e_j$. The transition matrix has been widely studied to build statistically consistent classifiers, because the clean class posterior $p(Y \mid x) = [p(Y = e_1 \mid X = x), \ldots, p(Y = e_c \mid X = x)]^\top$ can be inferred by using the transition matrix and the noisy class posterior $p(\bar{Y} \mid x) = [p(\bar{Y} = e_1 \mid X = x), \ldots, p(\bar{Y} = e_c \mid X = x)]^\top$, i.e., we have the important equation $p(\bar{Y} \mid x) = T(x)\, p(Y \mid x)$, where the noise transition matrix is a bridge between clean and noisy information. As the noisy class posterior can be estimated by exploiting the noisy training data, the key step remains how to effectively estimate the transition matrix and leverage the estimated matrix to combat label noise. Based on this observation, we have three general ways, as described in Sections 4.2-4.4.
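As a concrete illustration (ours, under the definitions above), the two noise structures and the corresponding label corruption can be simulated in a few lines:

```python
import numpy as np

def sym_flip_T(n, tau):
    """Sym-flipping: 1 - tau on the diagonal; tau spread uniformly over
    the remaining n - 1 classes in each row."""
    T = np.full((n, n), tau / (n - 1))
    np.fill_diagonal(T, 1.0 - tau)
    return T

def pair_flip_T(n, tau):
    """Pair-flipping: each class keeps mass 1 - tau and flips only to its
    adjacent class (the last class wraps around to the first)."""
    T = (1.0 - tau) * np.eye(n)
    for i in range(n):
        T[i, (i + 1) % n] = tau
    return T

def corrupt(clean_labels, T, seed=0):
    """Draw a noisy label for each clean label y from the row T[y]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in clean_labels])
```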
4.2 Adaptation Layer

Deep learning can be viewed as an end-to-end learning system. Therefore, the most intuitive way is to add an adaptation layer (Fig. 3) to estimate the transition matrix.

[Fig. 3. A general case of the adaptation layer: input features $x$ are fed into a base network $\omega$ with a softmax layer, followed by an adaptation layer $\theta$, trained against the noisy label $\bar{y}$ with the cross-entropy loss $\ell(\omega, \theta)$.]
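A minimal PyTorch sketch (ours) of the architecture in Fig. 3: a "noise" layer, kept row-stochastic by a per-row softmax, multiplies the base network's softmax output so that the composite model fits the noisy label distribution. The trace-style regularizer of [18], discussed below, can be added on top.

```python
import torch
import torch.nn as nn

class NoiseAdaptation(nn.Module):
    """Base network -> softmax -> adaptation layer T (cf. Fig. 3)."""

    def __init__(self, base_net, num_classes):
        super().__init__()
        self.base_net = base_net
        # initialize near the identity, i.e., close to "no label noise"
        self.T_logits = nn.Parameter(5.0 * torch.eye(num_classes))

    def forward(self, x):
        clean_probs = torch.softmax(self.base_net(x), dim=1)  # p(y | x)
        T = torch.softmax(self.T_logits, dim=1)               # rows sum to 1
        return clean_probs @ T        # p(y_bar | x) = T^T p(y | x)

# training against noisy labels, e.g.:
#   loss = F.nll_loss(torch.log(model(x) + 1e-12), noisy_y)
```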
To realize this adaptation layer, Sukhbaatar et al. proposed a constrained linear layer inserted between the base network and the cross-entropy loss layer [18]. This linear adaptation layer is parameterized by $T$, which plays the role of the noise transition matrix. Based on this idea, we can modify a classification model using a probability matrix $T$ that adjusts its prediction to match the label distribution of the noisy data.

The training model consists of two independent parts: the base model parameterized by $\omega$ and the noise model parameterized by $T$. Since the noise matrix $T$ has been modeled as a constrained linear layer, the update of $T$ can be easily finished by back-propagating the cross-entropy loss. However, it is hard to achieve the optimal $T$ by merely minimizing the cross-entropy loss, which is jointly parameterized by $\omega$ and $T$. To achieve the optimal $T$, Sukhbaatar et al. [18] leverage a regularizer on $T$, e.g., a trace norm or a ridge regression, which forces $T$ to diffuse. This work paves the way for deep learning with noisy labels, and directly motivated the following nonlinear case of the adaptation layer.

Following the linear case, Goldberger et al. [21] proposed a nonlinear layer inserted between the base network and the cross-entropy loss layer to realize the adaptation layer. Beyond the linear case, the training model consists of two independent parts: the base model parameterized by $\omega$ and the noise model/channel parameterized by $\theta$ (which plays the role of the noise transition matrix). Since the outputs of the base model are hidden, they proposed to leverage the EM algorithm to estimate the hidden outputs (E-step) and the current parameters (M-step). Unlike the linear case, the nonlinear case is free of strong assumptions.

However, there are several potential drawbacks to the EM-based approach, such as local optima and scalability. To address these issues, Goldberger et al. proposed two noise modeling variants: the c-model and the s-model. Specifically, the c-model predicts the noisy label based on both the latent true label and the input features, while the s-model predicts the noisy label only based on the latent true label. Since the EM algorithm is equivalent to the s-model in this setting, they regard both $\omega$ and $\theta$ as components of the same network and optimize them simultaneously. Moreover, the s-model is similar to the linear case proposed by Sukhbaatar et al. [18], although they proposed a different learning procedure. Note that the c-model depends on the input features, and they still leverage network training in the M-step to update $\theta$.

4.3 Loss Correction

Another important branch is to conduct loss correction by leveraging the noise transition matrix, which can also be integrated into the deep learning system. The aim of loss correction is that training on noisy labels via the corrected loss should be approximately equal to training on clean labels via the original loss. Note that we also introduce a label smoothing method related to loss correction.
Patrini et al. [22] introduced two loss correction techniques, namely forward correction and backward correction. At a high level, the backward procedure corrects the cross-entropy loss by the transition matrix $T^{-1}$, while the forward procedure corrects the network predictions by the transition matrix $T$. Both corrections enjoy formal theoretical guarantees w.r.t. the clean data distribution.

Theorem 1. (Backward Correction, Theorem 1 in [22]) Suppose that the label transition matrix $T$ is non-singular, where $T_{ij} = p(\bar{y} = e_j \mid y = e_i)$ given that the corrupted label $\bar{y} = e_j$ is flipped from the clean label $y = e_i$. Given loss $\ell$ and network parameters $w_f$, Backward Correction is defined as

$$\ell^{\leftarrow}(x, \bar{y}; w_f) = T^{-1} \ell(x, \bar{y}; w_f). \qquad (4)$$

Then, the corrected loss $\ell^{\leftarrow}(x, \bar{y}; w_f)$ is unbiased, namely,

$$\mathbb{E}_{\bar{y}\mid x}\, \ell^{\leftarrow}(x, \bar{y}; w_f) = \mathbb{E}_{y\mid x}\, \ell(x, y; w_f), \quad \forall x. \qquad (5)$$
Remark. Backward Correction operates on the loss directly, where the loss is a vector over all possible labels. It is unbiased: the LHS of (5) draws from corrupted labels, and the RHS of (5) draws from clean labels. Note that the corrected loss is differentiable, but not always non-negative [68].
Theorem 2. (Forward Correction, Theorem 1 in [22]) Suppose that the label transition matrix $T$ is non-singular, where $T_{ij} = p(\bar{y} = e_j \mid y = e_i)$ given that the corrupted label $\bar{y} = e_j$ is flipped from the clean label $y = e_i$. Given loss $\ell$ and network parameters $w_f$, Forward Correction is defined as

$$\ell^{\rightarrow}(x, \bar{y}; w_f) = \ell(T^{\top} w_f(x), \bar{y}), \qquad (6)$$

where $w_f(x)$ denotes the softmax output of the network. Then, the minimizer of the corrected loss under the noisy distribution is the same as the minimizer of the original loss under the clean distribution, namely,

$$\arg\min \mathbb{E}_{x, \bar{y}}\, \ell^{\rightarrow}(x, \bar{y}; w_f) = \arg\min \mathbb{E}_{x, y}\, \ell(x, y; w_f). \qquad (7)$$
Remark. Forward Correction operates on the network predictions, pushing them through $T$ before the loss is applied. The LHS of (7) draws from corrupted labels, and the RHS of (7) draws from clean labels. Note that this property is weaker than the unbiasedness in Theorem 1.
Normally, $T$ is unknown and needs to be estimated (i.e., $\widehat{T}$). Therefore, Patrini et al. proposed a robust two-stage training. First, they train the network with $\ell$ on noisy data, and obtain an estimate of $T$ via the output of the softmax. Then, they re-train the network with the loss corrected by $\widehat{T}$. Note that the estimation quality of $T$ directly decides the learning performance via the loss correction.

Based on Forward Correction, Hendrycks et al. proposed Gold Correction to handle severe noise [25]. Under severe noise, the transition matrix cannot be estimated accurately from purely noisy data. The key idea is to assume that a small subset of the training data is trusted. Normally, a large number of crowdsourced workers may produce an untrusted set $\widetilde{D}$, while a small number of experts can produce a trusted set $D$. At a high level, Hendrycks et al. aim to leverage $D$ to estimate the noise transition matrix accurately, and employ Forward Correction based on the estimated matrix. Then, they train deep neural networks on $\widetilde{D}$ via the corrected loss, while training on $D$ via the original loss. This method is called Gold Loss Correction (GLC).

Therefore, the key step in GLC is to estimate the noise transition matrix accurately. Mathematically, we can estimate the transition matrix $C$ by $\widehat{C}$ as follows:

$$\widehat{C}_{ij} = \frac{1}{|A_i|} \sum_{x \in A_i} \widehat{p}(\bar{Y} = e_j \mid Y = e_i, x), \qquad (8)$$

where $A_i$ denotes the trusted examples of the $i$-th class, and $\widehat{p}(\bar{y} \mid x)$ can be modeled by deep neural networks trained on $\widetilde{D}$. Empirically, a better estimate $\widehat{C}$ will lead to better GLC performance.

The technique of label smoothing smooths labels by mixing in a uniform label vector [70], which is a means of regularization. Lukasik et al. related label smoothing to a general family of loss-correction techniques [71], and demonstrated that label smoothing significantly improves performance under label noise. In general, both methods can be unified into a label smearing framework:

$$\ell^{SM}(f_\theta(X), Y) = M \cdot \ell(f_\theta(X), Y), \qquad (9)$$

where $M$ is the smearing matrix [72]. This matrix bridges the original loss $\ell$ and the smeared loss $\ell^{SM}$. In this framework, there are three cases:

• Standard training: $M = I$, where $I$ is the identity matrix.
• Label smoothing: $M = (1-\alpha) I + \alpha J/L$, where $J$ is the all-ones matrix.
• Backward correction: $M = \frac{1}{1-\alpha} \cdot (I - \alpha J/L)$, where $M = T^{-1}$ as in Theorem 1.

We can clearly see the close connection between label smoothing and backward correction. Actually, label smoothing can have a similar effect to shrinkage regularization [73]. A minimal sketch of the two corrections in Theorems 1 and 2 is given below.
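Below is a minimal PyTorch sketch (ours, not the authors' code) of the two corrections in Theorems 1-2, taking the softmax outputs and a known (or estimated) $T$:

```python
import torch
import torch.nn.functional as F

def backward_corrected(probs, noisy_labels, T):
    """Theorem 1: the loss vector over all labels is multiplied by T^{-1},
    then indexed by the observed noisy label."""
    loss_vec = -torch.log(probs + 1e-12)              # (n, C) losses per label
    corrected = loss_vec @ torch.inverse(T).T         # row i becomes T^{-1} l_vec(x_i)
    return corrected[torch.arange(len(noisy_labels)), noisy_labels].mean()

def forward_corrected(probs, noisy_labels, T):
    """Theorem 2: the clean prediction is pushed through T, and the usual
    cross-entropy is applied to the resulting noisy prediction."""
    noisy_probs = probs @ T                           # (T^T p(y|x))_j = sum_i p_i T_ij
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_labels)
```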
4.4 Prior Knowledge

As mentioned before, methods in this section solve the LNRL problem by estimating the noise transition matrix. However, accurate estimation can be hard in real-world scenarios, which motivates researchers to incorporate prior knowledge for better estimation.

Han et al. proposed a human-assisted approach called "Masking" [26], which decouples the structure and the values of the transition matrix. Specifically, the structure can be viewed as prior knowledge coming from human cognition, since humans can mask invalid class transitions (e.g., cat ↛ car). Given the structure information, we only need to estimate the noise transition probabilities along the structure in an end-to-end system. Therefore, the estimation burden is largely reduced. Indeed, it makes sense that human cognition masks invalid class transitions and highlights valid class transitions, such as column-diagonal, tri-diagonal and block-diagonal structures. The remaining issue is how to incorporate such a prior structure into an end-to-end learning system. The answer is a generative model.

In particular, there are three steps in this generative model. First, the latent ground-truth label $y \sim p(y \mid x)$, where $p(y \mid x)$ is a categorical distribution. Second, the noise transition matrix variable $s \sim p(s)$ (in Bayesian form) and its structure $s_o \sim p(s_o)$, where $p(s)$ is an implicit distribution modeled by neural networks without an exact form, $p(s_o) = p(s)\,\big|\tfrac{ds}{ds_o}\big|_{s_o = f(s)}$, and $f(\cdot)$ is the mapping function from $s$ to $s_o$. Third, the noisy label $\bar{y} \sim p(\bar{y} \mid y, s)$, where $p(\bar{y} \mid y, s)$ models the transition from $y$ to $\bar{y}$ given $s$. Based on the above generative process, we can derive the evidence lower bound (ELBO) [74] to approximate the log-likelihood of the noisy data.

Xia et al. [75] introduced a transition-revision method to effectively learn the transition matrix, which is called Reweight T-Revision (Reweight-R). Specifically, they first initialize the transition matrix by exploiting data points that are similar to anchor points, i.e., those having high noisy class posterior probabilities. Thus, the initialized transition matrix can be viewed as prior knowledge. Then, they fine-tune the initialized matrix by adding a slack variable, which can be learned and validated together with the classifier by using noisy data.

Specifically, given a noisy training sample $\bar{D}_{tr}$ and a noisy validation set $\bar{D}_v$, there are two stages in Reweight-R. In the first stage, they minimize the unweighted loss to learn $\hat{p}(\bar{Y} \mid X = x)$ without a noise adaptation layer. Then, they initialize the noise transition matrix $\widehat{T}$, which can be viewed as prior knowledge for further fine-tuning. Namely, they initialize $\widehat{T}$ according to Eq. (1) in [75] by using the data with the highest $\hat{p}(\bar{Y} = e_i \mid X = x)$ as anchor points for the $i$-th class. In the second stage, based on the prior $\widehat{T}$, they initialize the neural network by minimizing the weighted loss with a noise adaptation layer $\widehat{T}^{\top}$. Furthermore, they minimize the weighted loss to learn the classifier $f$ and an increment $\Delta T$ with a noise adaptation layer $(\widehat{T} + \Delta T)^{\top}$. Namely, the second stage modifies $\widehat{T}$ gradually by adding a slack variable $\Delta T$, and learns the classifier and $\Delta T$ by minimizing the weighted loss. The two stages alternate until convergence, i.e., minimum error on $\bar{D}_v$. A compact sketch of this second stage is given below.
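A compact sketch (ours, not the authors' implementation) of the second stage of Reweight-R: keep the prior $\widehat{T}$ fixed and learn only the slack $\Delta T$ jointly with the classifier. The clamping and row renormalization are our simplifications to keep $\widehat{T} + \Delta T$ a valid transition matrix.

```python
import torch
import torch.nn as nn

class TRevision(nn.Module):
    """Noise adaptation with a learnable slack: (T_hat + Delta_T)^T p(y|x)."""

    def __init__(self, base_net, T_hat):
        super().__init__()
        self.base_net = base_net
        self.register_buffer("T_hat", T_hat)             # fixed prior, e.g., from anchor points
        self.delta_T = nn.Parameter(torch.zeros_like(T_hat))

    def forward(self, x):
        clean_probs = torch.softmax(self.base_net(x), dim=1)
        T = (self.T_hat + self.delta_T).clamp(min=0.0)   # keep entries nonnegative
        T = T / T.sum(dim=1, keepdim=True)               # keep rows summing to one
        return clean_probs @ T                           # p(y_bar | x)
```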
Based on the above observations, we can see the typical ways to leverage the noise transition matrix for solving the LNRL problem. First, we can insert an adaptation layer into the original network, and this layer can mimic the function of the noise transition matrix. Second, we may keep the original network, but correct the cross-entropy loss via the estimated transition matrix. Lastly, since accurate matrix estimation leads to better classification accuracy, we can use prior knowledge to ease the estimation burden.

Note that there are other related works from the data perspective. For example, structured noise modeling demonstrates that the noise in human-centric annotations exhibits structure, which can be modeled by decoupling the human bias from the correct visually grounded label [76]; noisy fine-grained recognition shows the potential to train effective models of fine-grained recognition using noisy data from the web only [77]; distillation with side information builds a unified distillation framework that uses "side" information, including a small clean dataset and label relations in a knowledge graph, to combat noisy labels [78]; rank pruning addresses the fundamental problem of estimating the noise rates [79]; negative learning trains deep networks using complementary labels, which decreases the risk of providing incorrect information [80]; combinatorial inference reduces the noise level by simply constructing meta-classes and improves accuracy via combinatorial inference over multiple constituent classifiers [81]; robust GANs incorporate a noise transition model, which can learn a clean-label conditional generative distribution even when training labels are noisy [82]; noise-tolerant fairness can learn fair classifiers given noisy sensitive features using the mean-difference score [83]; and the latent class-conditional noise model formulates the noise transition in a Bayesian form, namely projecting the noise transition into a Dirichlet-distributed space [84].
5 OBJECTIVE
Methods in this section solve the LNRL problem by modifying the objective function, and such modification can be realized in three different ways. The structure of this section is arranged as follows. First, we can directly augment the objective function by either explicit or implicit regularization. Note that implicit regularization tends to operate at the algorithm level, which is equal to modifying the objective function. Second, we can assign dynamic weights to different objective sub-functions, where larger weights indicate greater importance of the corresponding sub-functions. Lastly, motivated by some interesting observations and tricks, we can directly redesign new loss functions.
Regularization is the most direct way to modify the objective function. Mathematically, we add a regularization term to the original objective, i.e., $\tilde{\ell} = \ell + r$. In label-noise learning, the aim of regularization is to achieve better generalization, which avoids or alleviates overfitting to noisy labels. There are two intuitive approaches, as follows.

Azadi et al. propose a novel regularizer $r \equiv \Omega_{aux}(w)$ to exploit the data structure for combating label noise [20], where $\Omega_{aux}(w) = \|Fw\|_g$. Note that $\|\cdot\|_g$ denotes the group norm, which induces a group sparsity that encourages most coefficients to be zero. This operation encourages a small number of clean data points to contribute to learning the model, while filtering out mislabeled and non-relevant data. In other words, this regularizer enforces that the features of the good data are used for model learning, while noisy additional activations are disregarded.

It is worth noting that Berthelot et al. introduce MixMatch [50] for semi-supervised learning (SSL), whose empirical performance reaches the state of the art. Most importantly, one of the key components in MixMatch is Minimum Entropy Regularization (MER), which belongs to explicit regularization in LNRL. Specifically, MER was proposed by Grandvalet & Bengio [85], and the key idea is to augment the cross-entropy loss with an explicit term encouraging the classifier to make predictions with high confidence on the unlabeled examples, namely minimizing the entropy of $p_{model}(y \mid x; \theta)$ for unlabeled data $x$. Similar to MER, the pseudo-label method conducts entropy minimization implicitly [86], generating hard labels from high-confidence predictions on unlabeled data for further training. In particular, the pseudo-label method (i.e., label guessing) first computes the average of the model's predicted class distributions across all augmentations, and then applies a temperature sharpening function to reduce the entropy of the label distribution.

Recently, there has been more and more implicit regularization, which achieves the effect of regularization without an explicit form like Section 5.1.1. For example,

• Bootstrapping [19]: Reed et al. augment the prediction objective with a notion of consistency [19]. At a high level, this gives the learner justification to "disagree" with a perceptually-inconsistent training label, and to effectively re-label the data. Namely, the learner bootstraps itself in this way, using a convex combination of the training labels and the current model's predictions to generate the training targets. Intuitively, as the learner improves over time, its predictions can be trusted more. Such bootstrapping can avoid modeling the noise distribution.

Specifically, Reed et al. propose two ways to realize bootstrapping: soft and hard bootstrapping (a code sketch of both appears after this list). The soft version uses the predicted class probabilities $q$ directly to generate regression targets for each batch:
$$\ell_{soft}(q, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1-\beta) q_k\right] \log(q_k), \quad (10)$$
where $t$ denotes the training labels. The soft version is equal to softmax regression with minimum entropy regularization, which encourages the model to have high confidence in predicting labels.

Meanwhile, the hard version modifies the targets using the MAP estimate of $q$ given $x$:
$$\ell_{hard}(q, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1-\beta) z_k\right] \log(q_k), \quad (11)$$
where $z_k := \mathbb{1}[k = \arg\max_i q_i]$ and $i \in \{1, \cdots, L\}$.
To solve the hard version via SGD, an EM-like method is employed. In the E-step, the approximate-truth confidence targets are estimated as a convex combination of training labels and model predictions. In the M-step, the model parameters are updated to better predict those generated targets.

• Mixup [27]: Motivated by Vicinal Risk Minimization (VRM), Zhang et al. introduce a data-agnostic regularization method called Mixup [27], which constructs virtual training examples $(\tilde{x}, \tilde{y})$ as follows:
$$\tilde{x} = \lambda x_i + (1-\lambda) x_j \quad \text{and} \quad \tilde{y} = \lambda y_i + (1-\lambda) y_j, \quad (12)$$
where $(x_i, y_i)$ and $(x_j, y_j)$ are two examples randomly drawn from the training data. Intuitively, Mixup conducts virtual data augmentation, which dilutes the noise effects and smooths the data manifold. This simple but effective idea can be used against not only noisy labels but also adversarial examples, since the smoothing happens in both features and labels according to (12).

• VAT [49]: Miyato et al. also explore smoothness for combating label noise [49]. They propose a virtual adversarial loss, based on a new measure of local smoothness of the conditional label distribution given the input. Specifically, their method trains the output distribution to be isotropically smooth around each input data point by selectively smoothing the model in its most anisotropic direction. To realize such smoothness, they first design the virtual adversarial direction, which can most greatly deviate the currently inferred output distribution from the status quo without using label information. Based on this direction, they define local distributional smoothness (LDS), and then propose a training method called virtual adversarial training (VAT). Mathematically, LDS is defined as
$$\mathrm{LDS}(x_*, \theta) := D\big[p(y \mid x_*, \hat{\theta}),\; p(y \mid x_* + r_{vadv}, \theta)\big], \quad (13)$$
where $r_{vadv} := \arg\max_{\|r\| \le \epsilon} D[p(y \mid x_*, \hat{\theta}),\, p(y \mid x_* + r)]$, and $x_*$ represents either labeled or unlabeled features. We use $\hat{\theta}$ to denote the vector of model parameters at a specific iteration step of the training process. Then, the regularization term is
$$\mathcal{R}_{vadv}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta) := \frac{1}{N_l + N_{ul}} \sum_{x_* \in \mathcal{D}_l, \mathcal{D}_{ul}} \mathrm{LDS}(x_*, \theta).$$
Therefore, the full objective function of virtual adversarial training (VAT) is given by $\ell(\mathcal{D}_l, \theta) + \alpha \mathcal{R}_{vadv}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta)$.

• SIGUA [39]: It is noted that, given data with noisy labels, over-parameterized deep networks gradually memorize the data, and fit everything in the end [46], [69], [87]. Although equipped with corrections for noisy labels, many learning methods in this area still suffer from overfitting due to undesired memorization. To relieve this issue, Han et al. propose stochastic integrated gradient underweighted ascent (SIGUA) [39]: in a mini-batch, SIGUA adopts gradient descent on good data as usual, and learning-rate-reduced gradient ascent on bad data; it is a versatile approach in which data goodness or badness is defined w.r.t. desired or undesired memorization given a base learning method. Technically, SIGUA is a specially designed regularization that pulls optimization back for the sake of generalization when the two goals conflict with each other. A key difference between SIGUA and parameter shrinkage such as weight decay is that SIGUA pulls optimization back on some data, but parameter shrinkage does the same on all
data; philosophically, SIGUA shows that forgetting undesired memorization can reinforce desired memorization, which provides a novel viewpoint on the inductive bias of neural networks.
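As promised in the bootstrapping item above, here is a minimal PyTorch sketch of the soft and hard bootstrapping losses in Eqs. (10)–(11). It is our own illustration under the stated equations; the function names and the default $\beta$ values (commonly reported choices) are assumptions, not code from [19].

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, targets, beta=0.95):
    """Eq. (10): cross-entropy against the target beta*t + (1-beta)*q."""
    q = F.softmax(logits, dim=1)                      # predicted class probabilities
    t = F.one_hot(targets, logits.size(1)).float()    # one-hot training labels
    mixed = beta * t + (1.0 - beta) * q               # convex combination of targets
    return -(mixed * torch.log(q.clamp_min(1e-8))).sum(dim=1).mean()

def hard_bootstrap_loss(logits, targets, beta=0.8):
    """Eq. (11): replace q with its MAP estimate z (one-hot argmax)."""
    q = F.softmax(logits, dim=1)
    t = F.one_hot(targets, logits.size(1)).float()
    z = F.one_hot(q.argmax(dim=1), logits.size(1)).float()
    mixed = beta * t + (1.0 - beta) * z
    return -(mixed * torch.log(q.clamp_min(1e-8))).sum(dim=1).mean()

# toy usage
logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
loss = soft_bootstrap_loss(logits, labels)
loss.backward()
```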
Instead of augmenting the objective via explicit or implicit regularization, there is another typical solution to modify the objective function, termed Reweighting. At a high level, Reweighting assigns different weights to different sub-objective functions. The weights should be learned, and larger weights are biased towards sub-objectives that can better overcome label noise. The procedure can be formulated as $\tilde{\ell} = \sum_i w_i \ell_i$. Here, we introduce several Reweighting approaches, as follows.

Liu and Tao [17] introduce importance reweighting [88] from domain adaptation to label-noise learning, by treating the noisy training data as the source domain and the clean test data as the target domain. The idea is to rewrite the risk w.r.t. the clean data by exploiting the noisy data. Specifically,
$$
\begin{aligned}
R(f) &= \mathbb{E}_{(X,Y)\sim D}\left[\ell(f(X), Y)\right] \\
&= \int_x \sum_i p_D(X=x, Y=i)\, \ell(f(x), i)\, dx \\
&= \int_x \sum_i p_{\bar{D}}(X=x, \bar{Y}=i)\, \frac{p_D(X=x, \bar{Y}=i)}{p_{\bar{D}}(X=x, \bar{Y}=i)}\, \ell(f(x), i)\, dx \\
&= \int_x \sum_i p_{\bar{D}}(X=x, \bar{Y}=i)\, \frac{p_D(\bar{Y}=i \mid X=x)}{p_{\bar{D}}(\bar{Y}=i \mid X=x)}\, \ell(f(x), i)\, dx \\
&= \mathbb{E}_{(X,\bar{Y})\sim \bar{D}}\left[\beta(X, \bar{Y})\, \ell(f(X), \bar{Y})\right],
\end{aligned}
$$
where the second-to-last equation holds because label noise is assumed to be independent of instances, and $\beta(X=x, \bar{Y}=i) = p_D(\bar{Y}=i \mid X=x)/p_{\bar{D}}(\bar{Y}=i \mid X=x)$ denotes the weights, which play a core part in importance reweighting and can be learned by either exploiting the transition matrix or a small set of clean data.

Wang et al. [23] propose reweighted probabilistic models (RPM) to combat label noise. The idea is simple and intuitive: down-weight corrupted labels but up-weight clean labels, which brings us Bayesian data reweighting. The mathematical formulation includes three steps:
1) Define a probabilistic model $p_\beta(\beta) \prod_{n=1}^{N} \ell(y_n \mid \beta)$.
2) Assign a positive latent weight $w_n$ to each likelihood term, and choose a prior on the weights $p_w(w)$, where $w = (w_1, \cdots, w_N)$. Thus, the RPM can be represented by
$$p(y, \beta, w) = \frac{1}{Z}\, p_\beta(\beta)\, p_w(w) \prod_{n=1}^{N} \ell(y_n \mid \beta)^{w_n},$$
where $Z$ is the normalizing factor.
3) Infer the posterior of both the latent variables $\beta$ and the weights $w$, i.e., $p(\beta, w \mid y)$. The prior knowledge on the weights $p_w(w)$ trades off extremely low likelihood terms, where the options for $p_w(w)$ are a bank of Beta priors, a scaled Dirichlet prior and a bank of Gamma priors. Note that RPMs treat the weights $w$ as latent variables, which are automatically inferred.

Arazo et al. introduce a two-component (clean-noisy) beta mixture model (BMM) for a mixture of clean and noisy data [30], which brings us a dynamically-weighted bootstrapping loss. Specifically, the posterior probabilities under the BMM are leveraged to implement a dynamically-weighted bootstrapping loss, robustly dealing with noisy samples without discarding them. Mathematically, the probability density function of a mixture model of $K$ components on the loss $\ell$ is defined as
$$p(\ell) = \sum_{k=1}^{K} \lambda_k\, p(\ell \mid k), \quad (14)$$
where $\lambda_k$ are dynamic weights, and $p(\ell \mid k)$ can be modeled by a beta distribution:
$$p(\ell \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \ell^{\alpha-1} (1-\ell)^{\beta-1}. \quad (15)$$
The above BMM can be solved by an Expectation Maximization (EM) procedure (a sketch of this EM procedure is given below). Specifically, they introduce latent variables $\gamma_k(\ell) = p(k \mid \ell)$. In the E-step, they fix $\lambda_k, \alpha_k, \beta_k$ and update $\gamma_k(\ell)$ via Bayes rule. In the M-step, given fixed $\gamma_k(\ell)$, they estimate $\alpha_k$ and $\beta_k$ using the weighted moments. Meanwhile, the dynamic weights are updated in an easy way: $\lambda_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(\ell_i)$. Based on this BMM, they further propose dynamic hard/soft bootstrapping losses, where the weight $w_i$ of each sample is dynamically set to $p(k = 1 \mid \ell_i)$.
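The EM procedure above can be sketched in a few lines of NumPy/SciPy. This is our own illustration of the described steps (not code from [30]): the function name `fit_bmm`, the initial shape parameters, and the weighted method-of-moments update are assumptions; losses are assumed pre-normalized to $[0, 1]$.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def fit_bmm(losses, n_iter=30, eps=1e-4):
    """Fit a two-component beta mixture to per-sample losses in [0, 1] via EM.

    Returns mixing weights lam and per-component (a, b) shape parameters;
    the component with the smaller mean models the clean samples.
    """
    x = np.clip(losses, eps, 1 - eps)                   # keep the beta pdf finite
    lam = np.array([0.5, 0.5])                          # initial mixing weights
    a, b = np.array([2.0, 4.0]), np.array([4.0, 2.0])   # initial shapes
    for _ in range(n_iter):
        # E-step: responsibilities gamma_k(l) via Bayes rule
        pdf = np.stack([beta_dist.pdf(x, a[k], b[k]) for k in range(2)])
        gamma = lam[:, None] * pdf
        gamma /= gamma.sum(axis=0, keepdims=True)
        # M-step: weighted method-of-moments for (a_k, b_k), then lam_k
        for k in range(2):
            w = gamma[k]
            mu = np.average(x, weights=w)
            var = np.average((x - mu) ** 2, weights=w)
            common = max(mu * (1 - mu) / max(var, eps) - 1, eps)
            a[k], b[k] = mu * common, (1 - mu) * common
        lam = gamma.mean(axis=1)                        # lam_k = (1/N) sum_i gamma_k(l_i)
    return lam, a, b

# toy usage: small losses ~ clean, large losses ~ noisy
losses = np.concatenate([np.random.beta(2, 8, 800), np.random.beta(8, 2, 200)])
lam, a, b = fit_bmm(losses)
```

The per-sample clean probability is then the posterior responsibility of the small-mean component, which plays the role of $w_i$ in the dynamic bootstrapping losses.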
Shu et al. introduce Meta-Weight-Net (MW-Net) [34], which can adaptively learn an explicit weighting function from data. At a high level, the weighting function is an MLP with one hidden layer, mapping from loss to weight. Note that MW-Net is theoretically a universal approximator for any continuous function. Mathematically, the optimal classifier parameters $w$ are calculated by minimizing the weighted loss:
$$w^*(\theta) = \arg\min_{w} \ell^{tr}(w; \theta), \quad \ell^{tr}(w; \theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}\big(\ell_i^{tr}(w); \theta\big)\, \ell_i^{tr}(w),$$
where $\mathcal{V}(\cdot; \theta)$ denotes MW-Net, namely an MLP with one hidden layer. Here, the parameters of MW-Net are optimized by the meta-learning idea. Given a small meta-data set $\{x_i^{(meta)}, y_i^{(meta)}\}_{i=1}^{M}$, the optimal parameter $\theta$ is obtained by minimizing the meta-loss:
$$\theta^* = \arg\min_{\theta} \ell^{meta}(w^*(\theta)) = \frac{1}{M} \sum_{i=1}^{M} \ell_i^{meta}(w^*(\theta)).$$
They employ SGD to update $w$ (parameters of the classifier network) and $\theta$ (parameters of MW-Net) iteratively.

Besides augmenting and reweighting the objective function, there is another common solution called redesigning. At a high level, redesigning, which generally replaces $\tilde{\ell}$ with a special form $\ell'$ independent of $\ell$, is motivated by different observations and principles. Thus, these methods differ greatly across scenarios. Here, we introduce several redesigning approaches, as follows.
In recent years, many new losses have been proposed for combating label noise. Their designs are based on different principles, such as gradient clipping [35] and curriculum learning [89]. Here, we choose several representative losses to explain.

• Zhang et al. propose a generalized cross entropy loss called $\ell_q$ [28], which encompasses both the mean absolute error (MAE) and the categorical cross entropy (CCE) loss (a code sketch appears at the end of this section). The $\ell_q$ loss is well justified by theoretical analysis. More importantly, it empirically works well for both closed-set and open-set noise. Mathematically, they use the negative Box-Cox transformation as the $\ell_q$ loss function:
$$\ell_q(f(x), e_j) = \frac{1 - f_j(x)^q}{q},$$
where $q \in (0, 1]$. The proposed loss recovers CCE in the limit $q \to 0$, and becomes MAE when $q = 1$. Since a tighter bound brings stronger noise tolerance, they further propose the truncated $\ell_q$ loss:
$$\ell_{trunc}(f(x), e_j) = \begin{cases} \ell_q(k) & \text{if } f_j(x) \le k, \\ \ell_q(f(x), e_j) & \text{otherwise}, \end{cases} \quad (16)$$
where $0 < k < 1$ and $\ell_q(k) = (1 - k^q)/q$. When $k \to 0$, the truncated $\ell_q$ loss equals the original $\ell_q$ loss.

• Charoenphakdee et al. [32] theoretically justify the advantages of symmetric losses via theoretical tools, including the classification-calibration condition, excess risk bounds, the conditional risk minimizer and AUC-consistency. The key idea is to design a loss that does not have to satisfy the symmetric condition, i.e., that $\ell(z) + \ell(-z)$ is a constant for every $z \in \mathbb{R}$, everywhere. Motivated by this, they introduce a barrier hinge loss, which satisfies a symmetric condition not everywhere but gives a large penalty once $z$ is outside a given interval. Mathematically, the barrier hinge loss is defined as
$$\ell(z) = \max\big(-b(r+z) + r,\; \max(b(z-r),\; r-z)\big), \quad (17)$$
where $b > 1$ and $r > 0$.

• Thulasidasan et al. propose to abstain on some confusing examples while training deep networks [33]. Philosophically speaking, abstention has some relation to small-loss tricks, which implicitly abstain on big-loss examples during training. In practice, abstention has some relation to loss redesign. Based on abstention, they introduce a deep abstaining classifier (DAC). The DAC has an additional output $p_{k+1}$, which indicates the probability of abstention. The loss of the DAC is
$$\ell(x_j) = -\tilde{p}_{k+1} \sum_{i=1}^{k} t_i \log\frac{p_i}{\tilde{p}_{k+1}} - \alpha \log \tilde{p}_{k+1}, \quad (18)$$
where $\tilde{p}_{k+1} = 1 - p_{k+1}$. If $\alpha$ is large, the penalty drives $p_{k+1}$ to zero, which means the model does not abstain; if $\alpha$ is small, the classifier may abstain on everything. They further propose an auto-tuning algorithm to tune the optimal $\alpha$. Note that the DAC can be used for both structured and unstructured label noise, where the DAC becomes a data cleaner.

• Menon et al. [35] leverage gradient clipping to design a new loss. Intuitively, clipping the gradient prevents over-confident descent steps in the presence of label noise. Motivated by gradient clipping, they propose the partially Huberized loss
$$\tilde{\ell}_\theta(x, y) = \begin{cases} -\tau\, p_\theta(x, y) + \log \tau + 1 & \text{if } p_\theta(x, y) \le \frac{1}{\tau}, \\ -\log p_\theta(x, y) & \text{otherwise}. \end{cases} \quad (19)$$

• Lyu et al. [38] propose the curriculum loss (CL), which is a tighter upper bound of the 0-1 loss. Moreover, CL can adaptively select samples for stagewise training. In particular, given any base loss function $\ell(u) \ge \mathbb{1}(u < 0)$, $u \in \mathbb{R}$, CL can be defined as follows.
$$Q(u) = \max\Big( \min_{v \in \{0,1\}^n} f_1(v),\; \min_{v \in \{0,1\}^n} f_2(v) \Big),$$
where
$$f_1(v) = \sum_{i=1}^{n} v_i\, \ell(u_i) \quad \text{and} \quad f_2(v) = n - \sum_{i=1}^{n} v_i + \sum_{i=1}^{n} \mathbb{1}(u_i < 0).$$
To adapt CL to deep learning models, they further introduce the noise-pruned CL.

• Laine & Aila [90] introduce self-ensembling, including the $\pi$-model and temporal ensembling. Specifically, the $\pi$-model encourages consistent network outputs between two realizations of the same input under two different dropout conditions. Beyond the $\pi$-model, temporal ensembling considers the network predictions over multiple previous training epochs. The loss function of the $\pi$-model is
$$\ell = -\frac{1}{|B|} \sum_{i \in B} \log z_i[y_i] + \frac{w(t)}{C\,|B|} \sum_{i \in B} \| z_i - \tilde{z}_i \|^2, \quad (20)$$
where the first term handles labeled data and the second term handles unlabeled data. Both $z_i$ and $\tilde{z}_i$ are transformed from the same input $x_i$. The second term is also scaled by the time-dependent weighting function $w(t)$. Temporal ensembling goes beyond the $\pi$-model by aggregating the predictions of multiple previous network evaluations into an ensemble prediction. Namely, the main difference to the $\pi$-model is that the network and augmentations are evaluated only once per input per epoch, and the target $\tilde{z}$ is based on prior network evaluations. After every training epoch, the network outputs $z$ are accumulated into ensemble outputs $Z$ by updating $Z \leftarrow \alpha Z + (1-\alpha) z$ and $\tilde{z} \leftarrow Z/(1 - \alpha^t)$.

• Nguyen et al. [36] propose a self-ensemble label filtering (SELF) method to progressively filter out the wrong labels during training. At a high level, they leverage the knowledge provided by the network's outputs over different training iterations to form a consensus of predictions, which progressively identifies and filters out the noisy labels from the labeled data. In the filtering strategy, the model determines the set of potentially correct labels $L_i$ based on the agreement between the label $y$ and its maximum-likelihood prediction $\hat{y}_x$, with $L_i = \{(y, x) \mid \hat{y}_x = y,\ \forall (y, x) \in L\}$, where $L$ is the label set at the beginning. In the self-ensemble strategy, they maintain a two-level ensemble. First, they leverage a model ensemble via Mean Teacher, namely an exponential running average of model snapshots. Second, they employ a prediction ensemble by collecting the sample predictions over multiple training epochs: $\bar{z}_j = \alpha \bar{z}_{j-1} + (1-\alpha)\hat{z}_j$, where $\bar{z}_j$ denotes the moving-average prediction of a sample at epoch $j$ and $\hat{z}_j$ is the model prediction for that sample in epoch $j$.

• Ma et al. [29] investigate the dimensionality of the deep representation subspace of training samples. They then develop a dimensionality-driven learning strategy, which monitors the dimensionality of subspaces during training and adapts the loss function accordingly. At a high level, they leverage the local intrinsic dimensionality (LID) to discriminate clean labels from noisy labels while training deep networks. They find two stages of learning in this scenario: an early stage of dimensionality compression and a stage of dimensionality expansion, which corresponds implicitly to memorization effects. Based on these two stages, they propose Dimensionality-Driven Learning (D2L), which avoids the dimensionality expansion stage of learning by adapting the loss function. Mathematically, the estimate of LID can be defined as
$$\widehat{\mathrm{LID}}(x) = -\left( \frac{1}{k} \sum_{i=1}^{k} \log \frac{r_i(x)}{r_{\max}(x)} \right)^{-1}, \quad (21)$$
where $r_i(x)$ denotes the distance between $x$ and its $i$-th nearest neighbor, and $r_{\max}(x)$ denotes the maximum neighbor distance. The LID score can be updated via batch sampling for computational efficiency. Based on the LID score, they can oversee the dynamics of deep networks.

Specifically, when learning with clean labels, the LID score consistently decreases and the test accuracy increases as training epochs increase. However, when learning with noisy labels, the LID score first decreases and then increases after a few epochs. In contrast, the test accuracy is totally opposite: it first increases and then decreases. This corresponds to the memorization effects of deep networks (Section 5.1). Based on the observation of LID, they propose the dimensionality-driven learning strategy. Intuitively, a larger LID indicates dimensionality expansion, and vice versa. To avoid dimensionality expansion, they propose the adaptive LID-corrected labels:
$$y^* = \alpha_i y + (1 - \alpha_i)\hat{y}, \quad (22)$$
where $\alpha_i = \exp\big(-\lambda\, \widehat{\mathrm{LID}}_i / \min_{j=0}^{i-1} \widehat{\mathrm{LID}}_j\big)$ is an LID-based factor. Based on $y^*$, they can robustly train deep networks.

Based on the above observations, we can see that modifying the objective function is another mainstream approach to the LNRL problem. First, we can augment the objective via either an explicit regularizer, e.g., Minimum Entropy Regularization [85], or an implicit regularizer, e.g., Virtual Adversarial Training [49]. Second, instead of treating all sub-objective functions equally, we can leverage a reweighting strategy to assign different weights to sub-objective functions: the larger the weight, the more important the corresponding sub-objective. We can realize the reweighting strategy in different ways, e.g., importance reweighting, Bayesian methods, mixture models and neural networks. Lastly, we can modify the objective function by redesigning the loss function, e.g., $\ell_q$, the barrier hinge loss, the partially Huberized loss and the curriculum loss. Moreover, we can conduct label ensembling, e.g., temporal ensembling and self-ensemble filtering.
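To make the loss-redesign direction concrete, here is a minimal PyTorch sketch of the $\ell_q$ and truncated $\ell_q$ losses of [28] and Eq. (16). The function names are our own; the defaults $q = 0.7$ and $k = 0.5$ are values reported in the original paper's experiments, stated here as assumptions.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross entropy (l_q) loss: (1 - f_y(x)^q) / q.

    q -> 0 recovers cross entropy; q = 1 gives MAE.
    """
    probs = F.softmax(logits, dim=1)
    f_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # f_y(x) for the given label
    return ((1.0 - f_y.clamp_min(1e-8) ** q) / q).mean()

def truncated_gce_loss(logits, targets, q=0.7, k=0.5):
    """Truncated l_q loss of Eq. (16): constant l_q(k) when f_y(x) <= k."""
    probs = F.softmax(logits, dim=1)
    f_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    lq = (1.0 - f_y.clamp_min(1e-8) ** q) / q
    lq_k = (1.0 - k ** q) / q                               # constant l_q(k)
    return torch.where(f_y <= k, torch.full_like(lq, lq_k), lq).mean()

# toy usage
logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
gce_loss(logits, labels).backward()
```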
Note that there are other related works from the objective perspective. For instance, online crowdsourcing greatly reduces the amount of redundant annotations when crowdsourcing annotations such as bounding boxes, parts, and class labels [91]; an undirected graphical model represents the relationship between noisy and clean labels, where the inference over latent clean labels is tractable and regularized using auxiliary information [92]; the active-bias method trains more robust deep networks by emphasizing high-variance samples [93]; model-bootstrapped EM jointly models labels and worker quality from noisy crowdsourced data [94]; a joint optimization framework corrects labels during training by alternating updates of network parameters and labels [95]; an iterative learning framework trains deep networks with open-set noisy labels [96]; deep bilevel learning is based on the principles of cross-validation, where a validation set is used to limit model over-fitting [97]; symmetric cross entropy (CE) boosts CE symmetrically with a noise-robust counterpart, Reverse Cross Entropy (RCE) [98]; the ubiquitous reweighting network learns a robust model from large-scale noisy web data by considering five key challenges in image classification [99]; the information-theoretic loss is a generalized version of mutual information, which is provably robust to instance-independent label noise [100]; peer loss enables learning from noisy labels without requiring a priori specification of the noise rates [101]; and normalized loss theoretically demonstrates that a simple normalization can make any loss function robust to noisy labels [102].

OPTIMIZATION POLICY
Methods in this section solve the LNRL problem by changing the optimization policy, such as early stopping. The effectiveness of early stopping is due to the memorization effects of deep neural networks, which help avoid overfitting noisy labels to some degree. To combat noisy labels using memorization effects, there exists another, better way, namely small-loss tricks. The structure of this section is arranged as follows. First, we explain what memorization effects are and why this phenomenon is important. Then, we introduce several common ways to leverage such memorization effects for combating label noise. The first common way is to robustly self-train a single network via small-loss tricks, which brings us MentorNet and Learning to Reweight. The second common way is to robustly co-train two networks via small-loss tricks, which brings us Co-teaching and Co-teaching+. Lastly, there are several ways to further improve the performance of Co-teaching, such as cross-validation, automated machine learning and the Gaussian mixture model.
Arpit et al. introduce a very influential work called "A closer look at memorization in deep networks" [46], which shapes a new direction towards solving LNRL. In general, memorization effects can be defined as the behavior exhibited by deep networks trained on noisy data: deep networks tend to memorize and fit easy (clean) patterns first, and gradually over-fit hard (noisy) patterns. Here, we empirically reproduce a simulated experiment to justify this hypothesis.
Fig. 4. A simulated experiment with different noise rates. We choose MNIST with uniform noise as the noisy data. Solid lines denote training accuracy; dotted lines denote validation accuracy.

In the experiment of Fig. 4, we use the MNIST dataset and add random noise to its labels at a range of noise rates. We train deep networks on the corrupted training data, and then test the trained networks on both the training data and the validation data. We can clearly see two phenomena:
• The training curves eventually reach or approximate 100% accuracy; all curves converge to the same point.
• The validation curves first reach near-100% accuracy in the first few epochs, but then drop gradually until convergence (after 40 epochs).
Under data corruption, since deep networks tend to memorize and fit the easy (clean) patterns in the corrupted data first, the validation curve first reaches a peak. However, such over-parameterized models gradually over-fit the hard (noisy) patterns, so the validation curve drops gradually, since the validation data is clean. This simple experiment not only justifies the hypothesis of memorization effects, but also opens a new door for the LNRL problem, namely small-loss tricks. Specifically, small-loss tricks mean that deep networks regard small-loss examples as "clean" examples, and only back-propagate such examples to update the model parameters (see the sketch below). Mathematically, small-loss tricks amount to constructing a restricted loss $\tilde{\ell} = \mathrm{sort}(\ell, 1-\tau)$: $\tilde{\ell}$ is constructed by sorting $\ell$ from small to large and fetching the $1-\tau$ proportion of small losses, where $\tau$ is the noise rate.
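The following minimal PyTorch sketch shows the small-loss trick applied to one mini-batch. It is our own illustration of the idea above; the function name and the toy data are hypothetical.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(logits, noisy_labels, noise_rate):
    """Keep the (1 - noise_rate) fraction of samples with the smallest loss.

    Only the selected (likely clean) samples contribute gradients;
    the remaining (likely noisy) samples are dropped from this batch.
    """
    losses = F.cross_entropy(logits, noisy_labels, reduction="none")
    num_keep = max(1, int((1.0 - noise_rate) * losses.numel()))
    _, idx = torch.topk(losses, num_keep, largest=False)  # smallest losses
    return losses[idx].mean()

# toy usage inside a training step
logits = torch.randn(32, 10, requires_grad=True)
labels = torch.randint(0, 10, (32,))
loss = small_loss_selection(logits, labels, noise_rate=0.4)
loss.backward()
```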
Fig. 5. Self-training (i.e., MentorNet, abbreviated as M-Net) vs. Co-training (i.e., Co-teaching and Co-teaching+).

Based on small-loss tricks, the seminal works leverage self-training to improve model robustness (left panel of Fig. 5). There are two such works, as follows.

Jiang et al. introduce MentorNet [6], which supervises the training of a base deep network termed StudentNet. The key focus of MentorNet is to provide a curriculum for StudentNet. Instead of being pre-defined, the curriculum is learned by MentorNet dynamically in a data-driven way. Mathematically, MentorNet $g_m$ can approximate a predefined curriculum by minimizing
$$\arg\min_{\theta} \sum_{(x_i, y_i) \in \mathcal{D}} g_m(z_i; \theta)\, \ell_i + G\big(g_m(z_i; \theta); \lambda_1, \lambda_2\big).$$
For the above objective, there is a closed-form solution:
$$g_m(z_i; \theta) = \begin{cases} \mathbb{1}(\ell_i \le \lambda_1) & \text{if } \lambda_2 = 0, \\ \min\!\big(\max\!\big(0,\, 1 - \tfrac{\ell_i - \lambda_1}{\lambda_2}\big),\, 1\big) & \text{if } \lambda_2 \neq 0. \end{cases}$$
Intuitively, when $\lambda_2 = 0$, MentorNet only selects small-loss samples, where $\ell_i \le \lambda_1$. When $\lambda_2 \neq 0$, MentorNet will not select big-loss samples; namely, samples whose loss is larger than $(\lambda_1 + \lambda_2)$ will not be selected during training (a code sketch of this weighting appears below). Meanwhile, MentorNet can also discover new curricula directly from data, which is unrelated to small-loss tricks.

Ren et al. [24] employ a meta-learning framework to assign different weights to training examples based on their gradient directions. In general, the small-loss examples are assigned larger weights, since small-loss examples are more likely to be clean. In general, Ren et al. believe that the best example weighting should minimize the loss on a set of unbiased clean validation examples. Namely, they perform validation at every training iteration to dynamically determine the example weights of the current batch. Mathematically, they hope to learn a reweighting of the inputs by minimizing a weighted loss:
$$\theta^*(w) = \arg\min_{\theta} \sum_{i=1}^{N} w_i f_i(\theta), \quad (23)$$
where $w$ is chosen based on its validation performance:
$$w^* = \arg\min_{w} \frac{1}{M} \sum_{i=1}^{M} f_i^{v}\big(\theta^*(w)\big). \quad (24)$$
To realize "Learning to Reweight", there are three technical steps. First, they forward and backward the noisy training examples through the training loss, which updates the model parameters $\theta$ and computes $\nabla\theta$. Second, the increment $\nabla\theta$ affects the validation networks, where they forward and backward the clean validation examples through the validation loss. Lastly, the training networks leverage meta-learning to update the example weights $w$ via backward-on-backward. Note that the same strategy can also be used for class-imbalance problems, where big-loss tricks are preferred, since big-loss examples are more likely to belong to the minority class.

Although self-training works well, in the long run the error will definitely accumulate, which motivates us to explore Co-training based methods (right panel of Fig. 5).
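As a recap of the closed-form curriculum above, here is a tiny NumPy sketch of the MentorNet weighting. It is our own illustration; the function name and the sample values are hypothetical.

```python
import numpy as np

def mentornet_weight(loss, lam1, lam2):
    """Closed-form curriculum weight for per-sample loss values.

    lam2 == 0: hard threshold (keep iff loss <= lam1).
    lam2 != 0: linear ramp from 1 down to 0 on [lam1, lam1 + lam2];
               losses above lam1 + lam2 receive zero weight.
    """
    loss = np.asarray(loss, dtype=float)
    if lam2 == 0:
        return (loss <= lam1).astype(float)
    return np.clip(1.0 - (loss - lam1) / lam2, 0.0, 1.0)

# toy usage: losses above lam1 + lam2 = 1.5 receive zero weight
print(mentornet_weight([0.2, 1.0, 2.0], lam1=0.5, lam2=1.0))  # [1.0, 0.5, 0.0]
```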
Han et al. propose a new deep learning paradigm called "Co-teaching" [7]. In general, instead of training a single deep network, they train two deep networks simultaneously and let them teach each other on every mini-batch. Specifically, each network feeds forward all data and selects some data with possibly clean labels; then, the two networks communicate with each other about which data in this mini-batch should be used for training; lastly, each network back-propagates the data selected by its peer network and updates itself. The selection criterion is still the small-loss trick.

In MentorNet [6], the error from one network is directly transferred back to itself in the next mini-batch of data, so the error accumulates increasingly. However, in Co-teaching, since the two networks have different learning abilities, they can filter different types of errors introduced by noisy labels. Namely, in this exchange procedure, the error flows can be mutually reduced by the peer networks. However, as the training epochs increase, the two networks converge to a consensus, and Co-teaching functionally reduces to the self-training MentorNet. Note that the principle of ensemble learning is to keep different classifiers diverged.

Yu et al. [103] introduce the "update by disagreement" strategy [104] to keep Co-teaching diverged, and name their method Co-teaching+. In general, Co-teaching+ consists of a disagreement-update step and a cross-update step. In the disagreement-update step, the two networks first feed forward and predict all data, and keep only the data with prediction disagreement. This step indeed keeps the two networks diverged. The cross-update step has been explored in Co-teaching. Note that both Co-teaching and Co-teaching+ share the same drop rate for big-loss examples, which is hand-designed. From both methods, we summarize three key factors in this line of research: (1) using the small-loss trick; (2) cross-updating the parameters of the two networks; and (3) keeping the two networks diverged. A sketch of a Co-teaching-style update appears below.
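The following simplified PyTorch sketch shows one Co-teaching mini-batch update. It is our own illustration, not the authors' implementation; the fixed `keep_ratio` stands in for the hand-designed drop-rate schedule mentioned above, and the toy networks are hypothetical.

```python
import torch
import torch.nn.functional as F

def coteaching_step(net1, net2, opt1, opt2, x, y, keep_ratio):
    """One Co-teaching mini-batch update (simplified sketch).

    Each network selects its small-loss subset, and its *peer*
    network is updated on that subset (cross-update).
    """
    num_keep = max(1, int(keep_ratio * x.size(0)))
    with torch.no_grad():  # small-loss selection needs no gradients
        loss1 = F.cross_entropy(net1(x), y, reduction="none")
        loss2 = F.cross_entropy(net2(x), y, reduction="none")
    idx1 = torch.topk(loss1, num_keep, largest=False).indices  # net1's clean picks
    idx2 = torch.topk(loss2, num_keep, largest=False).indices  # net2's clean picks

    opt1.zero_grad()  # net1 learns from net2's selection
    F.cross_entropy(net1(x[idx2]), y[idx2]).backward()
    opt1.step()

    opt2.zero_grad()  # net2 learns from net1's selection
    F.cross_entropy(net2(x[idx1]), y[idx1]).backward()
    opt2.step()

# toy usage with two small classifiers
net1, net2 = torch.nn.Linear(20, 5), torch.nn.Linear(20, 5)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(net2.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
coteaching_step(net1, net2, opt1, opt2, x, y, keep_ratio=0.7)
```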
After 2018, several important works have further improved Co-teaching. Here, we specify representative works based on Co-teaching and beyond.

• Chen et al. use cross-validation to randomly split noisy datasets, which identifies most samples that have correct labels [105]. In general, Chen et al. design the Iterative Noisy Cross-Validation (INCV) method to select a subset of samples, which has a much smaller noise ratio than the original dataset. Then, they leverage Co-teaching for further training over the selected subset. Apart from selecting clean samples, INCV removes samples that have large loss at each iteration.

• Yao et al. [87] use automated machine learning (AutoML) [106], [107] to explore the memorization effect and thus improve Co-teaching. It is noted that both Co-teaching and Co-teaching+ share the same drop rate for big-loss examples, which is hand-designed. However, this rate is critical in training deep networks. Specifically, Yao et al. [87] design a domain-specific search space based on this rate and propose a novel Newton algorithm to solve the resulting bi-level optimization problem efficiently. To explore the optimal rate $R(\cdot)$, they formulate the problem as
$$R^* = \arg\min_{R(\cdot) \in \mathcal{F}} \mathcal{L}_{val}\big(f(w^*; R), \mathcal{D}_{val}\big), \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{tr}\big(f(w; R), \mathcal{D}_{tr}\big),$$
where $\mathcal{F}$ is the search space of $R(\cdot)$, exploring the general pattern of the memorization effect. Such a prior on $\mathcal{F}$ not only allows efficient bi-level optimization but also boosts the final learning performance.

• Motivated by MixMatch [50], Li et al. [37] propose a novel framework termed DivideMix, which leverages semi-supervised learning techniques. At a high level, DivideMix models the per-sample loss distribution with a mixture model, which dynamically divides the training data into two parts: the first part includes labeled data with clean labels, while the second part includes unlabeled data with noisy labels. During the semi-supervised learning phase, they leverage variants of co-training, such as co-refinement on labeled data and co-guessing on unlabeled data. Specifically, Li et al. [37] use a Gaussian Mixture Model (GMM) to better distinguish clean and noisy samples, due to its flexibility in the sharpness of the distribution. They fit a two-component GMM to the losses $\ell$ using the EM algorithm. For each sample, its clean probability $w_i$ is the posterior probability $p(g \mid \ell_i)$, where $g$ is the Gaussian component with the smaller mean (a code sketch of such a GMM-based split is given below). Based on the GMM, they propose co-divide, where the GMM for one network is used to divide the training data for its peer network. The dividing criterion sets a threshold $\tau$ on $w_i$: samples whose $w_i$ is larger than $\tau$ are regarded as clean. Once the data is divided into labeled and unlabeled sets, they conduct co-refinement on the labeled data, which linearly combines the ground-truth label with the network's prediction and sharpens the refined label. Then they conduct co-guessing on the unlabeled data, which averages the predictions from both networks. After co-refinement and co-guessing, they follow the MixMatch routine to mix the data and update the model parameters.
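In the spirit of DivideMix's co-divide, the following scikit-learn sketch fits a two-component GMM to per-sample losses and returns the clean-component posterior. It is our own illustration; the function name, the `reg_covar` value and the toy losses are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_clean_probability(losses):
    """Fit a two-component GMM to per-sample losses and return the
    posterior probability of the small-mean (clean) component."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, reg_covar=1e-4).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))  # smaller mean = clean
    return gmm.predict_proba(losses)[:, clean_comp]

# toy usage: a threshold tau decides the labeled (clean) split
losses = np.concatenate([np.random.normal(0.2, 0.05, 800),
                         np.random.normal(1.5, 0.3, 200)])
w = gmm_clean_probability(losses)
is_clean = w > 0.5
```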
This section focuses on solving LNRL by leveraging over-parameterized deep models, especially their memorization effects. However, besides memorization, there are two new branches based on deep models, as follows.

In many CV and NLP applications, the pre-training paradigm has become commonplace, especially when data is scarce in the target domain. Hendrycks et al. [31] demonstrate that pre-training can improve model robustness and uncertainty, including adversarial robustness and robustness to label corruption. Normally, pre-training occurs on a bigger dataset first, followed by fine-tuning the pre-trained model on a smaller target dataset. For example, if we design an LNRL method for image classification with label noise, we can first pre-train a model on ImageNet via the LNRL method, and then fine-tune the pre-trained model on the target dataset via the LNRL method. Note that the pre-training approach has been demonstrated on many robustness and uncertainty tasks, including label noise, adversarial examples, class imbalance, out-of-distribution detection and calibration.
Bahri et al. propose the deep k-NN method [108], which executes on an intermediate layer of a preliminary deep model to filter mislabeled training data. At a high level, deep k-NN filtering consists of two steps. In the first step, they train a model $M$ with architecture $A$ (e.g., a 2-layer DNN with 20 hidden nodes) and filter the noisy data $D_{noisy}$ via the k-NN algorithm, which identifies and removes examples whose labels disagree with those of their neighbors. After filtering $D_{noisy}$, in the second step, they re-train a final model with architecture $A$ on $D_{filtered} \cup D_{clean}$ (a sketch of this filter is given at the end of this section).

Based on the above observations, we can see that leveraging memorization effects is an emerging mainstream approach to the LNRL problem. First, we can combine self-training with memorization effects, which brings us self-paced MentorNet and learning to reweight. Second, we can combine co-training with memorization effects, which brings us Co-teaching, Co-teaching+, INCV Co-teaching and S2E. Lastly, we can combine co-training with the state-of-the-art semi-supervised learning method MixMatch, which yields DivideMix. Meanwhile, besides memorization, pre-training and deep k-NN are new branches using over-parameterized models.

Note that there are other related works from the optimization policy perspective, namely changing the training dynamics. For example, a multi-task network jointly learns to clean noisy annotations and accurately classify images [109]; the unified framework of random grouping and attention effectively reduces the negative impact of noisy web image annotations [110]; decoupling trains two deep networks simultaneously, and only updates parameters on examples where the two classifiers disagree [104]; CleanNet is designed to make label-noise detection and learning from noisy data with human supervision scalable through transfer learning [111]; CurriculumNet designs a training curriculum by measuring and ranking the complexity of data using its distribution density in a feature space [112]; co-mining combines co-teaching with the ArcFace loss [113] for the face recognition task [114]; O2U-Net only requires adjusting the hyper-parameters of deep networks to transfer their status from overfitting to underfitting (O2U) cyclically [115]; deep self-learning proposes an iterative learning framework to relabel noisy samples and train deep networks on real noisy datasets, without using extra clean supervision [116]; the label-noise information strategy proposes training methods that control memorization by regularizing label-noise information in the weights [117]; different from Co-teaching+, Co-regularization aims to reduce the diversity of the two networks during training [118]; and the data-coefficient method wisely takes advantage of a small trusted dataset to optimize exemplar weights and labels of mislabeled data, which distills effective supervision for robust training [119].
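In the spirit of the deep k-NN filter described above, the following scikit-learn sketch flags examples whose labels disagree with their neighbors in a feature space. It is our own simplified illustration (a query point here participates in its own neighborhood, which a more faithful version would exclude); the function name and the random features are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_label_filter(features, labels, k=10):
    """Flag examples whose label disagrees with the k-NN vote.

    `features` would be intermediate-layer embeddings of a preliminary
    model; we keep an example iff the neighbor consensus matches its label.
    """
    knn = KNeighborsClassifier(n_neighbors=k).fit(features, labels)
    predicted = knn.predict(features)  # neighbor consensus for each point
    return predicted == labels         # boolean mask of examples to keep

# toy usage on random embeddings
feats = np.random.randn(200, 32)
labs = np.random.randint(0, 5, 200)
keep = knn_label_filter(feats, labs, k=10)
filtered_feats, filtered_labs = feats[keep], labs[keep]
```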
FUTURE WORKS
Starting from 1988, label-noise learning has evolved over three decades, moving from statistical learning to representation learning. Especially since 2012, representation learning has become increasingly important, which directly gave birth to the above LNRL methods. Similar to other areas in machine learning, we hope to propose not only new methods, but also new research directions, which can broaden and boost LNRL research in both academia and industry.
In LNRL, the first thing we should do is construct new datasets with real noise, which is critical to the rapid development of LNRL. To the best of our knowledge, most researchers test their LNRL methods on simulated datasets, such as MNIST and CIFAR-10. To make further breakthroughs, we should build new datasets with real noise, such as Clothing1M [120]. Note that, similar to ImageNet, many researchers train deep networks to overfit Clothing1M via different tricks, which may not touch the core issue of LNRL. This motivates us to rethink real datasets in LNRL. Five years after Clothing1M was released, Jiang et al. [40] proposed a new but realistic type of noisy dataset with "web-label noise" (or red noise), which enables us to conduct controlled experiments systematically in realistic scenarios. Another interesting point is that benchmark datasets with real noise mainly focus on image classification, rather than natural language and speech processing. Obviously, these directions also involve label noise, which needs to be addressed further.

To sum up, we should build new benchmark datasets with real noise, not only in image but also in language and speech. Normally, better datasets can boost the rapid development of LNRL.
Previously, in Section 3.5.1, we saw that CCN is a popular assumption in LNRL. However, the CCN model is only an approximation to real-world noise, and may not work well in practice. To directly model real-world noise, we should consider the features in the label corruption process. This motivates us to explore the instance-dependent noise (IDN) model, which is formulated as $p(\bar{Y} \mid X, Y)$.

Specifically, the IDN model considers a more general noise, where the probability that an instance is mislabeled depends on both its class and its features. Intuitively, this noise is quite realistic, as poor-quality or ambiguous instances are more likely to be mislabeled in real-world datasets. However, it is much more complex to formulate the IDN model, since the probability of mislabeling becomes a function not only of the label space but also of the input space, which can be very high dimensional. Moreover, without some extra assumptions or information, IDN is unidentifiable.

Towards the IDN model, there are three seminal explorations. For instance, Menon et al. propose the boundary-consistent noise model [62], which considers stronger noise for samples closer to the decision boundary of the Bayes optimal classifier. However, this model is restricted to binary classification and cannot estimate noise functions. Cheng et al. recently studied a particular case of the IDN model [63], where the noise functions are upper-bounded. Nonetheless, their method is limited to binary classification and has only been tested on small datasets. Berthon et al. propose to tackle the IDN model from the source, by considering confidence scores to be available for the label of each instance; they term this new setting confidence-scored IDN (CSIDN) [121]. Based on CSIDN, they derive an instance-level forward correction algorithm.
When discussing robustness, most researchers first think of adversarial robustness [43]. However, adversarial robustness can be obtained via adversarial training, where the features are adversarially perturbed while the labels remain intact (i.e., clean). Is this the optimal way to acquire adversarial robustness? In other words, should we consider the scenario where the features are adversarially perturbed while the labels are noisy? We term this adversarial LNRL.

Towards adversarial LNRL, there are two seminal works. For example, Wang et al. propose a new defense algorithm called misclassification-aware adversarial training (MART) [122], which explicitly differentiates the misclassified examples (i.e., label noise) and the correctly classified examples during training. To address this issue, MART introduces a misclassification-aware regularization, namely $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big(h_\theta(x_i) \neq h_\theta(\hat{x}'_i)\big) \cdot \mathbb{1}\big(h_\theta(x_i) \neq y_i\big)$. Intuitively, $\mathbb{1}(h_\theta(x_i) \neq y_i)$ denotes the misclassified examples, which can be closely connected with label noise. Meanwhile, Zhang et al. consider the same issue, namely misclassified examples in adversarial training. Specifically, they propose friendly adversarial training (FAT) [123], which trains deep networks using wrongly-predicted adversarial data that minimize the loss and correctly-predicted adversarial data that maximize the loss. To realize FAT, they introduce early-stopped PGD: once adversarial data is misclassified by the current model, they stop the PGD iterations early. At a high level, the objective of FAT is min-min, instead of the min-max in standard adversarial training.

We have envisioned three promising directions above, which belong to the vertical domain of LNRL. However, we hope to explore the horizontal domain more, namely noisy data instead of only noisy labels. Here, we summarize different formats of noisy data and show some preliminary works.

• Feature: Naturally, label noise leads us to consider feature noise, of which adversarial examples are a special case. The problem of feature noise is formulated as $p(\bar{X} \mid Y)$, where the features are corrupted but the labels are intact. Therefore, adversarial training can be the main tool to defend against adversarial examples. Note that there exists another type of feature noise called random perturbation. To address this issue, Zhang et al. propose a robust ResNet [124], which is motivated by dynamic systems. Specifically, they characterize ResNet based on an explicit Euler method. This allows us to exploit the step factor $h$ in the Euler method to control the robustness of ResNet. They have proved that a small step factor $h$ can benefit its training and generalization robustness during back and forward propagation. Namely, controlling $h$ robustifies deep networks, which can alleviate feature noise.

• Preference: Han et al. [125] and Pan et al. [126] try to address preference noise in ranking problems, which play an important role in our daily life, such as ordinal peer grading, online image rating and online product recommendation. Specifically, Han et al. propose the ROPAL model, which integrates the Plackett-Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects $k$-ary noisy preferences with a certain probability. However, ROPAL cannot handle dynamic lengths of $k$-ary noisy preferences, which motivates Pan et al. to propose COUPLE, which leverages stagewise learning to break the limit of fixed length.
To update the parameters of both models, they use online Bayesian inference.

• Domain: Domain adaptation (DA) is one of the fundamental problems in machine learning, arising when the data volume in the target domain is scarce. Previous DA methods assume that the labeled data in the source domain are purely clean. However, in practice, the labeled data in the source domain come from amateur annotators or the Internet due to its large volume. This issue brings us a new setting, where labels in the source domain are noisy. We call this setting wildly domain adaptation (WDA). There are two seminal works. Specifically, to handle WDA, Liu et al. [127] propose a Butterfly framework, which maintains four deep networks simultaneously. Butterfly can obtain high-quality domain-invariant representations (DIR) and target-specific representations (TSR) in an iterative manner. Meanwhile, Yu et al. [128] propose a novel Denoising Conditional Invariant Component (DCIC) framework, which provably ensures extracting invariant representations and estimating the label distribution in the target domain with no bias.

• Similarity: Similarity-based learning is an emerging problem. Compared to class labels, similarity labels are easier to obtain, especially for some sensitive matters, e.g., religion and politics. For example, for sensitive matters, people often hesitate to directly answer "What is your opinion on issue A?", while they are more likely to answer "With whom do you share the same opinion on issue A?". Therefore, similarity labels are easier to obtain. However, in some cases, people may not be willing to provide their real thoughts even when facing easy questions. Therefore, noisy-similarity-labeled data are very challenging. Wu et al. employ a noise transition matrix to model similarity noise [129], which has been integrated into a deep learning system.

• Graph: Graph neural networks (GNNs) are very popular in the machine learning community [130]. However, are GNNs robust to noise? For example, once a node or edge is corrupted, the performance of GNNs will definitely deteriorate. Since GNNs are highly related to discrete and combinatorial optimization, LNRL methods cannot be directly deployed. Therefore, it is very meaningful to robustify GNNs under node or edge noise, where the noise can occur in labels and features. Recently, Wang et al. proposed a robust and unsupervised embedding framework called Cross-Graph [131], which can handle structural corruption in attributed graphs.

• Demonstration: The goal of imitation learning (IL) is to learn a good policy from high-quality demonstrations. However, the quality of demonstrations in reality can be diverse, since it is easier and cheaper to collect demonstrations from a mix of experts and amateurs. This brings us a new setting in IL called diverse-quality demonstrations, where low-quality demonstrations are highly noisy. To handle diverse-quality demonstrations, when experts provide additional information about the quality, learning becomes relatively easy, since the quality can be estimated by their confidence scores [132], ranking scores [133] and a small number of high-quality demonstrations [134]. However, without the
availability of experts, these methods may not work well. Recently, Tangkaratt et al. pushed this line forward, and propose to model the quality with a probabilistic graphical model termed VILD [135]. Specifically, they estimate the quality along with a reward function that represents the intention of the experts' decision making. Moreover, they use a variational approach to handle large state-action spaces, and employ importance sampling to improve data efficiency.
In the future, we envision four research directions. Among them, the first three mainly focus on LNRL itself: building new datasets will boost the rapid development of LNRL, while instance-dependent LNRL and adversarial LNRL will push the knowledge boundary of LNRL deeper. Lastly, beyond noisy labels, there is noisy data, including features, preferences, domains, similarities, graphs and demonstrations. Our research scope thus extends from a point to a surface.
CONCLUSIONS
In this survey, we thoroughly review the history of label-noise representation learning (LNRL), and formally define what LNRL is from the viewpoint of machine learning. Via the lens of representation learning theory and empirical experiments, we try to understand the mechanism of deep networks under label noise. Based on the above analysis, we categorize different LNRL methods into three perspectives, namely data, objective and optimization policy. Specifically, under this unified taxonomy, we provide a thorough discussion of the pros and cons across different categories. Moreover, we summarize the essential components of robust LNRL, which can enlighten new directions in LNRL. Lastly, we propose four possible research directions. The first three directions mainly focus on pushing the knowledge boundary of LNRL, including building new datasets, instance-dependent LNRL and adversarial LNRL. The last direction goes beyond LNRL, learning from noisy data, such as preference-noise, domain-noise, similarity-noise, graph-noise and demonstration-noise. Ultimately, we hope to uncover the secret of data-noise representation learning, and to formulate a general framework in the near future.

REFERENCES

[1] D. Angluin and P. Laird, "Learning from noisy examples," Machine Learning, vol. 2, no. 4, pp. 343–370, 1988.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NeurIPS, 2012, pp. 1097–1105.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in CVPR, 2009, pp. 248–255.
[4] C. K. Reddy, R. Cutler, and J. Gehrke, "Supervised classifiers for audio impairments with noisy labels," in INTERSPEECH, 2019.
[5] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, "Learning with noisy labels," in NeurIPS, 2013.
[6] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, "Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels," in ICML, 2018, pp. 2304–2313.
[7] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, "Co-teaching: Robust training of deep neural networks with extremely noisy labels," in NeurIPS, 2018, pp. 8527–8537.
[8] B. Frénay and M. Verleysen, "Classification in the presence of label noise: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2013.
[9] G. Algan and I. Ulusoy, "Image classification with deep learning in the presence of noisy labels: A survey," arXiv preprint arXiv:1912.05170, 2019.
[10] D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour, "Deep learning with noisy labels: exploring techniques and remedies in medical image analysis," Medical Image Analysis, 2020.
[11] H. Song, M. Kim, D. Park, and J.-G. Lee, "Learning from noisy labels with deep neural networks: A survey," arXiv preprint arXiv:2007.08199, 2020.
[12] N. Lawrence and B. Schölkopf, "Estimating a kernel fisher discriminant in the presence of label noise," in ICML, 2001, pp. 306–306.
[13] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[14] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
[15] C. Scott, G. Blanchard, and G. Handy, "Classification with asymmetric label noise: Consistency and maximal denoising," in COLT, 2013, pp. 489–511.
[16] B. van Rooyen, A. Menon, and R. Williamson, "Learning with symmetric label noise: The importance of being unhinged," in NeurIPS, 2015, pp. 10–18.
[17] T. Liu and D. Tao, "Classification with noisy labels by importance reweighting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 447–461, 2015.
[18] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, "Training convolutional networks with noisy labels," in ICLR Workshop, 2015.
[19] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, "Training deep neural networks on noisy labels with bootstrapping," in ICLR Workshop, 2015.
[20] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, "Auxiliary image regularization for deep cnns with noisy labels," in ICLR, 2016.
[21] J. Goldberger and E. Ben-Reuven, "Training deep neural-networks using a noise adaptation layer," in ICLR, 2017.
[22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, "Making deep neural networks robust to label noise: A loss correction approach," in CVPR, 2017, pp. 1944–1952.
[23] Y. Wang, A. Kucukelbir, and D. M. Blei, "Robust probabilistic modeling with bayesian data reweighting," in ICML, 2017, pp. 3646–3655.
[24] M. Ren, W. Zeng, B. Yang, and R. Urtasun, "Learning to reweight examples for robust deep learning," in ICML, 2018.
[25] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, "Using trusted data to train deep networks on labels corrupted by severe noise," in NeurIPS, 2018, pp. 10456–10465.
[26] B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama, "Masking: A new perspective of noisy supervision," in NeurIPS, 2018, pp. 5836–5846.
[27] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in ICLR, 2018.
[28] Z. Zhang and M. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," in NeurIPS, 2018, pp. 8778–8788.
[29] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S.-T. Xia, S. Wijewickrema, and J. Bailey, "Dimensionality-driven learning with noisy labels," in ICML, 2018.
[30] E. Arazo, D. Ortego, P. Albert, N. E. O'Connor, and K. McGuinness, "Unsupervised label noise modeling and loss correction," in ICML, 2019.
[31] D. Hendrycks, K. Lee, and M. Mazeika, "Using pre-training can improve model robustness and uncertainty," in ICML, 2019.
[32] N. Charoenphakdee, J. Lee, and M. Sugiyama, "On symmetric losses for learning from corrupted labels," in ICML, 2019.
[33] S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof, "Combating label noise in deep learning using abstention," in ICML, 2019.
[34] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, "Meta-weight-net: Learning an explicit mapping for sample weighting," in NeurIPS, 2019, pp. 1919–1930.
[35] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar, "Can gradient clipping mitigate label noise?" in ICLR, 2020.
[36] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, "Self: Learning to filter noisy labels with self-ensembling," in ICLR, 2020.
[37] J. Li, R. Socher, and S. C. Hoi, "Dividemix: Learning with noisy labels as semi-supervised learning," in ICLR, 2020.
[38] Y. Lyu and I. W. Tsang, "Curriculum loss: Robust learning and generalization against label corruption," in ICLR, 2020.
[39] B. Han, G. Niu, X. Yu, Q. Yao, M. Xu, I. Tsang, and M. Sugiyama, "Sigua: Forgetting may make learning with noisy labels more robust," in ICML, 2020.
[40] L. Jiang, D. Huang, M. Liu, and W. Yang, "Beyond synthetic noise: Deep learning on controlled noisy labels," in ICML, 2020.
[41] T. Mitchell, Machine Learning, 1997.
[42] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[43] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[44] Z.-H. Zhou, "A brief introduction to weakly supervised learning," National Science Review, vol. 5, no. 1, pp. 44–53, 2018.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, "Imagenet large scale visual recognition challenge," IJCV, vol. 115, pp. 211–252, 2015.
[46] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., "A closer look at memorization in deep networks," in ICML, 2017.
[47] X. Zhu and A. B. Goldberg, "Introduction to semi-supervised learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
[48] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in NeurIPS, 2017.
[49] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: a regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
[50] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, "Mixmatch: A holistic approach to semi-supervised learning," in NeurIPS, 2019.
[51] C. Elkan and K. Noto, "Learning classifiers from only positive and unlabeled data," in KDD, 2008.
[52] R. Kiryo, G. Niu, M. C. Du Plessis, and M. Sugiyama, "Positive-unlabeled learning with non-negative risk estimator," in NeurIPS, 2017.
[53] Y.-G. Hsieh, G. Niu, and M. Sugiyama, "Classification from positive, unlabeled and biased negative data," in ICML, 2019.
[54] T. Ishida, G. Niu, and M. Sugiyama, "Binary classification from positive-confidence data," in NeurIPS, 2018.
[55] T. Ishida, G. Niu, W. Hu, and M. Sugiyama, "Learning from complementary labels," in NeurIPS, 2017.
[56] X. Yu, T. Liu, M. Gong, and D. Tao, "Learning with biased complementary labels," in ECCV, 2018.
[57] T. Ishida, G. Niu, A. Menon, and M. Sugiyama, "Complementary-label learning for arbitrary losses and models," in ICML, 2019.
[58] L. Feng, T. Kaneko, B. Han, G. Niu, B. An, and M. Sugiyama, "Learning with multiple complementary labels," in ICML, 2020.
[59] N. Lu, G. Niu, A. Menon, and M. Sugiyama, "On the minimal supervision for training any binary classifier from only unlabeled data," in ICLR, 2019.
[60] N. Lu, T. Zhang, G. Niu, and M. Sugiyama, "Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach," in AISTATS, 2020.
[61] M. Kearns, "Efficient noise-tolerant learning from statistical queries," Journal of the ACM, vol. 45, no. 6, pp. 983–1006, 1998.
[62] A. Menon, B. Van Rooyen, and N. Natarajan, "Learning from binary labels with instance-dependent corruption," Machine Learning, vol. 107, pp. 1561–1595, 2018.
[63] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, "Learning with bounded instance-and label-dependent label noise," in ICML, 2020.
[64] G. Patrini, F. Nielsen, R. Nock, and M. Carioni, "Loss factorization, weakly supervised learning and label noise robustness," in ICML, 2016, pp. 708–717.
[65] N. Golowich, A. Rakhlin, and O. Shamir, "Size-independent sample complexity of neural networks," in COLT, 2018.
[66] M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[67] M. Li, M. Soltanolkotabi, and S. Oymak, "Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks," in AISTATS, 2020.
[68] B. van Rooyen and R. C. Williamson, "A theory of learning with corrupted labels," Journal of Machine Learning Research, vol. 18, no. 1, pp. 8501–8550, 2017.
[69] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in
ICLR , 2017.[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,“Rethinking the inception architecture for computer vision,” in
CVPR , 2016.[71] M. Lukasik, S. Bhojanapalli, A. K. Menon, and S. Kumar, “Doeslabel smoothing mitigate label noise?” in
ICML , 2020.[72] A. G. Frodesen, O. Skjeggestad, and H. Toefte, “Probability andstatistics in particle physics,” 1979.[73] L. Wasserman,
All of statistics: a concise course in statistical inference .Springer Science & Business Media, 2013.[74] M. Wainwright and M. Jordan, “Graphical models, exponentialfamilies, and variational inference, ser,”
Foundations and Trends inMachine Learning , vol. 1, 2008.[75] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama,“Are anchor points really indispensable in label-noise learning?”in
NeurIPS , 2019.[76] I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick, “Seeingthrough the human reporting bias: Visual classifiers from noisyhuman-centric labels,” in
CVPR , 2016.[77] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig,J. Philbin, and L. Fei-Fei, “The unreasonable effectiveness of noisydata for fine-grained recognition,” in
ECCV , 2016.[78] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning fromnoisy labels with distillation,” in
ICCV , 2017.[79] C. G. Northcutt, T. Wu, and I. L. Chuang, “Learning with confidentexamples: Rank pruning for robust classification with noisy labels,”in
UAI , 2017.[80] Y. Kim, J. Yim, J. Yun, and J. Kim, “Nlnl: Negative learning fornoisy labels,” in
ICCV , 2019.[81] P. H. Seo, G. Kim, and B. Han, “Combinatorial inference againstlabel noise,” in
NeurIPS , 2019.[82] T. Kaneko, Y. Ushiku, and T. Harada, “Label-noise robust genera-tive adversarial networks,” in
CVPR , 2019.[83] A. Lamy, Z. Zhong, A. K. Menon, and N. Verma, “Noise-tolerantfair classification,” in
NeurIPS , 2019.[84] J. Yao, H. Wu, Y. Zhang, I. W. Tsang, and J. Sun, “Safeguardeddynamic label regression for noisy supervision,” in
AAAI , 2019.[85] Y. Grandvalet and Y. Bengio, “Semi-supervised learning byentropy minimization,” in
NeurIPS , 2005.[86] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in
ICMLWorkshop , 2013.[87] Q. Yao, H. Yang, B. Han, G. Niu, and J. T. Kwok, “Searchingto exploit memorization effect in learning with noisy labels,” in
ICML , 2020.[88] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, andB. Schölkopf, “Covariate shift by kernel mean matching,”
Datasetshift in machine learning , vol. 3, no. 4, p. 5, 2009.[89] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculumlearning,” in
ICML , 2009.[90] S. Laine and T. Aila, “Temporal ensembling for semi-supervisedlearning,” in
ICLR , 2017.[91] S. Branson, G. Van Horn, and P. Perona, “Lean crowdsourcing:Combining humans and machines in an online system,” in
CVPR ,2017.[92] A. Vahdat, “Toward robustness against label noise in trainingdeep discriminative neural networks,” in
NeurIPS , 2017.[93] H.-S. Chang, E. Learned-Miller, and A. McCallum, “Active bias:Training more accurate neural networks by emphasizing highvariance samples,” in
NeurIPS , 2017.[94] A. Khetan, Z. C. Lipton, and A. Anandkumar, “Learning fromnoisy singly-labeled data,”
ICLR , 2018.[95] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa, “Joint opti-mization framework for learning with noisy labels,” in
CVPR ,2018.[96] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia,“Iterative learning with open-set noisy labels,” in
CVPR , 2018.[97] S. Jenni and P. Favaro, “Deep bilevel learning,” in
ECCV , 2018.
SURVEY OF LABEL-NOISE REPRESENTATION LEARNING, NOVEMBER 2020 21 [98] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetriccross entropy for robust learning with noisy labels,” in
ICCV , 2019.[99] J. Li, Y. Song, J. Zhu, L. Cheng, Y. Su, L. Ye, P. Yuan, andS. Han, “Learning from large-scale noisy web data with ubiquitousreweighting for image classification,”
IEEE Transactions on PatternAnalysis and Machine Intelligence , 2019.[100] Y. Xu, P. Cao, Y. Kong, and Y. Wang, “L_dmi: A novel information-theoretic loss function for training deep nets robust to label noise,”in
NeurIPS , 2019.[101] Y. Liu and H. Guo, “Peer loss functions: Learning from noisylabels without knowing noise rates,” in
ICML , 2020.[102] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey,“Normalized loss functions for deep learning with noisy labels,”2020.[103] X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama, “Howdoes disagreement help generalization against label corruption?”in
ICML , 2019.[104] E. Malach and S. Shalev-Shwartz, “Decoupling" when to update"from" how to update",” in
NeurIPS , 2017.[105] P. Chen, B. Liao, G. Chen, and S. Zhang, “Understanding andutilizing deep neural networks trained with noisy labels,” in
ICML ,2019.[106] Q. Yao and M. Wang, “Taking human out of learning applica-tions: A survey on automated machine learning,” arXiv preprintarXiv:1810.13306, Tech. Rep., 2018.[107] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds.,
Automated MachineLearning: Methods, Systems, Challenges . Springer, 2018.[108] D. Bahri, H. Jiang, and M. Gupta, “Deep k-nn for noisy labels,” in
ICML , 2020.[109] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Be-longie, “Learning from noisy large-scale datasets with minimalsupervision,” in
CVPR , 2017.[110] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid, “Attend in groups:a weakly-supervised deep learning framework for learning fromweb data,” in
CVPR , 2017.[111] K.-H. Lee, X. He, L. Zhang, and L. Yang, “Cleannet: Transferlearning for scalable image classifier training with label noise,” in
CVPR , 2018, pp. 5447–5456.[112] S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott,and D. Huang, “Curriculumnet: Weakly supervised learning fromlarge-scale web images,” in
ECCV , 2018.[113] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additiveangular margin loss for deep face recognition,” in
CVPR , 2019.[114] X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei, “Co-mining: Deepface recognition with noisy labels,” in
Proceedings of the IEEEinternational conference on computer vision , 2019.[115] J. Huang, L. Qu, R. Jia, and B. Zhao, “O2u-net: A simple noisylabel detection approach for deep neural networks,” in
Proceedingsof the IEEE International Conference on Computer Vision , 2019.[116] J. Han, P. Luo, and X. Wang, “Deep self-learning from noisy labels,”in
ICCV , 2019.[117] H. Harutyunyan, K. Reing, G. V. Steeg, and A. Galstyan, “Improv-ing generalization by controlling label-noise information in neuralnetwork weights,” in
ICML , 2020.[118] H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels byagreement: A joint training method with co-regularization,” in
CVPR , 2020.[119] Z. Zhang, H. Zhang, S. O. Arik, H. Lee, and T. Pfister, “Distillingeffective supervision from severe label noise,” in
CVPR , 2020.[120] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning frommassive noisy labeled data for image classification,” in
CVPR ,2015, pp. 2691–2699.[121] A. Berthon, B. Han, G. Niu, T. Liu, and M. Sugiyama, “Confidencescores make instance-dependent label-noise learning possible,” arXiv preprint arXiv:2001.03772 , 2020.[122] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu, “Improvingadversarial robustness requires revisiting misclassified examples,”in
ICLR , 2020.[123] J. Zhang, X. Xu, B. Han, G. Niu, L. Cui, M. Sugiyama, andM. Kankanhalli, “Attacks which do not kill training make ad-versarial learning stronger,” in
ICML , 2020.[124] J. Zhang, B. Han, L. Wynter, K. H. Low, and M. Kankanhalli,“Towards robust resnet: A small step but a giant leap,” in
IJCAI ,2019.[125] B. Han, Y. Pan, and I. W. Tsang, “Robust plackett–luce model fork-ary crowdsourced preferences,”
Machine Learning , vol. 107, no. 4,pp. 675–702, 2018. [126] Y. Pan, B. Han, and I. W. Tsang, “Stagewise learning for noisy k-arypreferences,”
Machine Learning , vol. 107, no. 8-10, pp. 1333–1361,2018.[127] F. Liu, J. Lu, B. Han, G. Niu, G. Zhang, and M. Sugiyama,“Butterfly: A panacea for all difficulties in wildly unsuperviseddomain adaptation,” arXiv preprint arXiv:1905.07720 , 2019.[128] X. Yu, T. Liu, M. Gong, K. Zhang, K. Batmanghelich, and D. Tao,“Label-noise robust domain adaptation,” in
ICML , 2020.[129] S. Wu, X. Xia, T. Liu, B. Han, M. Gong, N. Wang, H. Liu, andG. Niu, “Multi-class classification from noisy-similarity-labeleddata,” arXiv preprint arXiv:2002.06508 , 2020.[130] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learningon graphs: Methods and applications,”
IEEE Data EngineeringBulletin , 2017.[131] C. Wang, B. Han, S. Pan, J. Jiang, G. Niu, and G. Long, “Cross-graph: Robust and unsupervised embedding for attributed graphswith corrupted structure,” in
ICDM , 2020.[132] Y.-H. Wu, N. Charoenphakdee, H. Bao, V. Tangkaratt, andM. Sugiyama, “Imitation learning from imperfect demonstration,”in
ICML , 2019.[133] D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolatingbeyond suboptimal demonstrations via inverse reinforcementlearning from observations,” in
ICML , 2019.[134] J. Audiffren, M. Valko, A. Lazaric, and M. Ghavamzadeh, “Max-imum entropy semi-supervised inverse reinforcement learning,”in
IJCAI , 2015.[135] V. Tangkaratt, B. Han, M. E. Khan, and M. Sugiyama, “Variationalimitation learning with diverse-quality demonstrations,” in
ICML ,2020.[136] T. Bylander, “Learning linear threshold functions in the presenceof classification noise,” in
COLT , 1994, pp. 340–347.[137] M. Dredze, K. Crammer, and F. Pereira, “Confidence-weightedlinear classification,” in
ICML , 2008, pp. 264–271.[138] Y. Freund, “A more robust boosting algorithm,” arXiv preprintarXiv:0905.2138 , 2009.[139] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir, “Onlinelearning of noisy data,”
IEEE Transactions on Information Theory ,vol. 57, no. 12, pp. 7907–7931, 2011.
Bo Han is an Assistant Professor of Computer Science at Hong Kong Baptist University, and a Visiting Scientist at the RIKEN Center for Advanced Intelligence Project (RIKEN AIP). He was a Postdoc Fellow at RIKEN AIP (2019-2020). He received his Ph.D. degree in Computer Science from the University of Technology Sydney (2015-2019). During 2018-2019, he was a Research Intern with the AI Residency Program at RIKEN AIP. His research interests lie in machine learning and deep learning, especially weakly-supervised learning and adversarial learning. He has served as an area chair of NeurIPS'20 and ICLR'21, on the senior program committee of IJCAI'21, and on the program committees of ICML, AISTATS, UAI, AAAI, IJCAI and ACML. He received the RIKEN BAIHO Award (2019), the RGC Early Career Scheme (2020) and the NSFC Young Scientists Fund (2020).
Quanming Yao is a senior scientist at 4Paradigm, and the founder and current leader of the company's machine learning research team. He obtained his Ph.D. degree at the Department of Computer Science and Engineering of Hong Kong University of Science and Technology (HKUST) in 2018 and received his bachelor degree at Huazhong University of Science and Technology (HUST) in 2013. He is a recipient of the Wu Wenjun Prize for Excellent Youth in Artificial Intelligence (issued by CAAI), the runner-up of the Ph.D. Research Excellence Award (School of Engineering, HKUST), and a winner of a Google Fellowship (in machine learning). Currently, his main research topics are Automated Machine Learning (AutoML) and neural architecture search (NAS). He has authored 37 top-tier conference/journal papers, including in JMLR, IEEE TPAMI, ICML and NeurIPS, with a total of 1,104 citations (from 2015, by Google Scholar). He was an Area Chair for IJCAI 2021, a Senior Program Committee member for IJCAI 2020 and AAAI 2020-2021, and a guest editor of the IEEE TPAMI AutoML special issue in 2019.
Tongliang Liu is a Lecturer (Assistant Professor) with the School of Computer Science at the University of Sydney. He is also a Visiting Scientist at RIKEN AIP. His current research interests include weakly supervised learning and adversarial learning. He has authored and co-authored more than 60 research articles in venues including NeurIPS, ICML, CVPR, ECCV, AAAI, IJCAI, KDD, ICME, IEEE T-PAMI, T-NNLS, and T-IP, with best paper awards, e.g., the 2019 ICME Best Paper Award. He is a recipient of the Discovery Early Career Researcher Award (DECRA) from the Australian Research Council (ARC) and was shortlisted for the J. G. Russell Award by the Australian Academy of Science (AAS) in 2019.
Gang Niu is currently a research scientist (indefinite-term) at the RIKEN Center for Advanced Intelligence Project. He received the PhD degree in computer science from Tokyo Institute of Technology. Before joining RIKEN as a research scientist, he was a senior software engineer at Baidu and then an assistant professor at the University of Tokyo. He has published more than 60 journal articles and conference papers, including 14 NeurIPS (1 oral and 3 spotlights) and 17 ICML papers. He has also served, or will serve, as an area chair 10 times, including for AISTATS 2019, ICML 2019 & 2020, NeurIPS 2019 & 2020 and ICLR 2021.
Ivor W. Tsang is Professor of Artificial Intelligence at the University of Technology Sydney, and the Research Director of the Australian Artificial Intelligence Institute. In 2013, Prof. Tsang received the prestigious Australian Research Council Future Fellowship for his research on machine learning on big data. In 2019, his paper titled "Towards ultrahigh dimensional feature selection for big data" received the International Consortium of Chinese Mathematicians Best Paper Award. In 2020, Prof. Tsang was recognized as the AI 2000 AAAI/IJCAI Most Influential Scholar in Australia for his outstanding contributions to the field of Artificial Intelligence between 2009 and 2019. His work on transfer learning earned him the Best Student Paper Award at the International Conference on Computer Vision and Pattern Recognition 2010 and the 2014 IEEE Transactions on Multimedia Prize Paper Award. In addition, he received the prestigious IEEE Transactions on Neural Networks Outstanding 2004 Paper Award in 2007. Prof. Tsang serves as a Senior Area Chair for Neural Information Processing Systems, an Area Chair for the International Conference on Machine Learning, and on the editorial boards of the Journal of Machine Learning Research, Machine Learning, and the IEEE Transactions on Pattern Analysis and Machine Intelligence.
James T. Kwok (F'17) received the PhD degree in computer science from the Hong Kong University of Science and Technology in 1996. He was with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, as an assistant professor. He is currently a professor with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include kernel methods, machine learning, and artificial neural networks. He received the IEEE Outstanding 2004 Paper Award, and the Second Class Award in Natural Sciences by the Ministry of Education, People's Republic of China, in 2008. He has been a program co-chair for a number of international conferences, and served as an associate editor for the IEEE Transactions on Neural Networks and Learning Systems from 2006 to 2012. Currently, he is an associate editor for the Neurocomputing journal. He is a fellow of the IEEE.
Masashi Sugiyama is Director of the RIKEN Center for Advanced Intelligence Project and Professor at the University of Tokyo. He received the PhD degree in computer science from Tokyo Institute of Technology. His research is designing statistical data analysis algorithms for challenging problems. He (co-)authored machine learning monographs such as Machine Learning in Non-Stationary Environments (MIT Press), Density Ratio Estimation in Machine Learning (Cambridge University Press), Statistical Reinforcement Learning (Chapman and Hall/CRC), Introduction to Statistical Machine Learning (Morgan Kaufmann), and Variational Bayesian Learning Theory (Cambridge University Press). He served as Program Co-chair for the Neural Information Processing Conference, the International Conference on Artificial Intelligence and Statistics, and the Asian Conference on Machine Learning. He serves as an Associate Editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence, and an Editorial Board Member for the Machine Learning journal and Frontiers of Computer Science. He received the Japan Academy Medal in 2017.
APPENDIX
Early Stage.
Before delving into label-noise representation learning, we should briefly overview some milestone works in label-noise statistical learning. Starting from 1988, Angluin et al. prove that a learning algorithm can handle incorrect training examples robustly when the noise rate is less than one half under the random noise model [1]. Bylander further demonstrates that linear threshold functions are polynomially learnable in the presence of classification noise [136]. Lawrence and Schölkopf construct a kernel Fisher discriminant that formulates the label-noise problem as a probabilistic model [12], which is solved by the Expectation-Maximization algorithm. Although the above works tackle noisy labels theoretically and empirically, Bartlett et al. justify that most loss functions are not completely robust to label noise [13], which means that classifiers based on label-noise robust algorithms are still affected by label noise.

During this period, many works emerged and contributed to this area. For example, Crammer et al. propose an online Passive-Aggressive perceptron algorithm to cope with label noise [14]. Dredze et al. propose confidence-weighted learning to weigh trusted labels more [137]. Freund proposes a boosting algorithm to combat random label noise [138]. To handle label noise theoretically, Cesa-Bianchi et al. propose an online learning algorithm leveraging unbiased estimates of the gradient of the loss [139]. In 2013, Natarajan et al. formally formulate an unbiased risk estimator for binary classification with noisy labels [5]. This work is very important to the area, since it is the first to provide guarantees for risk minimization under random label noise. Moreover, it provides an easy way to suitably modify any given surrogate loss function for handling label noise.

Meanwhile, Scott et al. study the classification problem under the class-conditional noise model, and propose a way to handle asymmetric label noise [15]. In contrast, van Rooyen et al. propose the unhinged loss to tackle symmetric label noise [16]. Liu and Tao propose the method of anchor points to estimate the noise rate, and further leverage importance reweighting to design surrogate loss functions for class-conditional label noise [17]. Instead of designing ad-hoc losses, Patrini et al. introduce linear-odd losses, which can be factorized into an even and an odd loss function [64]. More importantly, they estimate the mean operator from noisy data, and plug this operator into linear-odd losses for empirical risk minimization, which is resistant to asymmetric label noise.

It should be noted that the field moved from label-noise statistical learning to label-noise representation learning after 2015, for two reasons. First, label-noise statistical learning mainly focuses on designing theoretically robust methods for small-scale noisy data. Such methods do not empirically work well on the large-scale noisy data of our daily life, such as Clothing1M [120], which emerged in 2015. Second, label-noise statistical learning mainly applies to shallow and convex models, such as support vector machines, while deep and non-convex models, such as convolutional and recurrent neural networks, have become mainstream due to their better empirical performance, not only in vision but also in language, speech and video tasks. Therefore, it is urgent to design label-noise representation learning methods for robustly training deep models with noisy labels.
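To make the unbiased risk estimator of Natarajan et al. [5] concrete before moving on, below is a minimal NumPy sketch for binary labels in {-1, +1}; the logistic loss and the specific noise rates are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def unbiased_loss(loss_fn, t, y_noisy, rho_pos, rho_neg):
    """Loss correction via the method of unbiased estimators [5].

    For clean labels y in {-1, +1} with class-conditional flip rates
    rho_pos = P(noisy = -1 | clean = +1) and
    rho_neg = P(noisy = +1 | clean = -1), the corrected loss is
    unbiased: its expectation over the label noise equals the loss
    evaluated on the clean label.
    """
    rho_y = np.where(y_noisy == 1, rho_pos, rho_neg)     # flip rate of the observed label
    rho_flip = np.where(y_noisy == 1, rho_neg, rho_pos)  # flip rate of the opposite label
    numer = (1.0 - rho_flip) * loss_fn(t, y_noisy) - rho_y * loss_fn(t, -y_noisy)
    return numer / (1.0 - rho_pos - rho_neg)

# Illustrative usage with the logistic loss on random margins.
logistic = lambda t, y: np.log1p(np.exp(-y * t))
t = np.random.randn(8)                       # classifier outputs (margins)
y_noisy = np.random.choice([-1, 1], size=8)  # observed (possibly flipped) labels
print(unbiased_loss(logistic, t, y_noisy, rho_pos=0.2, rho_neg=0.3))
```

Averaging this corrected loss over noisily labeled samples yields, in expectation, the risk on clean data, which is exactly what enables the risk-minimization guarantees mentioned above.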
Emerging Stage.
There are three seminal works in label-noise representation learning. Sukhbaatar et al. introduce an extra but constrained linear "noise" layer on top of the softmax layer, which adapts the network outputs to model the noisy label distribution [18]. Reed et al. augment the prediction objective with a notion of consistency via soft and hard bootstrapping [19], where the soft version is equivalent to softmax regression with minimum-entropy regularization, and the hard version modifies the regression targets using MAP estimation. Intuitively, this bootstrapping procedure allows the learner to disagree with an inconsistent training label, and to re-label the training data to improve its quality. Azadi et al. propose an auxiliary image regularization technique [20]. The key idea is to exploit the mutual context information among training data, and to encourage the model to select reliable labels.

Following these seminal works, Goldberger et al. introduce a nonlinear "noise" adaptation layer on top of the softmax layer [21], which likewise adapts the network to model the noisy label distribution. Patrini et al. propose forward and backward loss correction approaches simultaneously [22]; based on the corrected loss, they explore a robust two-stage training algorithm. Interestingly, Wang et al. and Ren et al. leverage the same philosophy, namely data reweighting, to learn with label noise, but tackle it from different perspectives. Specifically, Wang et al. take a Bayesian view and propose robust probabilistic modeling [23], where the posterior of the reweighted model identifies uncorrupted data and ignores corrupted data. Ren et al. take a meta-learning view [24], assigning weights to training samples based on their gradient directions: their method performs a meta gradient descent step on the current mini-batch example weights (initialized from zero) to minimize the loss on a clean, unbiased validation set.

Besides the above works, many important works appeared in 2018, spanning diverse directions. At a high level, the major directions include estimating the transition matrix, regularization, designing losses, and small-loss tricks. Among them, small-loss tricks are inspired by the memorization effect of deep neural networks: deep models fit easy (clean) patterns first, but over-fit hard (noisy) patterns eventually. Small-loss tricks therefore regard small-loss samples as relatively "clean" samples, and back-propagate only such samples to update the model parameters. For example, Jiang et al. are the first to leverage small-loss tricks to handle label noise [6]. However, they train only a single network iteratively, which is similar to the self-training approach; such an approach inherits the accumulated error caused by sample-selection bias. To address this issue, Han et al. train two deep neural networks simultaneously, where each network back-propagates the data selected by its peer network and updates itself [7] (see the sketch below).

In the context of representation learning, estimating the transition matrix, regularization and designing losses remain prosperous directions for handling label noise. For instance, given that a small set of trusted examples is available, Hendrycks et al. propose the gold loss correction: they leverage the trusted examples to estimate the (gold) transition matrix [25].
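The small-loss cross-update of Han et al. [7] described above can be sketched in a few lines. The following is a minimal PyTorch sketch of one Co-teaching-style step under our own simplifying assumptions (a fixed `keep_ratio`, plain cross-entropy, and illustrative function names); it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def coteaching_step(net1, net2, opt1, opt2, x, y_noisy, keep_ratio):
    """One small-loss cross-update in the spirit of Co-teaching [7].

    Each network ranks the mini-batch by its own per-sample loss and
    keeps the smallest-loss (likely clean) fraction; its *peer* is
    then updated only on those kept samples.
    """
    n_keep = max(1, int(keep_ratio * len(y_noisy)))

    # Rank samples by loss; no gradients are needed for the selection.
    with torch.no_grad():
        loss1 = F.cross_entropy(net1(x), y_noisy, reduction="none")
        loss2 = F.cross_entropy(net2(x), y_noisy, reduction="none")
    idx1 = torch.argsort(loss1)[:n_keep]  # net1's small-loss picks
    idx2 = torch.argsort(loss2)[:n_keep]  # net2's small-loss picks

    # Cross update: each network learns from its peer's selection,
    # which mitigates the accumulated error of self-selection.
    opt1.zero_grad()
    F.cross_entropy(net1(x[idx2]), y_noisy[idx2]).backward()
    opt1.step()

    opt2.zero_grad()
    F.cross_entropy(net2(x[idx1]), y_noisy[idx1]).backward()
    opt2.step()
```

In the full method, the kept fraction is scheduled to decrease from 1 toward 1 minus the estimated noise rate, so that more samples are trusted early in training, before the networks have memorized noisy patterns, and fewer later.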