Meta Label Correction for Learning with Weak Supervision
Guoqing Zheng, Ahmed Hassan Awadallah, Susan Dumais
Microsoft Research
{zheng, hassanam, sdumais}@microsoft.com

ABSTRACT
Leveraging weak or noisy supervision for building effective machine learning models has long been an important research problem. The growing need for large-scale datasets to train deep learning models has increased its importance. Weak or noisy supervision can originate from multiple sources, including non-expert annotators or automatic labeling based on heuristics or user interaction signals. Previous work on modeling and correcting weak labels has focused on various aspects, including loss correction and training instance re-weighting. In this paper, we approach this problem from a novel perspective based on meta-learning. We view the label correction procedure as a meta-process and propose a new meta-learning based framework termed MLC for learning with weak supervision. Experiments with different label noise levels on multiple datasets show that MLC can achieve large improvements over previous methods that incorporate weak labels for learning.
1 INTRODUCTION
Recent advances in deep learning have enabled several natural language processing models to achieve impressive performance. At the core of this success lies the availability of large amounts of annotated data. However, such datasets are not readily available at large scale for many tasks. The problem of learning with weak supervision aims to address this challenge by leveraging weak evidence for supervision, such as corrupted labels, noisy labels, or automatic labels based on heuristics or user interaction signals.

Two major lines of work have been proposed to combine trusted (or gold) labeled data with weakly or noisily supervised data for better learning. The first approach relies on re-weighting of training instances (Ren et al., 2018). It aims to assign a proper importance to each sample in the training set such that the ones with higher weights contribute more positively to model training. The second approach relies on the idea of label correction. It aims to correct the noisy/corrupt labels based on certain assumptions about the weak label generation process. In a sense, label correction is a finer-grained way to incorporate the noisy data samples than simply assigning scalar weights to each training instance, and it has been shown to work well even when only a very small set of clean labels is available. Label correction in previous methods relies on assumptions about the weak label generation process and thus often involves estimating a label corruption matrix (Hendrycks et al., 2018). However, the label correction estimation is done in separation from the main model, limiting the flow of information between them.

In this paper, we address the label correction problem from a novel angle based on meta-learning and propose meta label correction (MLC). Specifically, we view the label correction procedure as a meta-process, meaning that its objective is to provide corrected labels for the examples with weak labels. The main supervised model, in turn, is trained to fit the corrected labels (generated by the meta-model). Both the meta-model and the main model are learned by optimizing the model performance on the gold data set (i.e., a validation set w.r.t. the noisy set), allowing us to co-optimize the label correction process and the main model process.

Meta learning has been successfully used for many applications including hyper-parameter tuning (Maclaurin et al., 2015), model selection (Pedregosa, 2016) and neural architecture search (Liu et al., 2018). To the best of our knowledge, MLC is the first to utilize a meta model to automatically "tune" noisy labels from data and combine them with trusted labels for better learning. The contributions of this paper can be summarized as follows:

• A new learning framework for weak supervision based on meta learning, proposed from a novel angle by treating the label correction network as a meta process that provides reliable labels for the main model to learn from;

• Experiments on various text classification and gray-scale image recognition tasks, in which the proposed method outperforms previous best methods on label correction, demonstrating its power.

The rest of the paper is organized as follows: We briefly review the preliminaries on learning with weak supervision, particularly on learning with corrupt/noisy labels, in Section 2, and propose a meta-learning based framework for weak supervision in Section 3.
Empirical evaluations and analysis are conducted in Section 4; we review related work in Section 5 and conclude the paper in Section 6.
2 PRELIMINARIES
We follow a setup of learning with weak supervision that involves two sets of data: a small set of data with clean/reliable labels $\{x_i, y_i\}_{i=1}^{m}$ and a large set of data with weak supervision (noisy/corrupted labels) $\{x_i, y'_i\}_{i=1}^{M}$. Typically the clean set is much smaller than the noisy set ($m \ll M$) due to the scarcity of expert labels and to labeling costs. Training directly on the small clean set often tends to be sub-optimal, as too little data can easily cause over-fitting. The problem of learning with weak supervision under this setup can then be formulated as how to build a predictive model $f: \mathcal{X} \to \mathcal{Y}$ given the two sets. Two major lines of work have been proposed to solve this problem.

2.1 LEARNING WITH LABEL CORRECTION
The first line of work aims to correct the weak labels as much as possible by imposing assumptions on how the noisy labels are generated from the underlying true labels. To be concrete, consider the problem of classifying the data into $k$ categories, where label correction involves estimating a label corruption matrix $C_{k \times k}$ whose entry $C_{ij}$ denotes the probability of observing the weak label for class $i$ while the underlying true class label is actually $j$. Gold loss correction (Hendrycks et al., 2018) falls into this category. A key drawback of this line of work is that the label corruption matrix is often estimated in an ad-hoc way, and that the estimation process is separate from the main model training, hence allowing no feedback from the main model to the estimation process. A minimal sketch of this style of correction is given below.
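For concreteness, here is a minimal PyTorch sketch (not any paper's released implementation; the function name and the smoothing constant are ours) of how a known or estimated corruption matrix $C$, with $C_{ij} = P(\text{weak label } i \mid \text{true label } j)$ as above, is used in forward loss correction: the model's clean-label posterior is pushed through $C$ to obtain a weak-label distribution, which is matched against the observed weak labels.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, weak_labels, C):
    """Forward loss correction sketch. C[i, j] = P(weak label i | true label j).
    The clean-label posterior p(y|x) is mapped through C to a weak-label
    distribution p(y'|x) = C p(y|x), and the usual negative log-likelihood is
    taken against the observed weak labels."""
    p_true = F.softmax(logits, dim=-1)      # (batch, k): model's p(y|x)
    p_weak = p_true @ C.T                   # (batch, k): p(y'|x)
    return F.nll_loss(torch.log(p_weak + 1e-12), weak_labels)
```

In GLC, $C$ is estimated on the trusted set and then held fixed while the main model is trained on the noisy set, which is exactly the separation criticized above.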
2.2 LEARNING TO RE-WEIGHT TRAINING INSTANCES

Knowing that not all training examples are equally important and useful for building a main model, another line of work for learning with weak supervision assigns learnable weights to each example in the noisy training set. The goal is to assign a proper weight to each training example such that the main model performs well on a separate validation set (the clean set) (Ren et al., 2018). The example weights are essentially hyper-parameters of the main model and can be learned by formulating a bi-level optimization problem, written out below. Due to the meta-learning characteristic of this framework, the example-weight learning and the main model training can communicate with each other, and a better model can be learned.
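Concretely, this re-weighting approach can be written as the following bi-level problem (our notation, adapted to the setup of this section; the per-example weights $\epsilon_i$ are the hyper-parameters being learned):

$$\min_{\epsilon \ge 0} \; \mathbb{E}_{(x,y) \in D}\, \ell\big(y, f_{w^*(\epsilon)}(x)\big) \quad \text{s.t.} \quad w^*(\epsilon) = \arg\min_w \sum_{(x_i, y'_i) \in D'} \epsilon_i\, \ell\big(y'_i, f_w(x_i)\big)$$

The weights $\epsilon$ play the same structural role that the label correction parameters $\alpha$ will play in Problem (3) below, but they can only scale each noisy example up or down, not change its label.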
3 META LABEL CORRECTION
One advantage of the label correction approach is that it allows us to combine trusted labels and corrected noisy labels in the learning process. Our proposed approach adopts the label correction methodology while also co-optimizing the label correction process together with the main model process through a unified meta learning framework. We achieve that by adopting a meta-learning framework where the meta learner (meta model) tries to correct the noisy labels, and the main model tries to build the best predictive model with corrected labels coming from the meta model, allowing the meta model and main model to reinforce each other.

Figure 1: MLC architecture. Nodes in gray denote the noisy training examples and green ones denote the clean ones. The meta model $g_\alpha$ (in orange) takes in a noisy example-label pair and generates a "corrected" label, which is treated as the correct label to train the main model $f_w$ (in blue). The trained main model thus depends on the corrected labels $y_c$, and hence on the parameters of the meta model. The trained main model is then tested on a separate true clean set, whose loss is to be minimized. Note that when minimizing the loss on the clean examples, the parameters of the main model are not changed; the loss signal on the clean examples only propagates back to the meta network, making $g_\alpha$ generate better corrections for $(x, y')$. In practice, we cannot always fully train the main model before evaluating on the clean examples, so a $k$-step SGD look-ahead version of the main model is used as the "trained" model.

We describe the framework in detail as follows. We are given a set of clean data examples $D = \{x, y\}^m$ and a set of noisy data examples $D' = \{x, y'\}^M$, with $m$ much smaller than $M$. To best utilize the information provided by the weak labels, we propose to construct a label correction network (LCN), serving as a meta model, which takes a noisy data example and its weak label as input and produces a corrected version of the weak label. Formally, the LCN is defined as a function with parameters $\alpha$:

$$y_c = g_\alpha(x, y') \quad (1)$$

which corrects the weak label $y'$ of example $x$ towards its true label (the subscript in $y_c$ emphasizes that it is a corrected label). Meanwhile, the main model $f$, which we aim to train and use for prediction after training, is instantiated as another function with parameters $w$:

$$y = f_w(x) \quad (2)$$

Without linking the two models, there is no way to enforce that 1) the label generated for an example by the meta model $g$ is indeed a meaningful one, let alone a corrected one, since we cannot directly train the meta model without clean labels for the noisy examples; and 2) the main model $f$ is not fitting arbitrary targets, since the provided labels may not align with the unknown true labels. Fortunately, the two models can be linked together via a bi-level optimization problem, based on the intuition that if the labels generated by the meta model are of high quality, then we can use these examples and their corrected labels as training data for a good main model, such that the main model achieves low loss on a separate set of clean examples. This can be instantiated as the following bi-level optimization problem:

$$\min_\alpha \; \mathbb{E}_{(x,y) \in D}\, \ell\big(y, f_{w^*_\alpha}(x)\big) \quad \text{s.t.} \quad w^*_\alpha = \arg\min_w \; \mathbb{E}_{(x,y') \in D'}\, \ell\big(g_\alpha(x, y'), f_w(x)\big) \quad (3)$$

where $\ell(\cdot)$ denotes a chosen differentiable loss function measuring the predictive error, and the subscript of $w$ emphasizes the dependency of the best main model $f$ on $\alpha$. We term this framework Meta Label Correction (MLC); Figure 1 provides an overview.

In this bi-level formulation, since the LCN is parameterized by $\alpha$, $\alpha$ are the upper (or meta) parameters, while the main model parameters $w$ are the lower (or main) parameters. As in much other work involving bi-level optimization, exactly solving Problem (3) requires solving for the optimal $w^*$ whenever $\alpha$ is updated. This is often analytically infeasible when the main model $f$ is complex, such as a deep neural network, and is also computationally expensive. Instead of solving for the optimal $w^*$ for each $\alpha$, we use a $k$-step look-ahead SGD update for $w$ as an estimate of the optimal main model for a given $\alpha$:

$$w'(\alpha) \approx w - k\eta \nabla_w L_{D'}(\alpha, w) \quad (4)$$

where $L_{D'}(\alpha, w) = \mathbb{E}_{(x,y') \in D'}\, \ell\big(g_\alpha(x, y'), f_w(x)\big)$. The proxy optimization problem then becomes

$$\min_\alpha \; L_D(w'(\alpha)) = \mathbb{E}_{(x,y) \in D}\, \ell\big(y, f_{w'(\alpha)}(x)\big) \quad (5)$$

Algorithm 1 outlines an iterative procedure to solve the above proxy problem with a $k$-step look-ahead SGD for the main model:

Algorithm 1: MLC - Meta Label Correction
  while not converged do
      update meta parameters $\alpha$ by descending $\nabla_\alpha L_D(w - k\eta \nabla_w L_{D'}(\alpha, w))$;
      update model parameters $w$ by descending $\nabla_w L_{D'}(\alpha, w)$;
  end

The above meta learning algorithm involves computing an expensive second-order partial derivative $\nabla^2_{\alpha, w} L_{D'}(\alpha, w)$ followed by a matrix-vector product. To speed up training, we approximate the second-order gradients with finite differences, as follows:

$$\nabla_\alpha L_D(w - k\eta \nabla_w L_{D'}(\alpha, w)) = -k\eta \nabla^2_{\alpha, w} L_{D'}(\alpha, w)\, \nabla_{w'} L_D(w') \quad (6)$$
$$= -k\eta \nabla_\alpha \left( \nabla_w^\top L_{D'}(\alpha, w)\, \nabla_{w'} L_D(w') \right) \quad (7)$$
$$\approx -\frac{k\eta}{2\epsilon} \left( \nabla_\alpha L_{D'}(\alpha, w^+) - \nabla_\alpha L_{D'}(\alpha, w^-) \right) \quad (8)$$

where $w^\pm = w \pm \epsilon \nabla_{w'} L_D(w')$ and $w' = w - k\eta \nabla_w L_{D'}(\alpha, w)$. A similar approximation strategy has been adopted in related meta-learning work (Liu et al., 2018; Finn et al., 2017).
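To make the update concrete, here is a minimal PyTorch-style sketch of one iteration of Algorithm 1 using the finite-difference approximation of Eqs. (6)-(8). This is an illustration under simplifying assumptions rather than the authors' released implementation: the $k$-step look-ahead of Eq. (4) is collapsed into a single step of size $k\eta$, and `mlc_step`, `soft_cross_entropy`, and all argument names are ours.

```python
import copy
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    # CE_soft(p, q) = -sum_i q_i log p_i with p = softmax(logits); see Section 3.2.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def mlc_step(main_model, meta_model, main_opt, meta_opt,
             noisy_x, noisy_y, clean_x, clean_y, k_eta, eps=1e-2):
    """One iteration of Algorithm 1 (a sketch; names are illustrative)."""
    # Look-ahead main model: w' = w - k*eta * grad_w L_D'(alpha, w), Eq. (4).
    lookahead = copy.deepcopy(main_model)
    y_c = meta_model(noisy_x, noisy_y).detach()            # corrected soft labels
    noisy_loss = soft_cross_entropy(lookahead(noisy_x), y_c)
    g_w = torch.autograd.grad(noisy_loss, lookahead.parameters())
    with torch.no_grad():
        for p, g in zip(lookahead.parameters(), g_w):
            p -= k_eta * g

    # grad_{w'} L_D(w'): clean-set gradient at the looked-ahead weights.
    clean_loss = F.cross_entropy(lookahead(clean_x), clean_y)
    g_wp = torch.autograd.grad(clean_loss, lookahead.parameters())

    # Finite differences, Eq. (8): evaluate grad_alpha L_D' at w+ and w-.
    def grad_alpha_at(sign):
        shifted = copy.deepcopy(main_model)
        with torch.no_grad():
            for p, g in zip(shifted.parameters(), g_wp):
                p += sign * eps * g                        # w± = w ± eps * grad
        loss = soft_cross_entropy(shifted(noisy_x), meta_model(noisy_x, noisy_y))
        return torch.autograd.grad(loss, meta_model.parameters())

    g_plus, g_minus = grad_alpha_at(+1.0), grad_alpha_at(-1.0)

    # Meta update: descend -k*eta/(2*eps) * (g+ - g-), per Eq. (8).
    meta_opt.zero_grad()
    for p, gp, gm in zip(meta_model.parameters(), g_plus, g_minus):
        p.grad = -k_eta / (2 * eps) * (gp - gm)
    meta_opt.step()

    # Main update: descend grad_w L_D'(alpha, w) with the corrected labels fixed.
    main_opt.zero_grad()
    soft_cross_entropy(main_model(noisy_x), y_c).backward()
    main_opt.step()
```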
3.1 CONVERTING A CLASSIFIER TO A LABEL CORRECTION NETWORK

There are multiple ways to build mappings from $(x, y')$ to the corrected label with deep neural networks serving as the desired label correction network. In essence, $g_\alpha(x, y')$ also behaves like a classifier, with the only difference from a conventional classifier being that it also takes the noisy label $y'$ as input. To ease the effort of designing and working with the MLC framework, we explored several simple strategies that do not require heavy-weight modifications to existing architectures. The one adopted in all our experiments constructs the LCN as a weighted combination of a classifier $h(x)$ and the weak label $y'$ itself, i.e.,

$$g(x, y') \equiv \lambda(x)\, h(x) + (1 - \lambda(x))\, y' \quad (9)$$

where $\lambda$ is a data-dependent scalar controlling the mixing weights. We found it helps to have a separate $\lambda$ for each class, so that different weak classes fed in are paired with different weights. Doing so only requires modifying the last layer of the classifier $h(x)$ to output a vector of dimension $2C$ ($C$ dimensions for the class logits and the remaining $C$ dimensions for the weak-label weights $\lambda$), instead of $C$ for the original classifier (where $C$ is the number of classes).
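A minimal sketch of such an LCN head (the module and argument names are ours; we assume the per-class mixing weight $\lambda$ is produced through a sigmoid and the weak label enters Eq. (9) as a one-hot vector):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelCorrectionNetwork(nn.Module):
    """Sketch of Eq. (9): g(x, y') = lambda(x) h(x) + (1 - lambda(x)) y'.
    `backbone` is any feature extractor producing `feat_dim` features."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        # The last layer outputs 2C values: C class logits plus C per-class
        # mixing weights, as described above.
        self.head = nn.Linear(feat_dim, 2 * num_classes)
        self.num_classes = num_classes

    def forward(self, x, y_weak):
        # y_weak: integer weak labels of shape (batch,).
        out = self.head(self.backbone(x))
        logits, weight_logits = out.chunk(2, dim=-1)
        h = F.softmax(logits, dim=-1)                       # classifier h(x)
        # Select the mixing weight paired with the weak class that was fed in.
        lam = torch.sigmoid(weight_logits.gather(1, y_weak.unsqueeze(1)))
        y_onehot = F.one_hot(y_weak, self.num_classes).float()
        return lam * h + (1.0 - lam) * y_onehot             # soft label y_c
```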
3.2 SOFT CROSS ENTROPY LOSS FOR LEARNING WITH WEAK SUPERVISION

In the classification scenario, when a clean label is given for a data example, the cross entropy loss is typically used to train the classifier. Here, we demonstrate how, with a soft label generated from the label correction network, a soft cross entropy loss can be beneficial in the weakly supervised setting. Denote the soft label as $q$, a dense vector with $\sum_i q_i = 1$, typically resulting from a softmax layer, and denote the predicted probability of the main classifier as $p$ with $\sum_i p_i = 1$. In this setup, the original cross entropy loss defined for hard labels naturally extends to

$$\mathrm{CE}_{\text{soft}}(p, q) = -\sum_i q_i \log p_i = \sum_i q_i \log \frac{q_i}{p_i} - \sum_i q_i \log q_i = \mathrm{KL}(q \,\|\, p) + \mathrm{entropy}(q) \quad (10)$$

Minimizing this loss w.r.t. the parameters of the main model is equivalent to (with the meta model, and thus $q$, fixed) minimizing the KL divergence between the soft label and the predicted distribution, as in the hard-label case. When updating the parameters of the meta model, minimizing this loss is equivalent to (with the main model, and thus $p$, fixed) minimizing both the KL divergence between the soft label and the predicted distribution, and also the entropy of the soft labels predicted by the meta model, which is desirable since we would like the labeling distribution to be as close to a discrete one as possible.
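The loss of Eq. (10) and its decomposition can be checked numerically in a few lines (a sketch; the helper name is ours):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, q):
    """Soft cross entropy of Eq. (10): -sum_i q_i log p_i with p = softmax(logits)."""
    return -(q * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Numerical check of CE_soft(p, q) = KL(q || p) + entropy(q).
logits = torch.randn(4, 10)
q = F.softmax(torch.randn(4, 10), dim=-1)    # soft labels from a softmax layer
p = F.softmax(logits, dim=-1)
kl = (q * (q.log() - p.log())).sum(-1).mean()
ent = -(q * q.log()).sum(-1).mean()
assert torch.allclose(soft_cross_entropy(logits, q), kl + ent, atol=1e-5)
```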
3.3 k-STEP LOOK-AHEAD SGD IN META MODEL LEARNING
We found it crucial to use a value of $k$ greater than 1 for MLC to ensure model convergence, particularly in the early phases of training, when both the main model and the meta model are close to random predictors and lack confidence in their outputs. Using $k = 1$ is likely to confuse both models, so that they fail to converge. This was not the case for previous similar work with meta-learning; however, we find it crucial here, as the main model in MLC is not directly trained on any clean examples, so slightly more exploration from the main model is likely to help training convergence. We explore this aspect in Section 4.6.1. Due to this requirement, in all our experiments we use a schedule for $k$, starting from 1500 and decreasing to 500 towards the end of model training, as sketched below.
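The schedule for $k$ can be as simple as a linear decay over training steps; the exact shape of the schedule is not specified in the text, so the following one-liner is an assumption (and the function name is ours):

```python
def k_schedule(step, total_steps, k_start=1500, k_end=500):
    # Linearly decay the look-ahead horizon k from k_start to k_end.
    frac = min(1.0, step / max(1, total_steps - 1))
    return int(round(k_start + (k_end - k_start) * frac))
```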
4 EXPERIMENTS

To test the performance of MLC, we conduct experiments on a set of classification tasks from both the text and vision domains, and compare results with previous state-of-the-art approaches for learning with weak/noisy labels, under different weak-label scenarios.

4.1 WEAK SUPERVISION GENERATION
To generate weak supervision data, for each data set we test on, we sample a portion of the entire training set as the clean set. The noisy dataset is generated by corrupting the labels of all the remaining data points based on one of the following three settings:

• Uniform mixture (UNIF)
• Flipped labels (FLIP)
• Weak labels from trained weak classifiers (WEAK)

The first two methods follow the same procedure adopted by (Hendrycks et al., 2018), either corrupting all classes uniformly or flipping a label to a different class. To generate the corrupted labels from the true labels, we first devise a corruption probability categorical distribution for each true class; across all classes, these corruption probabilities form a label corruption matrix $C$. Then, for an example with true label $i$, we sample the corrupted label from the categorical distribution parameterized by the $i$-th row of $C$. Note that this is a simplifying assumption under which the corrupted label does not depend on the data example itself; we nevertheless use it to ensure a fair comparison with (Hendrycks et al., 2018), where the same process was used for generating noisy data. To create noisy datasets with different levels of noise, we take a convex combination of an identity matrix and the corruption matrix, with the coefficient of the latter serving as an indicator of the noise level (Hendrycks et al., 2018), as sketched below.

To also study scenarios where the noise can depend on both the data and the label, we introduce a third, more realistic method: WEAK. In this method, weak labels are provided by separate (weak) predictive models that depend on both the data and the labels. To generate different levels of noise, we train multiple weak predictive models with varying accuracies, where a lower accuracy corresponds to a higher noise level. Note that at all noise levels, the weak predictive model performs better than random prediction.

Note that none of the methods is aware of the true label corruption probabilities, nor do they know which data samples in the noisy set are actually corrupted. Since UNIF and FLIP are similar to one another, we report results based on UNIF only and leave the FLIP results for the appendix.
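As an illustration of the UNIF procedure described above, the following sketch builds the noise-level-$p$ corruption matrix as a convex combination of the identity and a uniform mixture, then samples each weak label from the row of $C$ indexed by the true label (function and variable names are ours):

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_level, rng=None):
    """UNIF corruption sketch: C = (1 - p) * I + p * uniform, and each weak
    label is drawn from the row of C indexed by the true label."""
    rng = np.random.default_rng() if rng is None else rng
    uniform = np.full((num_classes, num_classes), 1.0 / num_classes)
    C = (1.0 - noise_level) * np.eye(num_classes) + noise_level * uniform
    return np.array([rng.choice(num_classes, p=C[y]) for y in labels])

# Example: corrupt a toy label vector at noise level 0.4.
weak_labels = corrupt_labels(np.array([0, 1, 2, 2]), num_classes=3, noise_level=0.4)
```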
4.2 BASELINE METHODS

We test MLC mainly against the current state-of-the-art model for label correction (Hendrycks et al., 2018) (denoted GLC hereafter), in various settings where the labels in the noisy set are corrupted at different noise levels and with different ratios of clean data sampled from the entire training set. Note that GLC was shown to perform consistently better than other models that combine clean and weak labels using methods such as distillation (Li et al., 2017). For completeness, we also compare with the forward loss correction method proposed in (Sukhbaatar et al., 2014) (denoted Forward hereafter). In line with meta learning for learning with weak supervision, we also compare with the method that learns weights for training examples to build a robust classifier (Ren et al., 2018) (denoted L2R hereafter).

4.3 DATA SETS AND IMPLEMENTATION DETAILS
To ensure as fair a comparison with previous methods as possible, we experiment on a broad set of data collections from both the text and vision domains, making our best effort to match the pre-processing and hyper-parameter settings of previous methods when experimenting with them on new datasets that were not used in the original papers. We test on 10 different collections with varying data sizes. To compare with GLC, we test on all three text collections used by (Hendrycks et al., 2018) and on the MNIST dataset. The datasets are:
MNIST: The MNIST dataset contains 28 × 28 grayscale images of the digits 0-9. The training set has 60,000 images and the test set has 10,000 images. For preprocessing, we rescale the pixels to the interval [0, 1]. We train a 2-layer fully connected network with 128 hidden dimensions. We train with Adam for 20,000 iterations using batches of size 128 and a learning rate of 0.001 for the main model and 0.0001 for the meta model.

Twitter: The Twitter Part of Speech dataset (Gimpel et al., 2011) contains 1,827 tweets annotated with 25 POS tags. This is split into a training set of 1,000 tweets, a development set of 327 tweets, and a test set of 500 tweets. We use the development set to augment the training set. We use the same pretrained 50-dimensional word vectors as in (Hendrycks et al., 2018), and for each token we concatenate the word vectors in a fixed window centered on the token; these form our training and test sets. We use a window size of 3, train a 2-layer fully connected network with hidden size 256, and use the GELU nonlinearity (Hendrycks & Gimpel, 2016). We train with Adam (Kingma & Ba, 2014) for 20,000 iterations with batch size 128 and a learning rate of 0.001 for the main model and 0.0001 for the meta model. We use ℓ2 weight decay on all the weights.

SST: The Stanford Sentiment Treebank dataset consists of single-sentence movie reviews (Socher et al., 2013). We use the 2-class version (i.e., SST2), which has 6,911 reviews in the training set, 872 in the development set, and 1,821 in the test set. We follow the same data and model setups as in (Hendrycks et al., 2018); the classifier is a word-averaging model with an affine output layer. We use the Adam optimizer for 10,000 iterations with batch size 50 and learning rate 0.001. For regularization, we use ℓ2 weight decay on the output layer.

IMDB: The IMDB dataset contains 50k movie reviews, with 25k positive and 25k negative. We use a one-layer LSTM (Hochreiter & Schmidhuber, 1997) for both the main model and the meta model.

The above datasets are relatively small collections, though meaningful for demonstrating the different methods for learning with weak supervision with simple classifier architectures. We also include a range of 6 large-scale text classification benchmark datasets:
AG News: AG is a text classification dataset derived from a large collection of news articles gathered from more than 2,000 news sources. Each news article is categorized into 1 of 4 classes: World, Sports, Business, and Sci/Tech. This dataset has 120,000 training examples and 7,600 examples for testing.

Amazon Reviews (Amazon-2 and Amazon-5): The Amazon-2 and Amazon-5 datasets contain randomly sampled customer reviews from Amazon. Amazon-2 is a binary polarity rating classification dataset, while Amazon-5 is a rating classification dataset on a scale from 1 to 5.

Yelp Reviews (Yelp-2 and Yelp-5): The Yelp-2 and Yelp-5 datasets are constructed from the Yelp Dataset Challenge 2015 data, for binary polarity rating classification and 5-way rating classification, respectively. Yelp-2 contains 560,000 training samples and 38,000 testing samples, and Yelp-5 contains in total 650,000 training samples and 50,000 testing samples.
DBpedia: DBpedia is a crowd-sourced project aiming to extract structured information from Wikipedia. The DBpedia dataset covers 14 non-overlapping ontology classes from DBpedia. Each class contains 40,000 training samples and 5,000 testing samples; hence, the full dataset has 560,000 training samples and 70,000 testing samples.

For all the large-scale text classification datasets (AG, Amazon-2 and -5, Yelp-2 and -5, and DBpedia), we adopt a pre-trained BERT-base (Devlin et al., 2018) model for both the main network and the meta network. This ensures that we can test the ability of MLC in the weakly supervised setting with strong state-of-the-art base models.

We implement all models and experiments in PyTorch (https://pytorch.org). To ensure fair comparison, we adopt the same main network architecture as previous best methods as much as possible, with a comparable number of parameters. A brief overview of the neural net architectures used in the various settings is listed in Table 1 (refer to the appendix for a detailed description of the model architectures). Code for reproducing the results in this paper will be made publicly available.

4.4 MAIN RESULTS
MLC with MLP, LSTM: We investigate multiple settings with an extensive set of configurations, i.e., two noise types, different noise levels, and different clean ratios. Table 2 presents the accuracies averaged across all these configurations, with each one repeated 5 times. Notice that the results vary per dataset when the noise is generated using UNIF (i.e., noise is independent of the data). On the other hand, the performance of all methods drops when we use WEAK (noise depends on the data and the label), showing that this is a more realistic and challenging setting. We also observe that MLC performs consistently better in this case.

Table 2: Mean accuracies over an extensive set of experimental configurations. Each cell represents an average over 2 noise types (UNIF and WEAK), 3 clean ratios (0.1%, 1.0% and 5%), 11 noise levels for UNIF (0-1.0 with 0.1 step) and 3 different weak classifiers for WEAK. Every experiment was repeated 5 times.

Datasets   Twitter   SST     IMDB    MNIST
UNIF
Forward    0.484
GLC
L2R        0.763     0.614   0.702   0.905
MLC
WEAK
Forward    0.226     0.631   0.626   0.407
GLC        0.295     0.615
L2R
MLC

MLC with BERT:
Table 3 presents the error rates of MLC on 6 large text datasets with a pre-trained BERT-base as both its main model and meta model. Note that these are much larger-scale datasets and that the base models for both the meta and the main learner are BERT. We notice that MLC consistently outperforms the baselines (except for L2R on the AG dataset).

Table 3: Error rate comparison on 6 large text datasets
Datasets                                AG       Yelp-2   Yelp-5   Amazon-2   Amazon-5   DBpedia
Fully supervised (# labeled examples)   (120k)   (560k)   (650k)   (3.6m)     (3m)       (560k)
  BERT-LARGE (Xie et al., 2019)         -        1.89     29.32    2.63       34.17      0.64
SSL (# labeled examples)                -        (20)     (2.5k)   (20)       (2.5k)     (140)
  BERT-BASE-512                         -        13.60    41.00    26.75      44.09      2.58
  BERT-LARGE-512                        -        10.55    38.90    15.54      42.30      1.68
WSL (# labeled examples, p = 0. )       (60)     (20)     (2.5k)   (20)       (2.5k)     (140)
  GLC - BERT-BASE-128
  L2R - BERT-BASE-128
  MLC - BERT-BASE-128
4.5 DETAILED RESULTS
We investigate how the noise levels in the weak labels affect MLC training. Due to space limitations,we only present detailed results on Twitter and MNIST. Detailed experiments on SST and IMDB canbe found in the appendix.
Twitter. Figure 2 presents the detailed performance for different clean-data ratios and label noise levels. In the UNIF setting, both loss correction methods (GLC and MLC) work better than using only clean data to train the classifier, emphasizing the importance of incorporating the weakly supervised examples. With small fractions of clean data, MLC achieves consistently higher accuracies over the range of high noise levels, implying the robustness of MLC when severe noise is present. In the WEAK setting, where the weak labels are generated by weak classifiers, GLC performs worse than MLC since it assumes that the noisy labels depend only on the true label and not on the data. In contrast, MLC does not make such assumptions and gains a significant edge over the other methods.

[Figure 2 panels: test accuracy vs. noise level for Twitter under UNIF (1.0%, 5.0%, and 10.0% clean) and vs. weak classifier index under WEAK (1.0%, 5.0%, and 10.0% clean), comparing Forward, GLC, L2R, and MLC.]
Figure 2: Results on Twitter. All numbers reported are accuracies on the test set.
MNIST. Figure 3 presents the detailed performance for different gold-data ratios and corruption levels. Trends similar to those seen on Twitter and SST can be observed. In the UNIF setting, MLC is not as good as GLC and L2R at lower noise levels; however, in the upper range of noise levels, MLC catches up and takes the lead over both GLC and L2R. Moreover, in the WEAK setting, GLC again loses to MLC by a large margin due to its simplistic assumption about the noise.

[Figure 3 panels: test accuracy vs. noise level for MNIST under UNIF (0.1%, 1.0%, and 5.0% clean) and vs. weak classifier index under WEAK (0.1%, 1.0%, and 5.0% clean), comparing Forward, GLC, L2R, and MLC.]
Figure 3: Results on MNIST. All numbers reported are accuracies on the test set. Best results in terms of mean accuracies are printed in black.
[Figure 4: four panels over training steps, with curves for k = 1, 100, 200, 800, 1000.]

Figure 4: (a) Loss curve on noisy data; (b) loss curve on clean data; (c) entropy of the label correction distribution from the meta model; (d) test set accuracy over the iterations.

4.6 ANALYSIS AND ABLATION STUDIES
In this section, we examine the details of how MLC behaves in terms of training dynamics and what the meta network learns.

4.6.1 TRAINING DYNAMICS
Figure 4 shows the training progress for one run on the MNIST dataset. We monitor a set of different metrics during training, including the loss on the noisy data (with corrected labels), the loss on the clean data, and the entropy of the output distribution from the meta model (since it is a soft label). Another key factor is the parameter $k$ for the look-ahead SGD. It turns out that with $k = 1$ the model basically diverges; picking a value larger than 1 is thus crucial to MLC training.

4.6.2 META NET EVALUATION
After training, besides the main model that serves as the predictive model for inference, we also obtain the meta model, a trained label correction network. In this section, we investigate what the meta model actually learns after convergence. To do so, we follow the UNIF setting: we corrupt the labels of examples in the test set according to the label corruption matrix used in the weak label generation process and feed each corrupted test pair into the LCN to check whether it can recover the correct label. Note that by doing this we ensure that the MLC framework has not seen these instances during training. It is clear that after training, both the main and meta models have the ability to predict correct labels. The main network can be used for prediction, while the meta model serves as a good label correction network.
Figure 5: (Left) We test the label correction ability of the trained meta model from MLC as a classifier taking a pair of a data example and its noisy label as input, and assess its accuracy; for reference, we also plot the accuracies of the corresponding trained main model. (Right) Comparison of the discrepancies between the estimated label corruption matrix and the ground-truth one, for both GLC and MLC.

By measuring the MAE of the estimated label corruption matrix, we verify that the corrected label distribution aligns well with the ground truth in the UNIF setting (as does GLC's; however, GLC cannot be used to correct future unseen examples).
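The left-panel evaluation can be sketched as follows (a minimal illustration; the function and argument names are ours, and $C$ follows the row-per-true-label convention of Section 4.1):

```python
import torch

@torch.no_grad()
def lcn_label_recovery_accuracy(meta_model, x_test, y_true, C):
    """Corrupt each test label by sampling from the row of C indexed by the
    true label, feed the (x, weak label) pair to the trained LCN, and check
    whether the argmax of the corrected soft label recovers the true label."""
    weak = torch.multinomial(C[y_true], num_samples=1).squeeze(1)
    corrected = meta_model(x_test, weak)
    return (corrected.argmax(dim=-1) == y_true).float().mean().item()
```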
5 RELATED WORK
Labeled data largely determines whether a machine learning system can perform well on a task, as noisy or corrupted labels can cause a dramatic performance drop (Nettleton et al., 2010). The problem gets even worse when an adversarial rival intentionally injects noise into the labels (Reed et al., 2014). Thus, understanding, modeling, correcting, and learning with noisy labels has been of broad interest in the research community (Natarajan et al., 2013; Frénay & Verleysen, 2013).

Several works (Mnih & Hinton, 2012; Patrini et al., 2017; Sukhbaatar et al., 2014; Larsen et al., 1998) have attempted to address weak labels by modifying the model's architecture or by implementing a loss correction. (Sukhbaatar et al., 2014) introduced a stochastic variant to estimate label corruption; however, the method has to have access to the true labels, rendering it inapplicable when no true labels are present. Forward loss correction adds a linear layer to the end of the model, and the loss is adjusted accordingly to incorporate learning about the label noise. (Patrini et al., 2017) also make use of the forward loss correction mechanism and propose an estimate of the label corruption matrix that relies on strong assumptions and does not make use of clean labels that might be available for a portion of the data set.

Following (Charikar et al., 2017), we assume that during training the model has access to a small set of clean labels besides a large set of weak labels. This assumption has been leveraged by others for the purpose of label noise robustness, most notably (Veit et al., 2017; Li et al., 2017; Xiao et al., 2015; Ren et al., 2018). (Veit et al., 2017) use human-verified labels to train a label cleaning network by estimating the discrepancies between the noisy and clean labels in a multi-label classification setting. This assumes that, for a subset of the data, both trusted and noisy labels are available. Our work avoids this limitation by proposing a meta learning approach that does not require trusted and noisy labels to be available for the same instances.
6 CONCLUSIONS
In this paper, we address the problem of learning with weak supervision from a meta-learning perspective. Specifically, we propose to use a meta network to correct the noisy labels from the noisy data set, while a main classifier network is trained to fit each example to a provided label, i.e., corrected labels for the noisy examples and true labels for the clean examples. The meta network and main network are jointly optimized in a bi-level optimization framework; to address the computational challenge, we employ a k-step look-ahead SGD update for the model weights of the main model. Empirical experiments on several benchmark datasets, including text and grayscale images, demonstrate the effectiveness of MLC.

REFERENCES
Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47-60. ACM, 2017.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126-1135. JMLR.org, 2017.

Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845-869, 2013.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. ACL HLT 2011, pp. 42, 2011.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10477-10486, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Jan Larsen, L. Nonboe, Mads Hintz-Madsen, and Lars Kai Hansen. Design of robust neural network classifiers. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98, volume 2, pp. 1205-1208. IEEE, 1998.

Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122, 2015.

Volodymyr Mnih and Geoffrey E. Hinton. Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 567-574, 2012.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196-1204, 2013.

David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33(4):275-306, 2010.

Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944-1952, 2017.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355, 2016.

Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.

Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839-847, 2017.

Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691-2699, 2015.

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.
A ADDITIONAL RESULTS
A.1 SST AND IMDB
SST. Figure 6 presents the detailed performance for different gold-data ratios and corruption levels. On this binary classification task, it is surprising to observe that using clean data alone achieves results only slightly better than random guessing (an accuracy of 0.5). With the help of label correction for the noisy examples, performance improves by a large margin with GLC, and MLC improves over GLC in turn, demonstrating the potential power of using a meta network as a label correction procedure.

[Figure 6 panels: test accuracy vs. noise level for SST under UNIF and FLIP (0.1%, 1.0%, and 5.0% clean) and vs. weak classifier index under WEAK (0.1%, 1.0%, and 5.0% clean), comparing Forward, GLC, L2R, and MLC.]
Figure 6: Results on SST. All numbers reported are accuracies on the test set. For reference, using gold data only to train a model yields test accuracies of 0.541, 0.647 and 0.741 for the three gold data ratios, respectively.
IMDB. Figure 7 presents the detailed results for all three noise settings.

[Figure 7 panels: test accuracy vs. noise level for IMDB under UNIF and FLIP (0.1%, 1.0%, and 5.0% clean) and vs. weak classifier index under WEAK (0.1%, 1.0%, and 5.0% clean), comparing Forward, GLC, L2R, and MLC.]