Constrained Labeling for Weakly Supervised Learning
Chidubem Arachie (Department of Computer Science, Virginia Tech) and Bert Huang (Department of Computer Science and Data Intensive Studies Center, Tufts University)
[email protected], [email protected]
Abstract
Curation of large fully supervised datasets has become one of the major roadblocks for machine learning. Weak supervision provides an alternative to supervised learning by training with cheap, noisy, and possibly correlated labeling functions from varying sources. The key challenge in weakly supervised learning is combining the different weak supervision signals while navigating misleading correlations in their errors. In this paper, we propose a simple, data-free approach for combining weak supervision signals by defining a constrained space for the possible labels of the weak signals and training with a random labeling within this constrained space. Our method is efficient and stable, converging after a few iterations of gradient descent. We prove theoretical conditions under which the worst-case error of the randomized label decreases with the rank of the linear constraints. We show experimentally that our method outperforms other weak supervision methods on various text- and image-classification tasks.
Introduction

Recent successful demonstrations of machine learning have created an explosion of interest. The key driver of these successes is the progress in deep learning. Researchers in different fields and industries are applying deep learning to their work with varying degrees of success. Training deep learning models typically requires massive amounts of data, and in most cases this data needs to be labeled for supervised learning. The process of collecting labels for large training datasets is often expensive and can be a major bottleneck for practical machine learning.

To enable machine learning when labeled data is not available, researchers are increasingly turning to weak supervision. Weakly supervised learning involves training models using noisy labels. Using multiple sources or forms of weak supervision is common, as it provides diverse information to the model. However, each source of weak supervision has its own bias that can be transmitted to the model. Different weak supervision signals can also conflict, overlap, or, in the worst case, make dependent errors. Thus, a naive combination of these weak signals would hurt the quality of a learned model. The key problem then is how to reliably combine various sources of weak signals to train an accurate model.
To solve this problem, we propose constrained label learning (CLL), a method that processes various weak supervision signals and combines them to produce high-quality training labels. The idea behind CLL is that, given the weak supervision, we can define a constrained space for the labels of the unlabeled examples. The space will contain the true labels of the data, and any other label sampled from the space should be sufficient to train a model. We construct this space using the expected error of the weak supervision signals, and then we select a random vector from this space to use as training labels. Our analysis shows that the space of labels considered by CLL tightens around the true labels as we include more information in the weak signals, and that CLL is not confounded by redundant weak signals.

CLL takes as input (1) a set of unlabeled data examples, (2) multiple weak supervision signals that label a subset of the data and can abstain from labeling the rest, and (3) a corresponding set of expected error rates for the weak supervision signals. While the weak supervision signals can abstain on various examples, we require that their combination have full coverage on the training data. The expected error rates can be estimated if the weak supervision signals have been tested on historical data, or if a domain expert has knowledge about their performance. In cases where the expected error rates are unavailable, they can be treated as a hyperparameter.

We implement CLL as a stable, quickly converging, convex optimization over the candidate labels. CLL thus scales much better than many other weak supervision methods. We show in Section 4 experiments that compare the performance of CLL to other weak supervision methods. On a synthetic dataset, CLL trained with a constant error rate is only a few percentage points from matching the performance of supervised learning on a test set.
On real text and image classification tasks, CLL achieves superior performance over existing weak supervision methods on test data.
Related Work

Weakly supervised learning has gained prominence in recent years due to the need to train models without access to manually labeled data. The recent success of deep learning has exacerbated the need for large-scale data annotation, which can be prohibitively expensive. One weakly supervised paradigm, data programming, allows users to define labeling functions that noisily label a set of unlabeled data (Bach et al. 2019; Ratner et al. 2017, 2016). Data programming then combines the noisy labels to form probabilistic labels for the data by using a generative model to estimate the accuracies and dependencies of the noisy/weak supervision signals. This approach underlies the popular software package Snorkel (Ratner et al. 2017). Our method is related to this approach in that we use different weak signal sources and compile them into a single (soft) labeling. However, unlike Snorkel's methods, we do not train a generative model, and we avoid the need for probabilistic modeling assumptions. Recently, Snorkel MeTaL was proposed for solving multi-task learning problems with hierarchical structure (Ratner et al. 2018). A user provides weak supervision for the hierarchy of tasks, which is then combined in an end-to-end framework.

Another recently developed approach for weakly supervised learning is adversarial label learning (ALL) (Arachie and Huang 2019a). ALL was developed for training binary classifiers from weak supervision. ALL trains a model to perform well in the worst case for the weak supervision by simultaneously optimizing the model parameters and adversarial labels for the training data, subject to the constraint that the error of the weak signals on the adversarial labels be within provided error bounds. The authors also recently proposed Stoch-GALL (Arachie and Huang 2019b), an extension for multi-class classification that incorporates precision bounds.
Our work is related to ALL and Stoch-GALL in that we use the same error definition the authors introduced. However, the expected errors we use do not serve as upper-bound constraints for the weak signals. Additionally, CLL avoids the adversarial setting, which requires unstable simultaneous optimization of the estimated labels and the model parameters. Lastly, while ALL and Stoch-GALL require weak supervision signals to label every example, we allow for weak supervision signals that abstain on different data subsets.

Crowdsourcing has become relevant to machine learning practitioners as it provides a means to train machine learning models using labels collected from different crowd workers (Carpenter 2008; Gao, Barbier, and Goolsby 2011; Karger, Oh, and Shah 2011; Khetan, Lipton, and Anandkumar 2017; Liu, Peng, and Ihler 2012; Platanios et al. 2020; Zhou et al. 2015; Zhou and He 2016). The key machine learning challenge when crowdsourcing is to effectively combine the different labels obtained from human annotators. Our work is similar in that we try to combine different weak labels. However, unlike most methods for crowdsourcing, we cannot assume that the labels are independent of each other. Instead, we train the model to learn while accounting for dependencies between the various weak supervision signals.

Ensemble methods such as boosting (Schapire et al. 2002) combine different weak learners (low-cost, low-powered classifiers) to create classifiers that outperform the various weak learners. These weak learners are not weak in the same sense as weak supervision, and these strategies are defined for fully supervised settings. Although recent work has proposed leveraging unlabeled data to improve the accuracies of boosting methods (Balsubramani and Freund 2015), our setting differs since we do not expect to have access to labeled data.

A growing set of weakly supervised applications includes web knowledge extraction (Bunescu and Mooney 2007; Hoffmann et al. 2011; Mintz et al.
2009; Riedel, Yao, and McCallum 2010; Yao, Riedel, and McCallum 2010), visual image segmentation (Chen et al. 2014; Xu, Schwing, and Urtasun 2014), and tagging of medical conditions from health records (Halpern, Horng, and Sontag 2016). As better weakly supervised methods are developed, this set will expand to include other important applications.

We will show an estimation method connected to those developed to estimate the error of classifiers without labeled data (Dawid and Skene 1979; Jaffe et al. 2016; Madani, Pennock, and Flake 2005; Platanios, Blum, and Mitchell 2014; Platanios, Dubey, and Mitchell 2016; Steinhardt and Liang 2016). These methods rely on statistical relationships between the error rates of different classifiers or weak signals. Unlike these methods, we show in our experiments that we can train models even when we do not learn the error rates of classifiers. We show that using a maximum error estimate of the weak signals, CLL learns to accurately classify.

Like our approach, many other methods incorporate human knowledge or side information into a learning objective. These methods, including posterior regularization (Druck, Mann, and McCallum 2008) and generalized expectation (GE) criteria and its variants (Mann and McCallum 2008, 2010), can be used for semi- and weakly supervised learning. They work by providing parameter estimates as constraints to the objective function of the model, so that the label distribution of the trained model tries to match the constraints. In our approach, we incorporate human knowledge as error estimates into our algorithm. However, we do not use the constraints for model training. Instead, we use them to generate training labels that satisfy the constraints, and these labels can then be used downstream to train any model.
Constrained Label Learning

The goal of constrained label learning (CLL) is to return accurate training labels for the data given the weak supervision signals. The estimation of these labels should be aware of the correlation among the weak supervision signals and should not be confounded by it. Toward this goal, we use the weak signals' expected error to define a constrained space of possible labelings for the data. Any vector sampled from this space can then be used as training labels. We consider the setting in which the learner has access to a training set of unlabeled examples and a set of weak supervision signals from various sources that provide approximate indicators of the target classification for the data. Along with the weak supervision signals, we are provided estimates of the expected error rates of the weak signals. Formally, let the data be X = [x_1, ..., x_n]. These examples have corresponding labels y = [y_1, ..., y_n] ∈ {0, 1}^n. For multi-label classification, where each example may be labeled as a member of K classes, we expand the label vector to include an entry for each example-class combination, i.e., y = [y_(1,1), ..., y_(n,1), y_(1,2), ..., y_(n-1,K), y_(n,K)], where y_(i,j) is the indicator of whether the ith example is in class j. We represent the labels as a vector for later notational convenience, even though they may be more naturally arranged as a matrix.
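The vectorized arrangement described above can be sketched with a small helper. The class-major ordering below matches the description in the text; `flatten_labels` is our name for an illustrative utility, not code from the paper.

```python
import numpy as np

def flatten_labels(Y):
    """Flatten an (n, K) indicator label matrix into a length-(n*K)
    vector, grouping entries class by class (class-major order)."""
    # Column-major ("F") order stacks all examples for class 1,
    # then all examples for class 2, and so on.
    return Y.flatten(order="F")

# Three examples, three classes, one-hot labels.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
y_vec = flatten_labels(Y)
print(y_vec)  # class-1 block, then class-2 block, then class-3 block
```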
Figure 1:
Illustration of weak signals and the vectorized label structure. For multi-class problems, we arrange the label vector so that it contains indicators for each example belonging to each class. The weak signals use the same indexing scheme. In this illustration, weak signals w_1 and w_2 estimate the probability of each example belonging to class 1 and abstain on estimating membership in all other classes.

See Fig. 1 for an illustration of this arrangement. With weak supervision, the training labels y are unavailable. Instead, we have access to m weak supervision signals {w_1, ..., w_m}, where each weak signal w_i ∈ [0, 1]^n is represented as a vector of indicators that each example is in each class. The weak signals can choose to abstain on some examples. In that case, they assign a null value ∅ to that example's entry. In practice, weak signals for multi-class problems typically only label one class at a time, such as a one-versus-rest classification rule, so they effectively abstain on all out-of-class entries. The weak signals can be soft labels (probabilities) or hard labels (class assignments) of the data. In conjunction with the weak signals, the learner also receives the expected error rates of the weak signals ε = [ε_1, ..., ε_m]. In practice, the error rates of the weak signals are estimated or treated as a hyperparameter. The expected empirical error of a weak signal w_i is

ε_i = (1/n_i) [ (1[w_i ≠ ∅] ∘ w_i)^T (1 − y) + (1[w_i ≠ ∅] ∘ (1 − w_i))^T y ]
    = (1/n_i) [ (1[w_i ≠ ∅] ∘ (1 − 2w_i))^T y + w_i^T 1[w_i ≠ ∅] ],     (1)

where n_i = Σ 1[w_i ≠ ∅] and 1[w_i ≠ ∅] is an indicator that is 1 on examples the weak signal labels (i.e., does not abstain on). Hence, we only calculate the error of the weak signals on the examples they label. Analogously to Eq.
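Equation (1) can be checked with a small numerical sketch. Encoding abstentions as `np.nan` is our choice here; the paper does not prescribe an encoding.

```python
import numpy as np

def expected_error(w, y):
    """Empirical error of a weak signal (eq. 1). Abstentions are
    encoded as np.nan; only the n_i labeled examples count."""
    mask = ~np.isnan(w)                   # 1[w_i != null]
    n_i = mask.sum()
    wi, yi = w[mask], y[mask]
    err = (wi @ (1 - yi) + (1 - wi) @ yi) / n_i   # first form of eq. (1)
    alt = ((1 - 2 * wi) @ yi + wi.sum()) / n_i    # rearranged second form
    assert np.isclose(err, alt)           # the two forms agree
    return err

y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([1.0, 0.0, 0.0, np.nan])     # abstains on the last example
print(expected_error(w, y))               # 1/3: one mistake on three labeled examples
```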
(1), we can express the expected error of all weak signals for the label vector as a system of linear equations in the form Ay = c. To do this, we define each row of A as

A_i = 1[w_i ≠ ∅] ∘ (1 − 2w_i),     (2)

a linear transformation of a weak signal w_i. Each entry of the vector c is the difference between the scaled expected error of a weak signal and the sum of its non-abstaining entries, i.e.,

c_i = n_i ε_i − w_i^T 1[w_i ≠ ∅].     (3)

Valid label vectors then must be in the space

{ ỹ | A ỹ = c ∧ ỹ ∈ [0, 1]^n }.     (4)

The true label vector y is not known. Thus, we want to find training labels ỹ that satisfy the system of linear equations.

Algorithm
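A hypothetical construction of A and c from eqs. (2)-(3), again encoding abstentions as `np.nan`; this is our sketch of the construction, not the authors' implementation.

```python
import numpy as np

def build_constraints(W, eps):
    """Build A and c (eqs. 2-3) from an (m, n) weak-signal matrix W
    (np.nan where a signal abstains) and expected error rates eps."""
    mask = ~np.isnan(W)                      # 1[w_i != null], row-wise
    W0 = np.nan_to_num(W)                    # zero out abstentions
    A = mask * (1 - 2 * W0)                  # A_i = 1[w_i != null] o (1 - 2 w_i)
    n_i = mask.sum(axis=1)
    c = n_i * eps - (W0 * mask).sum(axis=1)  # c_i = n_i eps_i - w_i^T 1[...]
    return A, c

# Two signals over four examples, each abstaining on one example.
W = np.array([[1.0, 0.0, 0.0, np.nan],
              [np.nan, 1.0, 1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
eps = np.array([1/3, 1/3])                   # true errors of the two signals
A, c = build_constraints(W, eps)
print(np.allclose(A @ y, c))                 # the true labels satisfy A y = c
```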
Having defined the space of possible labelings for the data given the weak signals, we explain here how we efficiently sample a vector of training labels from the space. First, we initialize a random ỹ from a uniform distribution ỹ ~ U(0, 1)^n. Then we minimize a quadratic penalty on violations of the constraints defining the space. The objective function is

min_{ỹ ∈ [0,1]^n} ||A ỹ − c||².     (5)

The solution to this quadratic objective gives us feasible labels for the training data when they exist. In cases where the error estimates make the space infeasible, this quadratic penalty acts as a squared slack. We solve Eq. (5) iteratively using projected Adagrad (Duchi, Hazan, and Singer 2011), clipping the ỹ values to [0, 1]^n between gradient updates. One advantage of this approach is that it is fast and efficient, even for large datasets. Our algorithm is a simple quadratic convex optimization that converges to a unique optimum for each initialization of ỹ. In our experiments, it converges after only a few iterations of gradient descent. We fix the number of iterations of gradient descent for all our experiments. The full algorithm is summarized in the appendix.

Analysis
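A minimal sketch of the optimization in eq. (5). The paper uses projected Adagrad; plain projected gradient descent with an assumed step size and iteration count is used here for simplicity.

```python
import numpy as np

def cll_labels(A, c, iters=5000, lr=0.1, seed=0):
    """Minimize ||A y - c||^2 over the unit box with projected
    gradient descent, starting from a uniform random labeling."""
    rng = np.random.default_rng(seed)
    y_t = rng.uniform(0.0, 1.0, size=A.shape[1])
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ y_t - c)
        y_t = np.clip(y_t - lr * grad, 0.0, 1.0)  # project onto [0, 1]^n
    return y_t

# Tiny example: two binary, non-abstaining weak signals, so A = 1 - 2W.
A = np.array([[-1.0, 1.0, 1.0, 0.0],
              [0.0, -1.0, -1.0, 1.0]])
c = np.array([0.0, -1.0])
y_t = cll_labels(A, c)
# Here the constraints pin down the first and last labels, while the
# middle two only need to sum to 1: any such point is a valid labeling.
print(np.linalg.norm(A @ y_t - c))  # near zero when the space is feasible
```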
We start by analyzing the case where we have the true error rates ε, in which case the true label vector y is a solution in the feasible space. Although the true error rates are not available in practice, this ideal setting is the motivating case for the CLL approach. To begin the analysis, consider an extreme case: if A is a square matrix with full rank, then the only valid label ỹ in the space is the true label, ỹ = y. Typically, however, A is underdetermined, which means we have more data examples than weak signals. In this case, there are many solutions for ỹ, so we can analyze this space to understand how distant any feasible vector is from the vector of all incorrect labels. Since label vectors are constrained to be in the unit box, the farthest possible label vector from the true labels is (1 − y). The result of our analysis is the following theorem, which addresses the binary classification case with non-abstaining weak signals.

Theorem 1.
For any ỹ ∈ [0, 1]^n such that A ỹ = c, its Euclidean distance from the negated label vector (1 − y) ∈ {0, 1}^n is bounded below by

||ỹ − (1 − y)|| ≥ n ||A^+ (1 − 2ε)||,     (6)

where A^+ is the Moore-Penrose pseudoinverse of A.

Proof. We first relax the constrained space by removing the [0, 1]^n box constraints. We can then analyze the projection onto the feasible space:

min_ỹ ||(1 − y) − ỹ||  s.t.  A ỹ = c.     (7)

Define a vector z := ỹ − y. Since (1 − y) − ỹ = (1 − 2y) − z, we can rewrite the distance as

min_z ||(1 − 2y) − z||  s.t.  A z = 0.     (8)

The minimization is a projection of (1 − 2y) onto the null space of A. Since the null and row spaces of a matrix are complementary, (1 − 2y) decomposes into (1 − 2y) = P_row(1 − 2y) + P_null(1 − 2y), where P_row and P_null are orthogonal projections onto the row and null spaces of A, respectively. We can use this decomposition to rewrite the distance of interest:

||(1 − 2y) − P_null(1 − 2y)|| = ||(1 − 2y) − ((1 − 2y) − P_row(1 − 2y))|| = ||P_row(1 − 2y)||.     (9)

For any vector v, its projection into the row space of matrix A is A^+ A v, where A^+ is the Moore-Penrose pseudoinverse of A. The distance of interest is thus ||A^+ A (1 − 2y)||. We can use the definition of A to further simplify. Let W be the matrix of weak signals, W = [w_1, ..., w_m]^T, so that in the non-abstaining case A = 1 − 2W. Then the distance is

||A^+ (1 − 2W)(1 − 2y)|| = ||A^+ ((1 − 2W) 1_n − 2(1 − 2W) y)|| = ||A^+ (n1 − 2W 1_n − 2A y)||.     (10)

Because A y = c = nε − W 1_n, terms cancel, yielding the bound in the theorem:

||A^+ (n1 − 2W 1_n − 2nε + 2W 1_n)|| = ||A^+ (n1 − 2nε)|| = n ||A^+ (1 − 2ε)||.     (11)

This bound provides a quantity that is computable in practice. However, to gain an intuition about what factors affect its value, the distance formula can be further analyzed by using the singular-value decomposition (SVD) formula for the pseudoinverse. Consider the SVD A = U Σ V^T.
Then A^+ = V Σ^+ U^T, where the pseudoinverse Σ^+ contains the reciprocal of each nonzero singular value along the diagonal (and zeros elsewhere). The distance simplifies to

n ||V Σ^+ U^T (1 − 2ε)|| = n ||Σ^+ U^T (1 − 2ε)||,     (12)

since V is orthonormal. Furthermore, let p = U^T (1 − 2ε), i.e., p is a rotation of the centered error rates of the weak signals with the same norm as (1 − 2ε). From this change of variables, we can decompose the distance into

n ||Σ^+ p|| = n sqrt(σ_1² p_1² + ... + σ_m² p_m²),     (13)

where σ_j is the jth diagonal entry of Σ^+, i.e., the reciprocal of the jth nonzero singular value of A.

As this distance grows toward sqrt(n), the space of possible labelings shrinks toward zero, at which point the only feasible label vectors are close to the true labels y. Equation (13) indicates that the distance increases roughly as the rank of A increases, in which case the number of nonzero singular values in Σ^+ increases, irrespective of how many actual weak signals are given. Thus, redundancy in the weak supervision does not affect the performance of CLL. The other key factor in the distance is how far from 0.5 the errors ε are.
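The equivalence between the pseudoinverse form of the bound and its SVD form, and the fact that the bound never exceeds sqrt(n), can be checked numerically. This is a sketch with randomly generated signals, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5
y = rng.integers(0, 2, size=n)            # true binary labels
W = rng.uniform(0, 1, size=(m, n))        # non-abstaining soft weak signals
eps = (W @ (1 - y) + (1 - W) @ y) / n     # true error rates (eq. 1)
A = 1 - 2 * W

# Theorem 1 bound: n * ||A^+ (1 - 2 eps)||
bound = n * np.linalg.norm(np.linalg.pinv(A) @ (1 - 2 * eps))

# Equivalent SVD form, eq. (12): n * ||Sigma^+ U^T (1 - 2 eps)||
U, s, Vt = np.linalg.svd(A, full_matrices=False)
p = U.T @ (1 - 2 * eps)
bound_svd = n * np.sqrt(np.sum((p / s) ** 2))

print(np.isclose(bound, bound_svd))       # True: the two forms agree
print(bound <= np.sqrt(n) + 1e-9)         # True: the bound is at most sqrt(n)
```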
Figure 2:
Error of CLL estimated labels compared to majority vote as we increase the rank of A by replacing redundant weak signals with linearly independent weak signals.

These quantities can be interpreted as the diversity and number of the weak signals (corresponding to the rank) and their accuracies (the magnitude of p).

Though the analysis is for length-n label vectors, it is straightforwardly extended to multi-label settings with length-(nK) label vectors. And with careful indexing and tracking of the abstaining indicators, the same form of analysis can apply to abstaining weak signals.

Figure 2 shows an empirical validation of Theorem 1 on a synthetic experiment. We plot the error of the labels returned by CLL and majority voting as we change the rank of A. We start with 100 redundant weak signals by generating a matrix A whose 100 rows are copies of the same weak signal, giving it a rank of 1. We then iteratively increase the rank of A by replacing copies of the weak signal with random vectors from the uniform distribution. The error of the CLL labels approaches zero as the rank of the matrix increases, while the majority-vote error does not improve significantly. See the appendix for more details on this synthetic data setup.

Error Estimation
In our analysis, we assume that the expected error rates of theweak signals are available. This may be the case if the weaksignals have been evaluated on historical data or if an expertprovides the error rates. In practice, users typically defineweak supervision signals whose error rates are unknown.In this section, we discuss two approaches to handle suchsituations. We test these estimation techniques on real andsynthetic data in our experiments, finding that CLL with thesestrategies forms a powerful weakly supervised approach.
Agreement Rate Method
Estimating the error rates of binary classifiers using their agreement rates was first proposed by Platanios, Blum, and Mitchell (2014). They propose two different objective functions for solving for the error rates of classifiers using their agreement rates as constraints. An implementation of data programming (Ratner et al. 2016) solves for the accuracies of the weak signals using the difference between the agreement and disagreement rates of the weak signals. It assumes that, if each weak signal is conditionally independent given the true labels, then the accuracies of the weak signals can be related to a matrix containing the differences between the agreement and disagreement rates of the signals. To estimate the accuracies,
Figure 3:
Accuracy of constrained label learning as we increase the error rates from 0 to 1 on a binary dataset (SST-2) and from 0 to 0.5 on a multiclass dataset (Fashion-MNIST).

they solve a matrix-completion problem to find a low-rank factorization for the weak signal accuracies. We implemented this method and report its performance in our synthetic experiment (see Section 4). The one-vs-all form of the weak signals on our real datasets violates the assumption that each weak signal makes predictions on all the classes, so we cannot use the agreement rate method on our real data.
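One simple instance of the agreement-rate idea is the closed-form "triplet" estimate for three conditionally independent binary signals. This specific formula is our illustration of the principle, not the exact objective of Platanios, Blum, and Mitchell (2014).

```python
import numpy as np

def triplet_accuracies(w1, w2, w3):
    """Estimate the accuracies of three conditionally independent
    binary signals from pairwise agreement rates. With q_i = 2*acc_i - 1,
    P(w_i == w_j) = (1 + q_i * q_j) / 2, so each product q_i * q_j is
    observable and each q_i follows in closed form."""
    def q_prod(a, b):
        return 2.0 * np.mean(a == b) - 1.0    # estimates q_i * q_j
    q12, q13, q23 = q_prod(w1, w2), q_prod(w1, w3), q_prod(w2, w3)
    q = np.sqrt(np.array([q12 * q13 / q23,
                          q12 * q23 / q13,
                          q13 * q23 / q12]))
    return (1.0 + q) / 2.0

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200_000)
noisy = lambda acc: np.where(rng.random(y.size) < acc, y, 1 - y)
w1, w2, w3 = noisy(0.9), noisy(0.8), noisy(0.7)
accs = triplet_accuracies(w1, w2, w3)
print(np.round(accs, 2))   # approximately [0.9, 0.8, 0.7]
```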
Maximum Error Rate
In our experiments with real datasets, we train CLL using a pessimistic estimate of the maximum error rate of the weak signals. We assume that, in the worst case, the weak signals are only slightly better than random for binary classification and have the one-vs-all baseline error rate for multiclass classification. We assign a single constant value to all the weak signals as an estimate of their maximum error rates. Similar ideas were used for ALL (Arachie and Huang 2019a), which was shown to learn effectively with a constant for the error rates of all weak signals on its binary classification datasets. We test this approach in our experiments and extend it to handle weak supervision signals that abstain and multi-class data. Figure 3 plots the accuracy of the generated labels as we increase the error-rate parameter. On the binary-class SST-2 dataset, the label accuracy remains similar if the error rate is set between 0 and 0.5 and drops for larger values. On the multiclass Fashion-MNIST data, we notice similar behavior: the label accuracies are similar between 0.05 and 0.1 and drop with larger values. We surmise that this behavior mirrors the type of weak supervision signals we use in our experiments. The weak signals in our real experiments are one-vs-all signals; hence a baseline signal (guessing 0 on all examples) will have an error rate of 1/K. Performance deteriorates when the error rate is worse than this baseline rate.

Experiments

We test constrained label learning on a variety of text and image classification tasks. First, we measure the test accuracy of CLL on a synthetic dataset and compare its performance to that of supervised learning and other baselines. Second, we validate our approach on real datasets. For all our experiments, we compare CLL to other weakly supervised methods: data programming (DP) (Ratner et al.
Table 1: Classification accuracies of the different methods on synthetic data using dependent weak signals. We report the mean and standard deviation over three trials.

Table 2: Classification accuracies of the different methods on synthetic data using independent weak signals. We report the mean and standard deviation over three trials.
Synthetic Experiment
We construct a toy dataset for a binary classification task where the data has 200 randomly generated binary features and 20,000 examples: 16,000 for training and 4,000 for testing. Each feature has between 50% and 70% correlation with the true label. We define two scenarios for our synthetic experiments, running the methods using (1) dependent weak signals and (2) independent weak signals. In both experiments, we use weak signals that have partial coverage on the data and that conflict in their label assignments. The dependent weak signals were constructed by generating one weak signal that is copied noisily 9 times (randomly flipping a fraction of the labels). The independent weak signals are randomly generated to have accuracies within a fixed range.

We report in Table 1 and Table 2 the label and test accuracies from running CLL using the true error rates of the weak signals, error rates estimated via the agreement rates described in Section 3, and error rates set to a maximum-error-rate constant of 0.4 as the expected error for all the weak signals. CLL trained using the true ε obtains the highest test accuracy compared to the other baselines, and its performance almost matches that of supervised learning in Table 2. With the true bounds, CLL slightly outperforms CLL trained using estimated and constant ε. More interestingly, the results in Table 1 show that our method outperforms other baselines that are strongly affected by the dependence in the weak signals. The generative model of data programming assumes that the weak signals are independent given the true labels, but this is not the case in this setup, as the weak signals are strongly dependent. The conditional independence violation thus hurts its performance and essentially reduces it to performing a majority vote on the labels.

Since our evaluation in Fig. 3 demonstrated that CLL is not very sensitive to the choice of error rate, we use the maximum error rates from the plots as the error rates of the weak signals. We set ε = 0.4 on binary-class datasets and ε = 1/K on multi-class datasets. We choose these values because lower ε values will overconstrain the problem such that the feasible space may not contain the true labels.

Table 3: Label accuracies of CLL compared to other weak supervision methods on different text classification datasets. We report the mean and standard deviation over three trials. CLL is trained using ε = 0.4 on the binary text classification datasets and ε = 1/K on the TREC-6 dataset.

Table 4: Test accuracies of CLL compared to other weak supervision methods on different text classification datasets. We report the mean and standard deviation over three trials. CLL is trained using ε = 0.4 on the binary text classification datasets and ε = 1/K on the TREC-6 dataset.

Real Experiments
The datasets for our real experiments and their weak-signal generation processes are described below. Additional details about the data and the weak signals are in the appendix.
IMDB
The IMDB dataset (Maas et al. 2011) is used for sentiment analysis. The data contains reviews of different movies, and the task is to classify user reviews as either positive or negative in sentiment. We provide weak supervision by measuring mentions of specific words in the movie reviews. We created a set of positive words that weakly indicate positive sentiment and negative words that weakly indicate negative sentiment. We chose these keywords by looking at samples of the reviews and selecting popular words used in them. Many reviews contain both positive and negative keywords, and in these cases the weak signals conflict on their labels. We split the dataset into training and testing subsets, where any example that contains one of our keywords is placed in the training set. Thus, the test set consists of reviews that are not labeled by any weak signal, making it important for the weakly supervised learning to generalize beyond the weak signals. The dataset contains 50,000 reviews, of which 35,790 are used for training and 14,210 are test examples.
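The keyword-based weak signals can be sketched as follows. The keyword lists here are hypothetical stand-ins for the ones in the paper's appendix.

```python
import numpy as np

# Illustrative keyword lists; the paper's actual lists are in its appendix.
POSITIVE = {"great", "excellent", "wonderful"}
NEGATIVE = {"terrible", "boring", "awful"}

def keyword_signal(docs, keywords, label):
    """One weak signal: assign `label` when any keyword occurs in a
    document, and abstain (np.nan) otherwise."""
    out = np.full(len(docs), np.nan)
    for i, doc in enumerate(docs):
        if keywords & set(doc.lower().split()):
            out[i] = label
    return out

docs = ["A great movie", "Boring and awful", "Shot in Paris"]
w_pos = keyword_signal(docs, POSITIVE, 1.0)
w_neg = keyword_signal(docs, NEGATIVE, 0.0)
print(w_pos, w_neg)  # each signal abstains where its keywords are absent
```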
SST-2
The Stanford Sentiment Treebank (SST-2) is another sentiment analysis dataset (Socher et al. 2013) containing movie reviews. As with the IMDB dataset, the goal is to classify user reviews as having either positive or negative sentiment. We use similar keyword-based weak supervision but with different keywords. We use the standard train-test split provided by the original dataset. While the original training data contains 6,920 reviews, our weak signals only cover 3,998 examples, so we use this reduced set to train our model. We use the full test set of 1,821 reviews.
YELP-2
We used the Yelp review dataset containing user reviews of businesses from the Yelp Dataset Challenge in 2015. As with the IMDB and SST-2 datasets, the goal is to classify user reviews as having either positive or negative sentiment. We converted the star ratings in the dataset by labeling reviews rated above 3 stars as positive and the rest as negative. We used the same weak-supervision generating process as in SST-2. We sampled 50,000 reviews for training and 10,000 for testing from the original dataset. Our weak signals only cover 45,370 data points, so we used this reduced set to train our model.
TREC-6
TREC is a question classification dataset consisting of fact-based questions divided into different categories (Li and Roth 2002). The task is to predict the category each question belongs to. We use the six-class version (TREC-6), from which we use 4,988 examples for training and 500 for testing. The weak supervision we use combines word mentions with other heuristics we defined that analyze the structure of a question and assign a class label when certain patterns occur.
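A hypothetical pattern-based heuristic of this kind might look like the following. The class names follow the standard TREC-6 label set; the rules themselves are illustrative, not the paper's.

```python
# Hypothetical pattern rules in the spirit of the TREC-6 weak signals;
# the paper's actual heuristics are described in its appendix.
RULES = [
    (("who", "whose"), "HUM"),       # questions about people
    (("where",), "LOC"),             # questions about places
    (("when", "what year"), "NUM"),  # questions about numbers/dates
]

def pattern_signal(question, patterns, label):
    """Weak signal: emit `label` if the question starts with one of
    the patterns, and abstain (None) otherwise."""
    return label if question.lower().startswith(patterns) else None

q = "Who wrote the opera?"
votes = [pattern_signal(q, pats, lab) for pats, lab in RULES]
print(votes)  # only the matching heuristic votes; the others abstain
```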
SVHN
The Street View House Numbers (SVHN) dataset (Netzer et al. 2018) represents the task of recognizing digits on real images of house numbers taken by Google Street View. Each image is a 32 × 32 RGB vector. The dataset has 10 classes, with 73,257 training images and 26,032 test images. We define 50 weak signals for this dataset: we augment 40 human-annotated weak signals (four per class) with ten pseudolabel predictions, one per class, from a model trained on 1% of the training data. The human-annotated weak signals are nearest-neighbor classifiers where a human annotator is asked to mark distinguishing features on an exemplar image belonging to a specific class. We then calculate pairwise Euclidean distances between the pixels in the marked region across images. We convert the Euclidean scores to probabilities (soft labels for the examples) via a logistic transform. Through this process, an annotator is guiding the design of a simple one-versus-rest classifier, where images most similar to the reference image are more likely to belong to its class.

Table 5: Label accuracies of CLL compared to other weak supervision methods (Data Programming, Average, Stoch-GALL, and CLL with the true ε) on the image datasets. We report the mean and standard deviation over three trials. CLL is trained using a constant error estimate ε on the datasets.

Table 6: Test accuracies of CLL compared to other weak supervision methods (Data Programming, Average, Stoch-GALL, CLL with the true ε, and supervised learning) on the image datasets. We report the mean and standard deviation over three trials. CLL is trained using a constant error estimate ε on the datasets.
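The human-annotated nearest-neighbor weak signals described above can be sketched as follows. The function name and the exact logistic scaling are assumptions; the text only states that Euclidean distances on the annotator-marked region are converted to probabilities via a logistic transform.

```python
import numpy as np

def nn_weak_signal(reference_patch, image_patches, scale=1.0):
    """One-versus-rest nearest-neighbor weak signal (sketch).

    reference_patch: flattened pixels of the annotator-marked region on
        the exemplar image (1-D array).
    image_patches: the same region extracted from each image to label,
        shape (num_images, patch_size).
    Returns soft labels in (0, 1): images closer to the exemplar receive
    higher probability of belonging to the exemplar's class. Centering
    at the mean distance is an assumption of this sketch.
    """
    dists = np.linalg.norm(image_patches - reference_patch, axis=1)
    # Smaller distance => higher probability, via a logistic transform.
    return 1.0 / (1.0 + np.exp(scale * (dists - dists.mean())))
```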
Fashion-MNIST
The Fashion-MNIST dataset (Xiao, Rasul, and Vollgraf 2017) represents the task of recognizing articles of clothing, where each example is a 28 × 28 grayscale image. The images are categorized into 10 classes of clothing types, where each class contains 6,000 training examples and 1,000 test examples. We used the same format of weak supervision signals as in the SVHN dataset (pseudolabels and human-annotated nearest-neighbor classifiers).

Models
We use 300-dimensional GloVe vectors (Pennington, Socher, and Manning 2014) as features for the text classification tasks. We then train a simple two-layer neural network with 512 hidden units and ReLU activation in its hidden layer. The model for the image classification tasks is a six-layer convolutional neural network with 3 × 3 filters.

Results
Tables 3 and 4 list the performance of the various weakly supervised methods on the text classification datasets, while Tables 5 and 6 list their performance on the image classification datasets. Considering both types of accuracy, CLL is able to output labels for the training data that train high-quality models for the test set. Even when its label accuracy is lower than that of competing methods, it still achieves superior test accuracy. We surmise that CLL's label accuracy is lower than competing methods on some datasets because it learns from a pessimistic maximum error estimate. CLL has better label accuracy when trained with the true error rate, but there is only a small performance difference on the held-out test set. Generally, CLL is able to learn robust labels from the weak signals, and it passes this information to the learning algorithm to help it generalize to unseen examples. For example, on the IMDB dataset, we used keyword-based weak signals that only occur in the training data. The model trained using CLL labels performs better on the test set than models trained with labels learned from data programming or majority vote. On the digit recognition task (SVHN), CLL outperforms the best compared method (average) by several percentage points on the test data. CLL is able to better synthesize information from the low-quality human-annotated signals combined with the higher-quality pseudolabel signals.

Conclusion

We introduced constrained label learning (CLL), a weakly supervised learning method that combines different weak supervision signals to produce probabilistic training labels for the data. CLL defines a constrained space for the labels of the training data by requiring that the errors of the weak signals agree with the provided error estimates. CLL is fast and converges after a few iterations of gradient descent. Our theoretical analysis shows that the accuracy of our estimated labels increases as we add more linearly independent weak signals.
This analysis is consistent with the intuition that the constrained-space interpretation of weak supervision avoids overcounting evidence when multiple redundant weak signals provide the same information, since they are linearly dependent. Our experiments compare CLL against other weak supervision approaches on different text and image classification tasks. The results demonstrate that CLL outperforms these methods on most tasks. Interestingly, we are able to perform well when we train CLL using a constant maximum error estimate for the weak signals. This shows that CLL is robust and not too sensitive to inaccuracy in the error estimates. In future work, we aim to theoretically analyze the behavior of this approach in settings where the error rates are unreliable, with the hope that theoretical understanding will suggest new approaches that are even more robust.

Broader Impact
We developed a general-purpose algorithm that we hope will make machine learning more accessible to lower-resource users. Weak supervision in general provides an alternative paradigm for training machine learning models when training labels are not available. Our algorithm could fail if the information in the weak signals is substantially outweighed by their noise. In this case, labels obtained using our algorithm may not train a good model. Moreover, machine learning in settings where reliable labels are unavailable may be more prone to bias than in settings where high-quality labels are available for auditing and analysis.
Acknowledgments
We thank NVIDIA for their support through the GPU Grant Program and Amazon for their support via the AWS Cloud Credits for Research program. Also, support for this research was provided in part by a grant from the U.S. Department of Transportation, University Transportation Centers Program to the Safety through Disruption University Transportation Center (69A3551747115).
References
Arachie, C.; and Huang, B. 2019a. Adversarial Label Learning. In Proc. of the AAAI Conf. on Artif. Intelligence, 3183–3190.
Arachie, C.; and Huang, B. 2019b. Stochastic Generalized Adversarial Label Learning. arXiv preprint arXiv:1906.00512.
Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; and Alborzi, H. 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Intl. Conf. on Manag. of Data, 362–375.
Balsubramani, A.; and Freund, Y. 2015. Scalable semi-supervised aggregation of classifiers. In Advances in Neural Information Processing Systems, 1351–1359.
Bunescu, R. C.; and Mooney, R. 2007. Learning to extract relations from the web using minimal supervision. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 45, 576–583.
Carpenter, B. 2008. Multilevel Bayesian models of categorical data annotation. Unpublished manuscript.
Proc. of the IEEE Conf. on Comp. Vis. and Pattern Recognition, 3198–3205.
Dawid, A. P.; and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics.
Proceedings of the 31st Annual Intl. ACM SIGIR Conf. on Research and Dev. in Information Retrieval, 595–602.
Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
IEEE Intelligent Systems.
Proceedings of the Conference on Machine Learning for Healthcare, 209–225.
Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proc. of the Annual Meeting of the Assoc. for Comp. Linguistics: Human Language Tech., 541–550.
Jaffe, A.; Fetaya, E.; Nadler, B.; Jiang, T.; and Kluger, Y. 2016. Unsupervised ensemble learning with dependent classifiers. In Artificial Intelligence and Statistics, 351–360.
Karger, D. R.; Oh, S.; and Shah, D. 2011. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, 1953–1961.
Khetan, A.; Lipton, Z. C.; and Anandkumar, A. 2017. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577.
Li, X.; and Roth, D. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, 1–7. Association for Computational Linguistics.
Liu, Q.; Peng, J.; and Ihler, A. T. 2012. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems, 692–700.
Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 142–150. Association for Computational Linguistics.
Madani, O.; Pennock, D. M.; and Flake, G. W. 2005. Co-validation: Using model disagreement on unlabeled data to validate classification algorithms. In Advances in Neural Information Processing Systems, 873–880.
Mann, G. S.; and McCallum, A. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. Proceedings of ACL-08: HLT.
Journal of Machine Learning Research 11: 955–984.
Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proc. of the Annual Meeting of the Assoc. for Comp. Ling.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. 2018. The Street View House Numbers (SVHN) Dataset. Technical report. Accessed 2016-08-01. [Online].
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar.
Platanios, E. A.; Al-Shedivat, M.; Xing, E.; and Mitchell, T. 2020. Learning from Imperfect Annotations. arXiv preprint arXiv:2004.03473.
Platanios, E. A.; Blum, A.; and Mitchell, T. 2014. Estimating accuracy from unlabeled data. In Proceedings of the Thirtieth Conf. on Uncertainty in Artificial Intelligence, 682–691.
Platanios, E. A.; Dubey, A.; and Mitchell, T. 2016. Estimating accuracy from unlabeled data: A Bayesian approach. In International Conference on Machine Learning, 1416–1425.
Ratner, A.; Hancock, B.; Dunnmon, J.; Goldman, R.; and Ré, C. 2018. Snorkel MeTaL: Weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 1–4.
Ratner, A. J.; Bach, S. H.; Ehrenberg, H. R.; and Ré, C. 2017. Snorkel: Fast training set generation for information extraction. In Proceedings of the 2017 ACM Intl. Conf. on Management of Data, 1683–1686. ACM.
Ratner, A. J.; De Sa, C. M.; Wu, S.; Selsam, D.; and Ré, C. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Info. Proc. Sys., 3567–3575.
Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In Joint Euro. Conf. on Mach. Learn. and Knowledge Disc. in Databases.
Schapire, R. E.; Rochery, M.; Rahim, M.; and Gupta, N. 2002. Incorporating prior knowledge into boosting. In Intl. Conf. on Machine Learning, volume 2, 538–545.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642.
Steinhardt, J.; and Liang, P. S. 2016. Unsupervised risk estimation using only conditional independence structure. In Adv. in Neural Information Processing Systems, 3657–3665.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xu, J.; Schwing, A. G.; and Urtasun, R. 2014. Tell me what you see and I will show you where it is. In Proc. of the IEEE Conf. on Computer Vis. and Pattern Recog., 3190–3197.
Yao, L.; Riedel, S.; and McCallum, A. 2010. Collective cross-document relation extraction without labelled data. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, 1013–1023.
Zhou, D.; Liu, Q.; Platt, J. C.; Meek, C.; and Shah, N. B. 2015. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240.
Zhou, Y.; and He, J. 2016. Crowdsourcing via Tensor Augmentation and Completion. In IJCAI, 2435–2441.

Appendices
A Reproducibility
We describe here algorithm and experiment details to help readers reproduce our experiments.

The synthetic data used for generating Fig. 2 is a binary classification dataset with 100 randomly generated examples containing 20 binary features. The weak signals are random binary predictions for the labels, where each weak signal's error rate is calculated using the true labels of the data.

Two sources of randomness exist in our experiments: (1) the random initialization of the initial label in constrained label learning (CLL) and (2) the random initialization of model weights when learning from any weakly supervised method. To account for this randomness, we run each experiment three times and report averages and standard deviations of the accuracies we measure. In the tables in the main paper, the bolded numbers represent the best method among the comparable results. (We do not compare weakly supervised methods with fully supervised methods, and we also do not compare against CLL when it has access to the true error rates, since that information should not be available in practical settings.)
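The synthetic setup above, combined with the constrained labeling procedure (Algorithm 1), can be sketched as follows. Since Eqs. (2), (3), and (5) are not reproduced in this excerpt, the constraint encoding and the quadratic-penalty gradient below are illustrative stand-ins rather than the paper's exact formulation; the seed, the number of weak signals, and all variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (assumption)

# Synthetic data as described above: 100 examples with 20 binary features,
# random binary weak signals, and error rates measured on the true labels.
n, m = 100, 5                       # m weak signals (count is an assumption)
X = rng.integers(0, 2, size=(n, 20))
y_true = rng.integers(0, 2, size=n)
W = rng.integers(0, 2, size=(m, n)).astype(float)  # weak signal predictions
eps = np.mean(W != y_true, axis=1)                  # true error rates

# Linear error constraints. For binary labels, the error of signal w on soft
# labels y is mean(|w - y|) = mean(w + (1 - 2w) * y), so requiring it to be
# at most eps gives A @ y <= c (an illustrative encoding of Eqs. 2-3).
A = np.where(W == 1, -1.0, 1.0) / n
c = eps - W.sum(axis=1) / n

y = rng.uniform(0.0, 1.0, size=n)   # random initialization in U(0, 1)^n
for _ in range(1000):                # projected gradient descent
    violation = np.maximum(A @ y - c, 0.0)
    # Gradient of the penalty 0.5 * ||max(A @ y - c, 0)||^2 is A^T violation.
    y = np.clip(y - 5.0 * (A.T @ violation), 0.0, 1.0)
```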
Algorithm
Algorithm 1
Randomized Constrained Labeling
Require: Dataset X = [x_1, . . . , x_n], weak signals [w_1, . . . , w_m], and expected errors ε = [ε_1, . . . , ε_m] for the signals.
1: Define A from Eq. (2) and c from Eq. (3) using the weak signals and expected errors.
2: Initialize ỹ as ỹ ∼ U(0, 1)^n
3: while not converged do
4:    Update ỹ with its gradient from Eq. (5)
5:    Clip ỹ to [0, 1]^n
6: end while
7: return estimated labels ỹ

Dataset Details and Weak Supervision
In this section, we describe the datasets we use in the experiments and the weak supervision we provide to the learning algorithms. Table 7 summarizes the key statistics of these datasets.

We provide below the keywords and heuristics we used to generate the weak signals in our experiments. For some signals, we used multiple words, since individual words have little coverage in the data. Multiple-word signals are represented as nested lists, while single-word signals are shown as single lists.

Table 7: Summary of datasets, including the number of weak signals used for training.

Dataset        No. classes  No. weak signals  Train Size  Test Size
IMDB           2            10                35,790      14,210
SST-2          2            14                3,998       1,821
Yelp           2            11                45,370      10,000
TREC-6         6            30                2,946       500
Fashion-MNIST  10           100               60,000      10,000
IMDB
We used 5 positive keywords representing 5 positive signals and 5 negative keywords as 5 negative signals. The positive signals are [good, wonderful, amazing, excellent, great] while the negative signals are [bad, horrible, sucks, awful, terrible].
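A keyword signal of the kind listed above can be sketched as follows. Whitespace tokenization and abstaining with None on uncovered examples are assumptions of this sketch; the excerpt does not specify how uncovered examples are encoded.

```python
def keyword_signal(text, keyword, label):
    """One keyword-based weak signal: vote `label` if `keyword` occurs
    in the review, otherwise abstain (None)."""
    return label if keyword in text.lower().split() else None

# The five positive IMDB signals listed above (1 = positive):
positive_signals = [
    (lambda t, k=k: keyword_signal(t, k, 1))
    for k in ["good", "wonderful", "amazing", "excellent", "great"]
]
```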
SST-2
Similar to IMDB, but here we use 7 positive signals and 7 negative signals that contain nested lists of keywords. The positive signals are
• good, great, nice, delight, wonderful
• love, best, genuine, well, thriller
• clever, enjoy, fine, deliver, fascinating
• super, excellent, charming, pleasure, strong
• fresh, comedy, interesting, fun, entertain, charm, clever
• amazing, romantic, intelligent, classic, stunning
• rich, compelling, delicious, intriguing, smart
while the negative signals are
• bad, better, leave, never, disaster
• nothing, action, fail, suck, difficult
• mess, dull, dumb, bland, outrageous
• slow, terrible, boring, insult, weird, damn
• drag, no, not, awful, waste, flat
• horrible, ridiculous, stupid, annoying, painful
• poor, pathetic, pointless, offensive, silly

YELP
We used the same 7 positive signals and the first 4 negative signals as in SST-2 above.