Error Rate Bounds in Crowdsourcing Models
By Hongwei Li, Bin Yu, and Dengyong Zhou

Department of Statistics, UC Berkeley; Department of EECS, UC Berkeley; and Microsoft Research, Redmond

Crowdsourcing is an effective tool for human-powered computation on many tasks that are challenging for computers. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of hyperplane binary labeling rules under the Dawid-Skene crowdsourcing model. The bounds can be applied to analyze many common prediction methods, including majority voting and weighted majority voting. These results could be useful for controlling the error rate and designing better algorithms. We show that the oracle Maximum A Posteriori (MAP) rule approximately optimizes our upper bound on the mean error rate of any hyperplane binary labeling rule, and we propose a simple data-driven weighted majority voting (WMV) rule (called one-step WMV) that attempts to approximate the oracle MAP rule and has a provable theoretical guarantee on its error rate. Moreover, we use simulated and real data to demonstrate that the data-driven EM-MAP rule is a good approximation to the oracle MAP rule, and that the mean error rate of the data-driven EM-MAP rule is also bounded by the mean error rate bound of the oracle MAP rule with estimated parameters plugged in.

∗ Email: [email protected] (Hongwei Li), [email protected] (Bin Yu) and [email protected] (Dengyong Zhou)
1. Introduction.
There are many tasks that can be easily carried out by people but tend to be hard for computers, e.g., image annotation and visual design. When these tasks require large-scale data processing, outsourcing them to experts or well-trained people may be too expensive. Crowdsourcing has recently emerged as a powerful alternative. It outsources tasks to a distributed group of people (usually called workers) who might be inexperienced in these tasks. However, if we can appropriately aggregate the outputs from a crowd, the aggregated results could be as good as the results given by an expert [1, 5, 6, 8, 9, 10, 11, 12, 13].

The flaws of crowdsourcing are apparent. Each worker is paid purely based on how many tasks he/she has completed (for example, one cent for labeling one image). No ground truth is available to evaluate how well he/she has performed on the tasks. So some workers may randomly submit answers, independent of the questions, when the tasks assigned to them are beyond their expertise. Moreover, workers are usually not persistent: some workers may complete many tasks, while others may finish only a few, even just one.
Keywords and phrases: Crowdsourcing, Dawid-Skene model, Error rate bounds, EM algorithm.

In spite of these drawbacks, is it still possible to get reliable answers in a crowdsourcing system? The answer is yes. In fact, majority voting (MV) has been able to generate fairly reasonable results [5, 6, 9, 13]. However, majority voting treats each worker's result as equal in quality. It does not distinguish a spammer from a diligent worker. So we can expect that majority voting can be significantly improved upon.

The first improvement over majority voting might date back to Dawid and Skene [1]. They assumed that each worker is associated with an unknown confusion matrix. Each off-diagonal element represents the misclassification rate from one class to the other, while the diagonal elements represent the accuracy in each class. Given the labels observed from the workers, the maximum likelihood principle is applied to jointly estimate the unobserved true labels and the worker confusion matrices. The likelihood function is non-convex, but a local optimum can be obtained by the Expectation-Maximization (EM) algorithm, which can be initialized by majority voting.

Dawid and Skene's approach [1] can be straightforwardly extended by assuming the true labels to be generated from a logistic model [6] or by putting a prior over worker confusion matrices [5, 6]. One may simplify the assumption made in [1] by considering a symmetric confusion matrix [4, 6], which we call the Symmetric Dawid-Skene model.

Recently, significant efforts have been made to analyze the error rates of the algorithms in the literature. In [4], Karger et al. provided asymptotic error bounds for their iterative algorithm and also for majority voting. However, the error bound for their specific iterative algorithm cannot be generalized to other prediction rules in crowdsourcing, and the asymptotic bounds may not be that practical, since we always have only a finite number of tasks. Additionally, their results depend on the assumption that the same number of items is assigned to each worker and the same number of workers labels each item. According to the analysis in [5] by Liu et al., this assumption is restrictive in practice. Ho et al. [2] formulated the adaptive task assignment problem, which considers adaptively assigning workers to different types of tasks according to their performance, as an optimization problem, and provided a performance guarantee for their algorithm. Meanwhile, they provided a mean error rate bound for weighted majority voting, which can be viewed as a special case of our general bound on the mean error rate of hyperplane rules.

In this paper, we focus on providing bounds on the error rate under crowdsourcing models whose effectiveness on real data has been evaluated in [1, 5, 6]. Our main contributions are as follows. We derive error rate bounds in probability and in expectation for a finite number of workers and items under the Dawid-Skene model (with the Symmetric Dawid-Skene model as a special case). Moreover, we provide error bounds for the oracle Maximum A Posteriori (MAP) rule and a data-driven weighted majority voting (WMV), and show that the oracle MAP rule approximately optimizes the upper bound on the mean error rate of any hyperplane rule. Under the Symmetric Dawid-Skene model,
we use simulation to demonstrate that the data-driven EM-MAP rule approximates the oracle MAP rule well. To the best of our knowledge, this is the first work that focuses on error rate analysis of general prediction rules under the practical Dawid-Skene model for crowdsourcing, which can be used for analyzing the error rate and sample complexity of algorithms like those in [2, 4].
2. Problem setting and formulation.
We focus on binary labeling in this paper. Assume that a set of workers is assigned to label certain items that are available on the Internet, for instance, whether an image of an animal is that of a cat or a dog, or whether a face image is male or female.

Formally, suppose we have $M$ workers and $N$ items. For convenience, we denote $[M] = \{1, \dots, M\}$ and $[N] = \{1, \dots, N\}$. The label matrix is denoted by $Z \in \{\pm 1, 0\}^{M \times N}$, in which $Z_{ij}$ is the label of the $j$-th item given by the $i$-th worker; it is 0 if the corresponding label is missing. Throughout the paper, we use $y_j$ for the true label of the $j$-th item and $\hat{y}_j$ for the label of the $j$-th item predicted by an algorithm. At the same time, any parameter with a hat $\hat{\cdot}$ is an estimate of that parameter. Let $\pi = P(y_j = 1)$ for any $j \in [N]$ denote the prevalence of label "+" in the true labels of the items. We introduce the indicator matrix $T = (T_{ij})_{M \times N}$, where $T_{ij} = 1$ indicates that entry $(i,j)$ is observed, i.e., the $i$-th worker has labeled the $j$-th item, and $T_{ij} = 0$ indicates that entry $(i,j)$ is unobserved. Note that $T$ and $Z$ are observed together in our crowdsourcing setting, and both the number of items labeled by each worker and the number of workers assigned to each item are random. The sampling probability matrix is denoted by $Q = (q_{ij})_{M \times N}$, where $q_{ij} = P(T_{ij} = 1)$, i.e., $q_{ij}$ is the probability that the $i$-th worker labels the $j$-th item. When $q_{ij} = q_i \in (0, 1]$ for all $i \in [M], j \in [N]$, we call it sampling with probability vector $\vec{q} = (q_1, \dots, q_M)$. If $q_{ij} = q \in (0, 1]$ for all $i \in [M], j \in [N]$, then we call it sampling with constant probability $q$.

We will discuss two models that are widely used for modeling the quality of the workers [4, 5, 6, 13]. They were first proposed by Dawid and Skene [1]:
Dawid-Skene model.
We distinguish the accuracy of workers on the positive class and the negative class. Some workers might work better at labeling the items with true label "+", and some might work better at labeling the items with true label "−". The true positive rate (sensitivity) and the true negative rate (specificity) are denoted as follows, respectively: for $i = 1, 2, \dots, M$,
$$p_i^+ := P(Z_{ij} = 1 \mid y_j = 1, T_{ij} = 1) \quad \text{and} \quad p_i^- := P(Z_{ij} = -1 \mid y_j = -1, T_{ij} = 1). \tag{2.1}$$
Then the parameter set will be $\Theta = \left\{\{p_i^+, p_i^-\}_{i=1}^M, Q, \pi\right\}$ under this model.

Symmetric Dawid-Skene model.
We assume that the $i$-th worker labels an item correctly with a fixed probability $w_i = P(Z_{ij} = y_j \mid y_j, T_{ij} = 1)$, i.e., $p_i^+ = p_i^- = w_i$. In this case, no matter whether an item is from the positive class or the negative class, the worker labels it with the same accuracy. Therefore, the parameter set is $\Theta = \left\{\{w_i\}_{i=1}^M, Q, \pi\right\}$.

Under both models above, the posterior probability of the label of each item being "+" is defined as $\rho_j = P(y_j = 1 \mid Z, T, \Theta)$ for all $j \in [N]$. Given an estimation or prediction rule, suppose that its predicted label for item $j$ is $\hat{y}_j$; then our objective is to minimize the error rate. Since the error rate is random, we are also interested in its expected value (i.e., the mean error rate). Formally, the error rate and its expected value are
$$\mathrm{ER} = \frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \quad \text{and} \quad \mathbb{E}[\mathrm{ER}] = \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j). \tag{2.2}$$
A minimal simulation sketch of this setting is given below.

The rest of the paper is organized as follows. In Section 3, we present finite-sample bounds on the error rate of a hyperplane rule, in probability and in expectation, under the Dawid-Skene model. In Section 4, we apply our analysis to label inference by the maximum likelihood method, illustrate the bound on the oracle MAP rule, and present a bound for a simple data-driven weighted majority voting under the Symmetric Dawid-Skene model. Experimental results on simulated and real-world datasets are presented in Section 5. Note that the proofs are deferred to the supplementary materials.
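To make the setting concrete, the following minimal sketch (our own Python illustration, not part of the original paper; all function names are ours) simulates the Symmetric Dawid-Skene model with constant probability sampling and evaluates the empirical error rate (2.2) of majority voting.

```python
import numpy as np

def simulate_symmetric_ds(M, N, w, q, pi, rng):
    """Simulate the Symmetric Dawid-Skene model (a sketch).

    w  : length-M array, w[i] = P(worker i labels an item correctly)
    q  : constant sampling probability P(T_ij = 1)
    pi : prevalence P(y_j = +1)
    Returns true labels y (N,) and label matrix Z (M, N); 0 marks missing.
    """
    y = np.where(rng.random(N) < pi, 1, -1)            # true labels
    T = rng.random((M, N)) < q                         # indicator matrix T
    correct = rng.random((M, N)) < w[:, None]          # worker i correct w.p. w_i
    Z = np.where(correct, y[None, :], -y[None, :])     # correct or flipped label
    return y, np.where(T, Z, 0)

def error_rate(y_hat, y):
    """Empirical error rate, the first quantity in (2.2)."""
    return np.mean(y_hat != y)

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 0.9, size=11)                     # 11 workers, as in Fig. 1(a)
y, Z = simulate_symmetric_ds(M=11, N=300, w=w, q=0.3, pi=0.5, rng=rng)
y_mv = np.where(Z.sum(axis=0) >= 0, 1, -1)             # majority voting baseline
print("MV error rate:", error_rate(y_mv, y))
```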
3. Error rate bounds.
In this section, we provide finite-sample bounds on the error rate of any hyperplane rule under the Dawid-Skene model, in high probability and in expectation.

3.1. Bounds on the error rate in high probability.
A hyperplane prediction or estimation rule is a rectified linear function of the observation matrix $Z$ in a high-dimensional space: given an unnormalized weight vector $\nu \in \mathbb{R}^M$ (independent of $Z$) and a shift constant $a$, for the $j$-th item the rule estimates its label as
$$\hat{y}_j = \mathrm{sign}\Big(\sum_{i=1}^M \nu_i Z_{ij} + a\Big).$$
In the rest of this paper, we call it a hyperplane rule. It is a very general rule, with special cases including majority voting (MV), which has $\nu_i = 1$ for all $i$ and $a = 0$.

Next we present two general theorems that provide finite-sample error rate bounds for hyperplane rules under the Dawid-Skene model. Before that, we introduce some notation:
$$\Lambda_j^+ = \sum_{i=1}^M q_{ij}\nu_i(2p_i^+ - 1) + a \quad \text{and} \quad \Lambda_j^- = \sum_{i=1}^M q_{ij}\nu_i(2p_i^- - 1) - a, \tag{3.1}$$
$$t_1 = \min_{j \in [N]} \frac{\Lambda_j^+ \wedge \Lambda_j^-}{\|\nu\|_2} \quad \text{and} \quad t_2 = \max_{j \in [N]} \frac{\Lambda_j^+ \vee \Lambda_j^-}{\|\nu\|_2},$$
$$\phi(x) = e^{-x^2/2},\; x \in \mathbb{R}, \quad \text{and} \quad \mathrm{D}(x\|y) = x\ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y},\; x, y \in (0,1),$$
where $\|\cdot\|_2$ is the $L_2$ norm, $x \wedge y$ is $\min\{x,y\}$ and $x \vee y$ is $\max\{x,y\}$.

Theorem 1. For a given sampling probability matrix $Q = (q_{ij})_{M \times N}$, let $\hat{y}_j$ be the estimate by the hyperplane rule with weight vector $\nu$ and shift constant $a$. For any $\epsilon \in (0,1)$, we have

(1) when $t_1 \ge \sqrt{2\ln\frac{1}{\epsilon}}$, $\;P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \le \epsilon\right) \ge 1 - e^{-N\,\mathrm{D}(\epsilon \,\|\, \phi(t_1))}$;

(2) when $t_2 \le -\sqrt{2\ln\frac{1}{1-\epsilon}}$, $\;P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \le \epsilon\right) \le e^{-N\,\mathrm{D}(\epsilon \,\|\, 1-\phi(t_2))}$.

Remark.
In fact, we have $t_1\|\nu\|_2 \le \mathbb{E}\left[y_j\left(\sum_{i=1}^M \nu_i Z_{ij} + a\right)\right] \le t_2\|\nu\|_2$ for all $j \in [N]$; thus $t_1$ and $t_2$ are two very important quantities for controlling the error rate of a hyperplane rule with fixed weights. Note that for a fixed sampling probability matrix, if the weights are positive, then the better the worker is over random guessing (i.e., the bigger $2p_i^+ - 1$) for "+" labels and the larger the shift $a$, the larger the $\Lambda_j^+$. We can interpret $\Lambda_j^-$ similarly. Usually we are free to choose $\nu$ and $a$, and in some situations we can also control $Q$, so the most important factors that we cannot control are $p_i^+$ and $p_i^-$.

To control by $\delta$ the probability that the error rate exceeds $\epsilon$, we would have to solve the equation $\exp\{-N\,\mathrm{D}(\epsilon\|\phi(t_1))\} = \delta$, which cannot be solved analytically, so we need a method which tells us the minimum $t_1$ for bounding the error rate with probability at least $1-\delta$. The next theorem serves this purpose. For notational convenience, we define two constants $C$ and $G$ for $\epsilon, \delta \in (0,1)$:
$$C(\epsilon,\delta) = 1 + \exp\left(\frac{2}{\epsilon}\left[H_e(\epsilon) + \frac{1}{N}\ln\frac{1}{\delta}\right]\right) \quad \text{and} \quad G(\epsilon,\delta) = 1 + \exp\left(\frac{2}{1-\epsilon}\left[H_e(\epsilon) + \frac{1}{N}\ln\frac{1}{\delta}\right]\right), \tag{3.2}$$
where $H_e(\epsilon) = -\epsilon\ln\epsilon - (1-\epsilon)\ln(1-\epsilon)$.

Theorem 2. Under the same setting as in Theorem 1, for any $\epsilon, \delta \in (0,1)$, we have

(1) if $t_1 \ge \sqrt{\ln C(\epsilon,\delta)}$, then $P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \le \epsilon\right) \ge 1 - \delta$.

(2) If $t_2 \le -\sqrt{\ln G(\epsilon,\delta)}$, then $P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \le \epsilon\right) < \delta$.

To gain insight, we consider a simple and common method, majority voting (MV), with entries sampled with constant probability $q$ under the Symmetric Dawid-Skene model, i.e., $p_i^+ = p_i^- = w_i$ for all $i \in [M]$. In this case, the weight of each worker is the same. The result below follows from Theorem 1 by taking $q_{ij} = q$, $p_i^+ = p_i^- = w_i$, $\nu_i = 1$ and $a = 0$.

Corollary 3. Under the Symmetric Dawid-Skene model, for majority voting with constant probability sampling $q \in (0,1]$, let $\bar{w} = \frac{1}{M}\sum_{i=1}^M w_i$ denote the average accuracy of the workers. If $\bar{w} \ge \frac{1}{2} + \frac{1}{q}\sqrt{\frac{1}{2M}\ln\frac{1}{\epsilon}}$, then
$$P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j \neq y_j) \le \epsilon\right) \ge 1 - e^{-N\,\mathrm{D}(\epsilon\|\psi)}, \quad \text{where } \psi = e^{-2Mq^2\left(\bar{w} - \frac{1}{2}\right)^2}.$$

Remark.
This result implies that, for the error rate to be small, the average accuracy of workers $\bar{w}$ has to be better than random guessing by $\Omega(q^{-1}M^{-1/2})$. This requirement becomes easier to satisfy with more workers (larger $M$) and with each worker labeling more items ($q$ close to 1). A small numeric illustration of Corollary 3 is sketched below.
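The guarantee in Corollary 3 is easy to evaluate numerically. The sketch below (ours; the helper names are illustrative, not from the paper) checks the condition on $\bar{w}$ and computes the bound $1 - e^{-N\,\mathrm{D}(\epsilon\|\psi)}$ as stated above.

```python
import numpy as np

def kl_bernoulli(x, y):
    """D(x || y) = x ln(x/y) + (1 - x) ln((1 - x)/(1 - y)), for x, y in (0, 1)."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def corollary3_bound(M, N, q, w_bar, eps):
    """High-probability guarantee for majority voting from Corollary 3.

    Returns (condition_holds, lower bound on P(error rate <= eps)).
    The bound is meaningful only when psi <= eps, which the condition implies.
    """
    condition = w_bar >= 0.5 + np.sqrt(np.log(1 / eps) / (2 * M)) / q
    psi = np.exp(-2 * M * q**2 * (w_bar - 0.5)**2)   # per-item error level
    bound = 1 - np.exp(-N * kl_bernoulli(eps, psi)) if psi < eps else None
    return condition, bound

# Example: 50 workers of average accuracy 0.75 labeling 300 items with q = 0.8.
ok, prob = corollary3_bound(M=50, N=300, q=0.8, w_bar=0.75, eps=0.1)
print(ok, prob)   # condition satisfied -> P(error rate <= 10%) at least `prob`
```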
3.2. Bounds on the error rate in expectation.

One is often interested in bounding the mean error rate of a general hyperplane rule, since the mean error rate gives the expected proportion of items labeled wrongly.
Theorem 4. (Mean error rate bounds under the Dawid-Skene model) Under the same setting as in Theorem 1, with $c$ and $\sigma$ defined as
$$c = \frac{\|\nu\|_\infty}{\|\nu\|_2} \quad \text{and} \quad \sigma^2 = \max_{j \in [N]} \frac{1}{\|\nu\|_2^2}\left[\left(\sum_{i=1}^M \nu_i^2 q_{ij}\left(1 - q_{ij}(2p_i^+ - 1)^2\right)\right) \vee \left(\sum_{i=1}^M \nu_i^2 q_{ij}\left(1 - q_{ij}(2p_i^- - 1)^2\right)\right)\right], \tag{3.3}$$

(1) if $t_1 \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \le \min\left\{\exp\left(-\frac{t_1^2}{2}\right), \exp\left(-\frac{t_1^2}{2(\sigma^2 + ct_1/3)}\right)\right\}$;

(2) if $t_2 \le 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{t_2^2}{2}\right), \exp\left(-\frac{t_2^2}{2(\sigma^2 - ct_2/3)}\right)\right\}$.

Remark.
In fact, $c\|\nu\|_2$ is an upper bound on $|\nu_i Z_{ij}|$ for all $i \in [M], j \in [N]$, and $\sigma^2\|\nu\|_2^2$ is an upper bound on the variance of $\sum_{i=1}^M \nu_i Z_{ij} + a$, whose sign is used to predict the label of the $j$-th item.

The next corollary gives the mean error rate bounds under the Symmetric Dawid-Skene model, which can be derived from the last theorem by letting $p_i^+ = p_i^- = w_i$ and enlarging $\sigma$ for clarity.

Corollary 5. Under the Symmetric Dawid-Skene model, for a given sampling probability matrix $Q = (q_{ij})_{M \times N}$, assume the prediction rule is a hyperplane rule with weight $\nu$ and shift constant $a$. Then, with $c$ defined as in (3.3) and
$$\sigma'^2 = \max_{j \in [N]} \frac{1}{\|\nu\|_2^2}\sum_{i=1}^M \nu_i^2 q_{ij}, \quad t_1' = \min_{j \in [N]} \frac{1}{\|\nu\|_2}\left(\sum_{i=1}^M q_{ij}\nu_i(2w_i - 1) - |a|\right) \quad \text{and} \quad t_2' = \max_{j \in [N]} \frac{1}{\|\nu\|_2}\left(\sum_{i=1}^M q_{ij}\nu_i(2w_i - 1) + |a|\right), \tag{3.4}$$

(1) if $t_1' \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \le \min\left\{\exp\left(-\frac{t_1'^2}{2}\right), \exp\left(-\frac{t_1'^2}{2(\sigma'^2 + ct_1'/3)}\right)\right\}$;

(2) if $t_2' \le 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{t_2'^2}{2}\right), \exp\left(-\frac{t_2'^2}{2(\sigma'^2 - ct_2'/3)}\right)\right\}$.

Due to their complicated forms, the results above might not be very intuitive. Let us look at the majority voting case by applying $\nu_i = 1$ for all $i$ and $a = 0$ to the first part of the bound in Corollary 5.

Corollary 6. For majority voting under the Symmetric Dawid-Skene model and constant probability sampling $q \in (0,1]$,

(1) if $\bar{w} > \frac{1}{2}$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \le e^{-2Mq^2\left(\bar{w} - \frac{1}{2}\right)^2}$;

(2) if $\bar{w} < \frac{1}{2}$, then $\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) \ge 1 - e^{-2Mq^2\left(\bar{w} - \frac{1}{2}\right)^2}$.

Remark.
The mean error rate of MV decays exponentially as $M$ increases, provided the average labeling accuracy of the workers is better than random guessing, and the gap between the average accuracy and 0.5 plays an important role in the bound. In particular, it implies that (1) if $\lim_{M\to\infty} \bar{w} > \frac{1}{2}$, then $\lim_{M\to\infty} \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) = 0$; and (2) if $\lim_{M\to\infty} \bar{w} < \frac{1}{2}$, then $\lim_{M\to\infty} \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j \neq y_j) = 1$.

As mentioned earlier, a main result in [4] is very similar to our result in Corollary 6. For comparison, we rewrite their result in our notation: the upper bound for MV in [4] (page 5, (3)) becomes $\exp\left(-Mq\left(\bar{w} - \frac{1}{2}\right)^2\right)$. It differs from ours by a factor of $2q$ in the exponent; if $q > 0.5$, our bound is tighter. It is worth mentioning that the results in [4] are asymptotic (with $N \to \infty$), while ours apply to both the asymptotic and the finite-sample situations. Moreover, they assumed that the number of items labeled by each worker is the same and the number of workers assigned to each item is also the same, while we do not make that assumption. A worked example of how Corollary 6 can be used to size a crowdsourcing task is sketched below.
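Part (1) of Corollary 6 can be inverted to estimate how many workers are needed for a target mean error rate. A hedged sketch (assuming $\bar{w} > 1/2$; the helper name is ours, for illustration only):

```python
import math

def workers_needed(target_mean_er, q, w_bar):
    """Invert the Corollary 6 bound exp(-2 M q^2 (w_bar - 1/2)^2):
    smallest M whose mean error rate bound is <= target_mean_er.
    """
    if w_bar <= 0.5:
        raise ValueError("bound only applies when average accuracy beats 1/2")
    return math.ceil(math.log(1 / target_mean_er)
                     / (2 * q**2 * (w_bar - 0.5)**2))

# e.g. a mildly reliable crowd (w_bar = 0.65) labeling every item (q = 1):
print(workers_needed(0.05, q=1.0, w_bar=0.65))   # -> 67 workers for a 5% bound
```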
4. Data-driven EM-MAP rule and one-step weighted majority voting.
If we know the posterior probability $\rho_j$ (defined in Section 2) of the label of each item, then the Bayes classifier predicts $\hat{y}_j = 2I(\rho_j > 0.5) - 1$. Thus, if we estimate the posterior $\rho_j$ well, we can apply the same rule to predict the true label with the estimated posterior probability. One natural way to approach this is to apply the maximum likelihood method to the observed label matrix in order to estimate the parameter set $\Theta$ and, consequently, the posterior.

4.1. Maximum A Posteriori (MAP) rule and the oracle MAP rule.
As in [1], we can apply the EM algorithm to obtain the maximum likelihood estimates of the parameters and the posterior $\hat{\rho}_j$. With $\hat{\rho}_j$, each item can be assigned the label with the largest posterior; that is, the prediction function of the MAP rule is $\hat{y}_j = 2I(\hat{\rho}_j > 0.5) - 1$, where $\hat{\rho}_j$ is the estimated posterior probability. We call this method the EM-MAP rule; a minimal implementation is sketched below. However, the EM algorithm cannot guarantee convergence to the global maximum of the likelihood function. The estimated parameters might not be close to the true parameters, and similarly for the estimated posterior. Error rate analysis is generally hard to conduct for the EM algorithm due to its iterative nature. Nevertheless, we can consider the oracle MAP rule, which knows the true parameters and thus uses the true posterior $\rho_j$ in the MAP rule to label items, i.e., $\hat{y}_j = 2I(\rho_j > 0.5) - 1$. We can apply the mean error rate bounds from Section 3.2 to the oracle MAP rule.
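For concreteness, here is a minimal EM sketch for the Symmetric Dawid-Skene model (our own illustration of the approach of [1], not the authors' code); its E-step log-odds coincide with the hyperplane form that Theorem 7 below establishes for the oracle rule.

```python
import numpy as np

def em_map(Z, n_iter=50, eps=1e-6):
    """EM-MAP for the Symmetric Dawid-Skene model (a sketch).
    Z is (M, N) with entries in {+1, -1, 0}; 0 = missing.
    Assumes every worker labels at least one item.
    Returns predicted labels, estimated accuracies w_hat, prevalence pi_hat.
    """
    observed = Z != 0
    # Initialize the posterior by majority voting, as suggested in [1].
    rho = np.where(Z.sum(axis=0) >= 0, 1.0, 0.0)       # P(y_j = +1)
    for _ in range(n_iter):
        # M-step: expected fraction of correct labels for each worker.
        expected_correct = (Z == 1) * rho + (Z == -1) * (1 - rho)
        w = np.clip(expected_correct.sum(axis=1) / observed.sum(axis=1),
                    eps, 1 - eps)
        pi = np.clip(rho.mean(), eps, 1 - eps)
        # E-step: posterior log-odds of y_j = +1; missing labels (0) drop out.
        score = np.log(w / (1 - w)) @ Z + np.log(pi / (1 - pi))
        rho = 1.0 / (1.0 + np.exp(-score))
    return np.where(rho > 0.5, 1, -1), w, pi
```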
Theorem 7. For the oracle MAP rule knowing the true parameters $\Theta = \left\{\{w_i\}_{i=1}^M, Q, \pi\right\}$, whose prediction function is $\hat{y}_j = 2I(\rho_j > 0.5) - 1$, under the Symmetric Dawid-Skene model the oracle MAP rule is a hyperplane rule, i.e., $\hat{y}_j = \mathrm{sign}\left(\sum_{i=1}^M \nu_i Z_{ij} + a\right)$ with $\nu_i = \ln\frac{w_i}{1 - w_i}$ and $a = \ln\frac{\pi}{1 - \pi}$; thus the same mean error rate bounds, under the same conditions as in Corollary 5, also hold here.

Remark.
Although it is hard to obtain a performance guarantee for the EM-MAP rule, empirically it has almost the same performance as the oracle MAP rule in simulation when $\bar{w} > 0.5$, as will be shown in Section 5. This suggests that the bound on the mean error rate of the oracle MAP rule could be useful for estimating the error rate of the EM-MAP rule in practice, as we will do in Section 5. The log-odds weights of Theorem 7 give a direct implementation of the oracle rule, sketched below.
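A direct reading of Theorem 7 as code (a sketch assuming the true parameters are known; names are ours):

```python
import numpy as np

def oracle_map(Z, w, pi):
    """Oracle MAP rule of Theorem 7: a hyperplane rule with log-odds
    weights nu_i = ln(w_i / (1 - w_i)) and shift a = ln(pi / (1 - pi)).
    Z is (M, N) with entries in {+1, -1, 0}; 0 = missing.
    """
    nu = np.log(w / (1 - w))                 # per-worker log-odds weights
    a = np.log(pi / (1 - pi))                # prior shift
    score = nu @ Z + a                       # hyperplane score per item
    return np.where(score >= 0, 1, -1)       # sign(.), ties broken toward +1
```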
4.2. The oracle MAP rule and the oracle bound-optimal rule.
In this section, we explore the relationship between the oracle MAP rule and the error rate bound, under the Symmetric Dawid-Skene model for clarity. Meanwhile, for simplicity we consider the situation where the entries of the observed label matrix are sampled with a constant probability $q$.

Let us look closely at the mean error rate bound in Corollary 5.(1). When sampling with a constant probability $q$, the bound is monotonically decreasing w.r.t. $t_1'$ on $[0, \infty)$ and $\sigma'^2 = q$, so optimizing the upper bound is equivalent to maximizing $t_1'$:
$$(\nu^\star, a^\star) = \operatorname*{argmax}_{\nu \in \mathbb{R}^M, a \in \mathbb{R}} t_1' = \operatorname*{argmax}_{\nu \in \mathbb{R}^M, a \in \mathbb{R}} \left( q\sum_{i=1}^M \frac{\nu_i}{\|\nu\|_2}(2w_i - 1) - \frac{|a|}{\|\nu\|_2} \right) \;\Rightarrow\; \text{Oracle bound-optimal rule: } \nu_i^\star \propto 2w_i - 1,\; a^\star = 0. \tag{4.1}$$
The prediction function of the oracle MAP rule is $\hat{y}_j = \mathrm{sign}\left(\sum_{i=1}^M \ln\left(\frac{w_i}{1-w_i}\right) Z_{ij} + \ln\frac{\pi}{1-\pi}\right)$, which is a hyperplane rule with weights $\nu_i^{\mathrm{oracMAP}} = \ln\frac{w_i}{1-w_i}$ and shift $a^{\mathrm{oracMAP}} = \ln\frac{\pi}{1-\pi}$. Since, by Taylor expansion, $\ln\frac{x}{1-x} = (4x - 2) + O\left(\left(x - \frac{1}{2}\right)^3\right)$, we see that the weight of the oracle bound-optimal rule is the first-order Taylor expansion of the weight of the oracle MAP rule; the sketch below compares the two numerically. Similar results and conclusions hold for the Dawid-Skene model as well, but we omit them due to space limitations.

By observing that the oracle MAP rule is very close to the oracle bound-optimal rule, we conclude that the oracle MAP rule approximately optimizes the upper bound on the mean error rate. This fact also indicates that our bound is meaningful, since the oracle MAP rule is the oracle Bayes classifier.
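A quick numeric comparison of the two weightings (our own illustration):

```python
import numpy as np

# How close is the bound-optimal weight 4w - 2 (proportional to 2w - 1)
# to the oracle MAP log-odds weight ln(w / (1 - w))?  A check around 1/2.
w = np.linspace(0.55, 0.9, 8)
log_odds = np.log(w / (1 - w))
taylor = 4 * w - 2
for wi, lo, ta in zip(w, log_odds, taylor):
    print(f"w={wi:.2f}  ln(w/(1-w))={lo:+.3f}  4w-2={ta:+.3f}  gap={lo-ta:+.3f}")
# The gap is O((w - 1/2)^3): tiny near random guessing, larger for experts.
```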
4.3. Error rate bounds on one-step weighted majority voting.

Weighted majority voting (WMV) is a hyperplane rule with shift constant $a = 0$; if the weights of all workers are the same, it degenerates to majority voting. From Section 4.2, we know that the bound-optimal strategy for choosing the weights is $\nu_i \propto 2w_i - 1$: we can put more weight on the "better" workers and downplay the "spammers" (those workers with accuracy close to random guessing). This strategy can potentially improve the performance of majority voting and result in a better estimate of $w_i$. This inspires us to design an iterative WMV method as follows (a code sketch is given below): (Step 1) Use majority voting to estimate labels, which are treated as a "gold standard". (Step 2) Use the current "gold standard" to estimate the worker accuracies $\hat{w}_i$ and set $\nu_i = 2\hat{w}_i - 1$ for all $i$. (Step 3) Use the current weights $\nu$ in WMV to estimate an updated "gold standard", and then return to (Step 2) until convergence.
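A compact sketch of this procedure run for a single pass (ours, for illustration); this is the one-step variant analyzed below.

```python
import numpy as np

def one_step_wmv(Z):
    """One-step weighted majority voting (osWMV): Steps 1-3 executed once.
    Z is (M, N) with entries in {+1, -1, 0}; 0 = missing.
    Assumes every worker labels at least one item.
    """
    # Step 1: majority voting gives a provisional "gold standard".
    y_mv = np.where(Z.sum(axis=0) >= 0, 1, -1)
    # Step 2: estimate each worker's accuracy against that standard.
    observed = Z != 0
    w_hat = (Z == y_mv).sum(axis=1) / observed.sum(axis=1)
    nu = 2 * w_hat - 1                       # bound-optimal weights
    # Step 3: weighted vote with the estimated weights (no return to Step 2).
    return np.where(nu @ Z >= 0, 1, -1)
```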
Empirically, this iterative WMV method converges fast. But it also suffers from the local-optimum trap, as EM does, and its error rate is generally hard to analyze. However, we are able to obtain the error rate bound in the next theorem for a "naive" version of it, one-step WMV (osWMV), which executes (Step 1) to (Step 3) only once (i.e., without returning to (Step 2) after (Step 3)).

Theorem 8. Under the Symmetric Dawid-Skene model, with label sampling probability $q = 1$, let $\hat{y}_j^{\mathrm{wmv}}$ be the label predicted by one-step WMV for the $j$-th item. If $\bar{w} \ge \frac{1}{2} + \frac{1}{2M} + \frac{\sqrt{(M-1)\ln 2}}{2M}$, the mean error rate of one-step weighted majority voting satisfies
$$\frac{1}{N}\sum_{j=1}^N P\left(\hat{y}_j^{\mathrm{wmv}} \neq y_j\right) \le \exp\left(-\frac{MN^2\tilde{\sigma}^2(1-\eta)^2}{2\left(M^2N + (M+N)^2\right)}\right), \tag{4.2}$$
where $\tilde{\sigma} = \sqrt{\frac{1}{M}\sum_{i=1}^M \left(w_i - \frac{1}{2}\right)^2}$ and $\eta = 2\exp\left(-\frac{4M^2\left(\bar{w} - \frac{1}{2} - \frac{1}{2M}\right)^2}{M-1}\right)$.

The proof of this theorem is deferred to the supplementary material. It is non-trivial, since the dependency between the weights and the labels makes it hard to apply the concentration approach used in proving the previous results; instead, a martingale-difference concentration bound has to be used.
Remarks. (1) In the exponent of the bound there are several important factors: $\tilde{\sigma}$ represents how far the accuracies of the workers are away from random guessing, and it is a constant smaller than 1; $\eta$ will be close to 0 for reasonable $M$. (2) The condition on $\bar{w}$ requires that $\bar{w} - \frac{1}{2}$ is $\Omega(M^{-1/2})$, which is easier to satisfy with large $M$ if the average accuracy of the crowd population is better than random guessing. This condition ensures that majority voting approximates the true labels; thus, with more items labeled, we get a better estimate of the workers' accuracies, and one-step WMV improves performance through better weights. (3) We comment now on how $M$ and $N$ affect the bound, and defer the formal mathematical analysis to the supplementary material, after the proof of the theorem. First, when both $M$ and $N$ increase but $M/N = r$ is constant or decreasing, the error rate bound decreases. This makes sense because, as the number of items labeled per worker increases, $\hat{w}_i$ becomes more accurate, and the weights get closer to those of the oracle bound-optimal rule. Second, when $M$ is fixed and $N$ increases, i.e., the number of items labeled increases, the upper bound on the error rate decreases. Third, when $N$ is fixed and $M$ increases, the bound decreases while $M < \sqrt{N}$ and then increases once $M$ is beyond $\sqrt{N}$. Intuitively, when $M$ is larger than $\sqrt{N}$ and keeps increasing, the fluctuation of the prediction score $\sum_{i=1}^M (2\hat{w}_i - 1) Z_{ij}$, where $\hat{w}_i$ is the estimated accuracy of the $i$-th worker, becomes large. This increases the chance of making prediction errors. When $M$ is reasonably small (compared with $N$) but increasing, i.e., more people label each item, the accuracy of majority voting improves according to Corollary 6; the resulting gain in the accuracy of the estimates $\hat{w}_i$ brings the weights of one-step WMV closer to those of the oracle bound-optimal rule.
5. Experiments.
In this section, we present numerical experiments on simulated data comparing the EM-MAP rule with the oracle MAP rule. Meanwhile, we plug the parameters estimated by EM into the bound of the oracle MAP rule in Theorem 7, which we call the MAP plugin bound (a small sketch of this computation follows). Note that the MAP plugin bound is not a true bound but an estimated one. We also compare the EM-MAP rule with majority voting and the MAP plugin bound on real-world data, where the oracle MAP rule is not available. Furthermore, we simulate one-step WMV and MV, and compare them with their bounds.
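A sketch of the plugin-bound computation (ours, for illustration): it evaluates the Corollary 5 upper bound at the Theorem 7 weights, with EM estimates plugged in, under constant-probability sampling.

```python
import numpy as np

def map_plugin_bound(w_hat, pi_hat, q):
    """MAP plugin bound: the Corollary 5 upper bound evaluated at the
    oracle-MAP weights (Theorem 7) with EM-estimated parameters plugged in.
    Assumes a constant sampling probability q.
    """
    nu = np.log(w_hat / (1 - w_hat))             # plugin log-odds weights
    a = np.log(pi_hat / (1 - pi_hat))
    norm2 = np.linalg.norm(nu)
    t1 = (q * np.sum(nu * (2 * w_hat - 1)) - abs(a)) / norm2
    if t1 < 0:
        return 1.0                               # bound uninformative here
    c = np.max(np.abs(nu)) / norm2
    sigma2 = q                                   # sigma'^2 = q for constant q
    return min(np.exp(-t1**2 / 2),
               np.exp(-t1**2 / (2 * (sigma2 + c * t1 / 3))))
```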
[Figure 1 appears here: three panels. (a) "Simulated data" (M = 11, N = 300): error rate vs. average accuracy of workers, with curves for EM-MAP, oracle MAP, MAP plugin lower bound, and MAP plugin upper bound. (b) "RTE dataset (Snow et al.)": error rate vs. sampling proportion, with curves for EM-MAP, MAP plugin bound, majority voting, and MV plugin bound. (c) "Simulation of one-step WMV": error rate (log scale) vs. average accuracy of workers, with curves for majority voting, MV upper bound, one-step WMV, and osWMV upper bound.]

Fig 1. (a) Comparison of the EM-MAP rule and the oracle MAP rule by simulation. We plug the parameters estimated by EM into the oracle MAP bound and plot the plugin bound. (b) Using the Snow et al. RTE dataset [9] to compare the mean error rates of EM-MAP, majority voting, and their bounds, obtained by plugging in the parameters estimated by comparing to the ground truth. (c) Comparison of one-step weighted majority voting (osWMV), majority voting (MV), and their bounds by simulation.
Simulated data.
The simulation is run under the Symmetric Dawid-Skene model with a constant sampling probability $q$, with the workers' accuracies drawn from a Beta$(a, b)$ distribution with $b = 2$. We control $a$ such that the expected accuracy of the workers varies from 0 to 1 with a step size of 0.02. The error rates are displayed in Fig. 1(a). Each error rate is averaged over 100 random data generations.

Fig. 1(a) shows that when the average accuracy of the workers is better than random guessing ($\bar{w} > 0.5$), the error rates of the EM-MAP rule and the oracle MAP rule are almost identical. However, when $\bar{w} < 0.5$, the error rate of the oracle MAP rule is close to 0, while the error rate of the EM-MAP rule is close to 1. This is because, if all the workers have low labeling accuracy, the EM algorithm cannot recognize this, since there is no ground truth; the oracle, knowing the true accuracies of the workers, can simply flip the labels from the workers. The plugin bounds are obtained by plugging the worker accuracies estimated by EM into the oracle MAP bound in Theorem 7. When $\bar{w} > 0.5$, we compute the MAP plugin upper bound, and the lower bound when $\bar{w} < 0.5$. Note that whether $\bar{w} > 0.5$ determines which of the two plugin bounds we report.

Real data.
Fig. 1(b) shows the results of the EM-MAP rule and majority voting on a language processing dataset from Snow et al. [9]. The dataset was collected by asking workers to perform a recognizing textual entailment (RTE) task: for each question, the worker is presented with two sentences and given a binary choice of whether the second (hypothesis) sentence can be inferred from the first. There are 164 workers and 800 sentence pairs with ground truth collected from experts. In total, 8000 labels were collected, so on average each worker labeled about 8000/164 ≈ 49 items. We estimate the sampling probability ($q_i$ for the $i$-th worker) by dividing the number of items he/she labeled by 800. Then a control variable, the sampling proportion $x$, is the probability with which we further sample from the available labels. For example, if $x = 0.6$ and the $i$-th worker labeled 20 items, then we further sample each of these 20 labels with Bernoulli(0.6). If an item labeled by the $i$-th worker has not been selected by this further sampling, we treat its label as missing. In this way, we can control the sampling vector $\vec{q}$ (see Section 2) by varying $x$ from 0 to 1 (with step size 0.05 in the simulation). Note that when $x = 1$, we use all 8000 labels. For each $x$, 40 times we repeat sampling the label matrix with Bernoulli($x$) on each available label, running the EM-MAP rule and MV, and computing the plugin bounds by plugging the estimated workers' accuracies (obtained by comparing with the ground truth) into the upper bound of the oracle MAP rule and the MV upper bound under the Symmetric Dawid-Skene model. In the end, we average the error rates of the EM-MAP rule and MV, the MAP plugin upper bound, and the MV bound, respectively. From Fig. 1(b), we can see that the MAP plugin bound approaches the error rate of the EM-MAP rule as $x$ increases.

Fig. 1(c) shows the simulation of one-step WMV and the comparison of its bound with majority voting. We simulate 15 workers and 3000 items. The simulation setup is the same as for Fig. 1(a) described above, except that we let the average accuracy of the workers start from the minimum $\bar{w}$ required in Theorem 8 instead of 0, and we run one-step WMV and MV and compute their respective bounds as we did for the EM-MAP rule. We can see from Fig. 1(c) that both the bounds and the measured error rates exhibit a "crossing" phenomenon: majority voting is better than one-step WMV in the very beginning, and then, as the average accuracy increases, one-step WMV outperforms MV, because, based on the accuracies "well" estimated from MV, one-step WMV can weight each worker according to how good he/she is. Note that the error rate is on a log scale. The tails of the error rate curves drop suddenly because we only have a finite $N$; thus the error rate cannot be arbitrarily close to 0.
6. Conclusion.
In this paper, we have provided bounds on the error rate of general hyperplane labeling rules (in probability and in expectation) under the Dawid-Skene crowdsourcing model, which includes the Symmetric Dawid-Skene model as a special case. Optimizing the mean error rate bound under the Dawid-Skene model leads to a prediction rule that is a good approximation to the oracle MAP rule. A data-driven WMV (one-step WMV) is proposed to approximate the oracle MAP rule, with a theoretical guarantee on its error rate. Through simulations under the Symmetric Dawid-Skene model (for simplicity) and simulations based on real data, we have three findings: (1) the EM-MAP rule is close to the oracle MAP rule, with superior performance in terms of error rate; (2) the plugin bound for the oracle MAP rule is also applicable to the EM-MAP rule; and (3) the error rate of one-step WMV is bounded well by the theoretical bound.

To the best of our knowledge, this is the first extensive work on error rate bounds for general prediction rules under the practical Dawid-Skene model for crowdsourcing. Our bounds are useful for explaining the effectiveness of different prediction rules/functions. In the future, we plan to extend our results to the multiple-labeling situation and to explore other types of crowdsourcing applications.
References.

[1] A. P. Dawid and A. M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20–28, 1979.
[2] C. Ho, S. Jabbari, and J. W. Vaughan. Adaptive Task Assignment for Crowdsourced Classification. In ICML, 2013.
[3] R. Jin and Z. Ghahramani. Learning with Multiple Labels. In NIPS, 2002.
[4] D. R. Karger, S. Oh, and D. Shah. Iterative Learning for Reliable Crowdsourcing Systems. In NIPS, 2011.
[5] Q. Liu, J. Peng, and A. Ihler. Variational Inference for Crowdsourcing. In NIPS, 2012.
[6] V. C. Raykar, S. Yu, L. H. Zhao, C. Florin, L. Bogoni, and L. Moy. Learning From Crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
[7] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. In KDD, pages 614–622, 2008.
[8] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring Ground Truth from Subjective Labelling of Venus Images. In NIPS, 1995.
[9] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In EMNLP, 2008.
[10] P. Welinder, S. Branson, S. Belongie, and P. Perona. The Multidimensional Wisdom of Crowds. In NIPS, 2010.
[11] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In NIPS, 2009.
[12] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. G. Dy. Modeling Annotator Expertise: Learning When Everybody Knows a Bit of Something. In AISTATS, volume 9, pages 932–939, 2010.
[13] D. Zhou, J. Platt, S. Basu, and Y. Mao. Learning from the Wisdom of Crowds by Minimax Entropy. In NIPS, 2012.
Acknowledgements.
We thank Riddhipratim Basu and Qiang Liu for helpful discussions.
SUPPLEMENTARY MATERIAL
Supplement A: Missing proofs in this paper (∼hwli/SupplementaryCrowdsourcing.pdf). We have put all the missing proofs in the supplementary file, which can be downloaded from the link above.
367 Evans Hall, Department of Statistics
University of California
Berkeley, CA 94720-1776, USA
E-mail: [email protected], [email protected]