Finding the Ground-Truth from Multiple Labellers: Why Parameters of the Task Matter
Robert McCluskey
Data Science Department, Caspian, Newcastle-upon-Tyne, UK
Amir Enshaei
Data Science Department, Caspian & Wolfson Childhood Cancer Research Centre, Newcastle-upon-Tyne, UK
Bashar Awwad Shiekh Hasan
Data Science Department, Caspian, Newcastle-upon-Tyne, UK
February 18, 2021

Abstract
Employing multiple workers to label data for machine learning models has become increasingly important in recent years, with greater demand to collect huge volumes of labelled data to train complex models while mitigating the risk of incorrect and noisy labelling. Whether it is large-scale data gathering on popular crowd-sourcing platforms or smaller sets of workers in high-expertise labelling exercises, there are various methods recommended to gather a consensus from employed workers and establish ground-truth labels. However, there is very little research on how the various parameters of a labelling task can impact said methods. These parameters include the number of workers, worker expertise, the number of labels in a taxonomy and the sample size. In this paper, Majority Vote, CrowdTruth and Binomial Expectation Maximisation are investigated against the permutations of these parameters in order to provide a better understanding of which parameter settings give an advantage in ground-truth inference. Findings show that both Expectation Maximisation and CrowdTruth are only likely to give an advantage over majority vote under certain parameter conditions, while there are many cases where the methods can be shown to have no major impact. Guidance is given as to the parameter settings under which each method works best, while the experimental framework provides a way of testing other established methods and also of testing new methods that can attempt to provide advantageous performance where the methods in this paper did not. A greater level of understanding regarding optimal crowd-sourcing parameters is also achieved.

Keywords: Labels · Crowdsourcing · Expectation-maximisation · CrowdTruth · Majority vote
For supervised learning, the gathering of precisely labelled data is crucial to the success of any model, although it is often the most labour-intensive part of any project that requires external resources[1]. For highly specialised tasks, Subject Matter Experts (SMEs) are often employed to provide labels, while crowd-sourcing platforms such as Mechanical Turk (MTurk) have also become popular for gathering high volumes of data quickly when the task can be completed by a layman, as can be seen in the growth of data sets like ImageNet[2] and the ability of crowd workers to sometimes provide labels similar to those of a "gold standard"[3]. Obtaining this "gold standard" is crucial to the production of a high-performance model.
As well as increasing the speed at which data is gathered, having multiple workers label the same data can remove the problem of a single worker being considered infallible, thus reducing the risk of incorrect labels due to human error, whether it be due to a lack of expertise in labelling a particular data point, misinterpreting a task[4] or applying the incorrect label accidentally. There also remains the possibility that workers simply have different interpretations of the labels they apply[5]. Whether it is crowd-sourced workers or experts, we still run into the issue of which workers we should trust in which instances, and inferring which label can be considered the "ground-truth" for each datum in the sample becomes a problem when workers give different labels to a particular data point. While there are methodologies to resolve these problems, little is known about the circumstances under which these methods provide the most benefit and whether some methods are better than others under certain task conditions.

Crowd-sourcing platforms have grown in popularity due to the ease of gathering data for a relatively low cost. However, accuracy from each worker is not assured, and the anonymity of the workers means that the person who wants the data must put blind faith into others who may provide poor quality labels. Halpin & Blanco[6] considered two different types of poor quality workers: those who deliberately select answers as quickly as possible in order to get paid - referred to as "bad faith" workers - and those who are "unsuitable" workers, due to their lack of understanding or poor ability. The "bad faith" workers may also be known as "spammers"[7] or "adversarial" workers[8]. The growth of crowd-sourcing platforms has come at the cost of more malicious workers sabotaging machine learning projects for financial gain[9]. Detecting these "bad faith" workers is of critical concern for the users of crowd-sourcing platforms, as they waste time and money while also negatively impacting the final decision. On the other hand, some workers may have misunderstood the task or have limited experience of labelling the given data, which can be remedied with stronger guidance or training. However, when working with a high number of anonymous workers, it is difficult to distinguish between the two. Detecting poor workers early in a data gathering exercise is of the utmost importance when considering efficiency and cost effectiveness[10].

Crowd-sourcing platforms, where a high number of workers can be employed, are most suitable for tasks that can be completed by a layman, but tasks requiring a high level of expertise often need SMEs. Owing to a smaller pool of highly skilled expertise, recruiting SMEs is more challenging and costly[11], and crowd-sourcing platforms may not be able to provide access. When a task is of a certain degree of difficulty or complexity, experts can often disagree among themselves - something that is often seen in the diagnosis of medical conditions[12, 13, 14] due to the highly specialised nature of the task and the reliance on individual professional experience, adding more difficulty to the data gathering process. As an example, compare a task where a worker must label whether an image is a cat or a dog with a task where a worker must label whether a cancer diagnosis is benign or malignant.
The former task has more people who will likely be able to contribute knowledgeably, while the latter necessitates specialised training and expertise. Even in tasks where we consider people as "experts", seeing noise in label sets is inevitable, and this can have a severe impact on any model if not controlled correctly[15]. One possible control is employing a single expert, or an "oracle", to make final decisions on data entries where there is disagreement amongst non-experts[16]. This, however, creates an extra overhead and also relies upon the knowledge of the "oracle" being entirely accurate, which is a flawed assumption and does not entirely rule out errors in the final label set.

One of the reasons for disagreements in expert labelling is that, even within different fields of expertise, some are better at identifying certain things than others. Additionally, inherent subjectivity ensures there will always be some variation in answers[17], as well as bias within each worker[18]. Valizadegan, Nguyen & Hauskrecht[19], who attempted to adapt popular consensus methods based on large crowds to work with smaller sets of experts, discussed the impact on expert labelling of SMEs having different levels of expertise, utilities, knowledge and subjective preference. These various aspects can prove troublesome for those administering the task, as it is often unknown who is right when there are disagreements. Those with specialist knowledge also often use every piece of information available to them, and this cannot always be accounted for within a machine learning labelling task, which is often a replication of a real-world setting. Lacking this information, experts may then have to pivot to making probable guesses, which results in more noise in the data set[20].

Whereas some of the previous work has focused on independent aspects of the labelling task, such as the number of workers, the number of labels, the sample size and the expertise levels of the workers, the differing permutations of each of these parameters have not been fully considered before. More importantly, whether certain ground-truth inference methods are better with certain parameters of a labelling task is not well understood. Previous work has suggested that while there are a number of different options for practitioners to choose from in terms of methods and algorithms, they can vary largely in their performance depending on the data set they are attempting to infer ground-truth from[21]. For the number of workers, previous studies have attempted to offer optimisation for recruitment[22], but these have not controlled for the other parameters of the labelling task and have not considered whether some methods could provide an opportunity to make crowd-sourcing recruitment even more efficient. In terms of the number of labels, binary label tasks have often been a popular choice in research[19, 23, 24], and while on rare occasions there are tasks with more labels[25], these often do not look at data sets with a double-digit label taxonomy.
It is also important to understand how ground-truth inference scales with the number of labels in a taxonomy, as tasks theoretically become more inherently difficult when more label choices are added. Sample size is something that is often overlooked in machine learning research[26], and time limitations and cost can often prevent large amounts of data being gathered, meaning there is no guidance for practitioners to maximise their chances of getting the most accurate label set with respect to these restrictions. Expertise can be difficult to ascertain, but it is valuable knowledge for any practitioner, as the ideal situation is to get the best possible workers and thereby more confidence in their answers. However, when this is not an option, as can be seen in crowd-sourcing tasks, it can be of benefit to know how many workers should be employed in order to mitigate this risk. Alternatively, it is also beneficial to know if there are any advantages to ensuring worker expertise is high before administering a task.

The above summarises some of the important considerations regarding interactions between the parameters of labelling tasks. This paper investigates popular methods for inferring ground-truth from multiple workers and aims to explore how they are impacted by the various parameters of a labelling task, drawing conclusions about which methods are suited to which parameters and providing guidance on when they can be shown to be advantageous when label consensus is sought. Although previous research has compared some popular methods, there are no studies, to our knowledge, that consider the impact of all the parameters discussed above or the relationship of ground-truth inference methods with all of these parameters. Thus, this paper will attempt to determine the parameters under which popular methods for ground-truth inference work best, and will help provide future researchers with guidance as to when the chosen methods are appropriate and when they are not. As well as this, while previous studies have focused on the number of workers at which crowd-sourcing works best[22], we hope to provide more depth to optimal crowd-sourcing choices by studying the relationships between all the parameters with respect to the methods, as opposed to the impact of a single parameter.
The easiest and most common way to get the ground-truth label when multiple workers have provided an answer is to use Majority Vote (MV). In its briefest definition, the answers of workers for each input item are treated as votes, with the most popular vote being considered the ground-truth.

Take an example where three workers have labelled a single data item, where the label can be either "red" or "blue". The first two workers choose "red", while the final worker chooses "blue". When a practitioner comes to assigning a label to this data item for training a model, the label "red" would be chosen, since more workers have chosen this than "blue". This means that any mistakes made by a single worker are mitigated. With a binary label choice and an odd number of workers, there is always a guarantee of a consensus label, but outside of these resources and task choices things can become more difficult. Take the previous example and add a third label - "yellow". Presume each of the three workers has picked a different label; there is now no majority to pick from for this data item. Alternatively, having an even number of workers for a task with a binary option means that there is a possibility of an even 50/50 split in answers. Let us consider a situation where we have four workers as well as the three labels previously mentioned: two of our workers select "red", one worker selects "blue" and the final worker selects "yellow". This leaves the practitioner in a dilemma; the label "red" now has some agreement among the workers, but it is not a majority exceeding 50%. When no majority is found, the practitioner is left with three potential options:

1. Return to the workers to ask if any are willing to reconsider their answers in the hope that a majority can be found. The practitioner must be careful to remain objective in this instance so as not to lead the worker into forcefully changing their label, and there is no guarantee that a majority will be reached for all conflicts.

2. Drop the data point so that it is not included in the final data set. In contrast to point one, this means that no time is wasted in trying to "fix" labels where no majority can be reached, and instead this time can be dedicated to getting more labels. The obvious downside of this approach is that data loss is inevitable, and tasks of higher difficulty - which are more likely to see disagreement - may not be understood by the model.

3. From the selected answers, pick one at random to use as ground-truth. The choice can also be weighted by how many workers selected each answer. This means no data is lost, although there is now the potential for added noise in the labels due to the random selection.

One of MV's key weaknesses is that the expertise of all workers is considered equal across the board, which is a flawed assumption for most real-world tasks. If we consider a situation with one highly accurate worker and two highly inaccurate workers who are consistent with each other (with this information not apparent to the practitioner), we would accept the answers of the two inaccurate workers more often than not, compromising our data set.
Even in the case of SMEs, expertise levels can vary, and the decisions made by workers can differ when a task has some level of subjectivity. Despite the issues with MV, it is a quick method and does have theoretical grounding in the "wisdom of the crowd"[27]. This suggests that, when putting a question to a crowd, the answers will average out to an approximation of the ground-truth. When employing a high number of crowd-sourcing workers, the hope is that enough of them will give the correct answer so that the risk from those who do make an error is reduced. However, there is no rule of thumb for the number of people in the crowd, and having a high number of workers is often not possible due to expertise requirements or costs.
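To make the voting and tie-breaking mechanics concrete, the sketch below (our own illustration, not code from the paper) implements plurality voting over a worker-answer matrix with the random tie-break of option 3 above; the array layout and function name are assumptions.

```python
import numpy as np

def majority_vote(answers, n_labels, rng=None):
    """Infer one label per sample by majority (plurality) vote.

    answers: (n_samples, n_workers) integer array of worker labels.
    Ties are broken uniformly at random among the tied labels
    (option 3 above); rows could instead be dropped (option 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    inferred = np.empty(answers.shape[0], dtype=int)
    for s, row in enumerate(answers):
        counts = np.bincount(row, minlength=n_labels)
        winners = np.flatnonzero(counts == counts.max())
        inferred[s] = rng.choice(winners)
    return inferred

# Example: 4 workers, labels 0="red", 1="blue", 2="yellow".
votes = np.array([[0, 0, 1, 2]])          # two "red", one "blue", one "yellow"
print(majority_vote(votes, n_labels=3))   # -> [0]: "red" wins the plurality
```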
CrowdTruth (CT) 2.0[28] is an open-source framework (https://github.com/CrowdTruth/CrowdTruth-core) designed to sit on top of crowd-sourcing platforms such as MTurk and CrowdFlower, offering an automated solution to inferring ground-truth data. In contrast to MV, which enforces agreement between workers, the framework captures inter-annotator disagreement to help resolve ambiguity in the data when disagreements occur among workers. CT considers that there are three main aspects that should be taken into account when inferring ground-truth: the workers, the input data (media units) and the annotations[29]. Upon collection of the crowd-sourced labels, metrics are calculated for each of the three components, which in turn have an impact on each other. For example, if there are 20 workers for a single input data unit and 19 of the workers choose label A while 1 worker selects label B, the worker who chose label B would be penalised heavily in their worker quality score, as the high level of agreement for this input data would assign it a high quality metric score. Alternatively, if there are ten labels and annotation is evenly split among the workers for an input unit, the worker quality penalty would be lower, while the input data quality score for that item would be low. CT is currently in its second iteration, with the improvements over the first version[30] centred around the metrics, so that low quality workers disagreeing does not indicate that the input data is ambiguous, and, vice-versa, a low quality input does not indicate a poor worker. A naive example of this could be a task where dog breeds are labelled and all the input data is blurred, making it difficult for the workers to select a correct annotation. If workers give different annotations, it is an indicator that the input data itself is the issue and the annotations are not a reflection of the worker quality, thus offering a statistic that allows the task administrator to remove problematic input units.

Whereas CT provides a number of useful statistics that help give a good overview of various aspects of a task, we are mostly interested in the "Media Unit - Annotation Score", as shown in equation 1, which gives a score for each input sample to determine the confidence that the annotation (label) is associated with that item. It is weighted by a worker quality score algorithm, thus boosting the answers of better workers while penalising the scores of those who are considered to have a lower quality, with quality determined from worker agreement across all units annotated by said worker. Previous experiments investigated thresholds of the U score, which allow the user to make a judgement about which units they want to use for training[31]. However, there is no rule of thumb for choosing a "best" annotation threshold score, and it seems to depend on the data set - with recommendations to experiment with thresholds.

U(s, g) = \frac{\sum_{w \in W} n^{(w)}_{sg} Q(w)}{\sum_{w \in W} Q(w)}    (1)

n^{(w)}_{sg} refers to worker w's labelling for input unit s, with each input unit having a choice of G labels. This is weighted by the worker quality score Q(w) for each worker w that has supplied an answer for that input unit s. The outcome is a ratio of how many workers picked each annotation, weighted by their perceived ability - effectively acting like a weighted MV.
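As an illustration only, the sketch below evaluates equation (1) given a set of worker quality scores Q(w); computing Q itself, from the worker-worker and worker-media unit agreement metrics described next, is the job of the CrowdTruth-core framework, so the quality values, array layout and function name here are assumptions rather than the framework's API.

```python
import numpy as np

def media_unit_annotation_score(annotations, worker_quality):
    """Media unit - annotation score U(s, g) from equation (1).

    annotations: (n_workers, n_samples, n_labels) one-hot array, where
        annotations[w, s, g] = 1 if worker w chose label g for unit s.
    worker_quality: (n_workers,) array of worker quality scores Q(w),
        assumed to be supplied externally (e.g. by CrowdTruth-core).
    Returns a (n_samples, n_labels) array of quality-weighted vote ratios.
    """
    q = np.asarray(worker_quality, dtype=float)
    weighted_votes = np.tensordot(q, annotations, axes=([0], [0]))
    return weighted_votes / q.sum()

# Toy example: three workers, one media unit, three labels.
ann = np.array([[[1, 0, 0]], [[1, 0, 0]], [[0, 1, 0]]])
print(media_unit_annotation_score(ann, worker_quality=[0.9, 0.8, 0.3]))
# -> [[0.85 0.15 0.  ]]: label 0 is favoured, scaled by worker quality.
```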
Q is the product of two other equations: the worker-media unit agreement - the average cosine distance between a single worker's annotations and the annotations of the input units they have completed - and the worker-worker agreement, which indicates how similar a worker is to all others by calculating pairwise agreement. Both are weighted by the unit quality score, which means workers are not penalised heavily if the input item itself seems ambiguous. Full explanations of all the metrics can be found in the work by Dumitrache et al.[28].

A popular alternative method for inferring ground-truth is Expectation Maximisation (EM), as demonstrated by Dawid & Skene[32]. The method takes initial guesses of the ground-truth labels, usually obtained with MV, from the answers of each worker w, and then infers the error rate of each worker by determining how many times a worker w selected label g when j is the correct one, with g, j ∈ G labels within the taxonomy of label choices G. Doing this produces a square matrix for each worker w, with each row summing to 1. The diagonal of each matrix is therefore the rate at which a worker is accurate for each label, and forms the basis of EM, in that it determines how consistent a worker is in applying labels.
This error rate, along with the marginal probabilities of all labels G, is used to recalculate the ground-truth. The algorithm continually loops between two stages - an expectation step (E-step) and a maximisation step (M-step) - until it converges to produce ground-truth results for all labels. In general, the E-step relies upon the quantities computed in the M-step, thus the M-step is often computed first with an initialised guess for the parameters. The following discusses the algorithm in regard to this analysis.

Assuming the ground-truth labels are unknown, the error rate for each worker, seen in equation 2, and the marginal probabilities for each label j, seen in equation 3, are computed in the M-step. Within these equations, n^{(w)}_{sg} refers to the counts of given answers for each item in the sample from a worker, similar to that used in equation 1. Equation 2 takes the inferred ground-truth labels, T, and the counts of each answer given, n. For the first iteration, T is initialised if not known, usually with a majority vote. Equation 3 takes the inferred ground-truth labels T and divides their sums by the number of items in the sample, S.

EM is capable of handling a worker answering the same sample more than once, but under the assumption that each worker answers each item in the sample only once, each worker's answer to a sample can be considered a one-hot encoded vector of length G. For example, if worker 1 has three label options and chooses the second label, the worker answer vector for that sample is [0, 1, 0]. For error rate estimation, the estimated ground-truth labels are multiplied by the worker answers for each item in the sample, and these are then normalised to produce a square matrix for each worker w, which gives the error rate for each label.

\hat{\pi}^{(w)}_{jg} = \frac{\sum_{s} T_{sj} \, n^{(w)}_{sg}}{\sum_{g} \sum_{s} T_{sj} \, n^{(w)}_{sg}}    (2)

\hat{P}_{j} = \frac{\sum_{s} T_{sj}}{S}    (3)

Finally, the quantities from the M-step are used in the E-step, where new probabilistic estimates of the ground-truth labels T are made, as can be seen in equation 4. This gives a probability of each label being the inferred ground-truth for each item in the sample. EM then loops between these two steps until it converges. Here, q is introduced to denote the presumed true label in the set for a sample item, while D refers to the input data of worker answers.

p(T_{sj} = 1 \mid D) = \frac{\prod_{w=1}^{W} \prod_{g=1}^{G} \left( \pi^{(w)}_{jg} \right)^{n^{(w)}_{sg}} P_{j}}{\sum_{q=1}^{G} \prod_{w=1}^{W} \prod_{g=1}^{G} \left( \pi^{(w)}_{qg} \right)^{n^{(w)}_{sg}} P_{q}}    (4)

The original Dawid & Skene[32] method simply infers the ground-truth of the given data. Many modern methods are based upon it - most notably Raykar et al.[17], which extends the method to produce a classifier. The previously mentioned work of Valizadegan, Nguyen & Hauskrecht[19] is also noteworthy, as it adapted the work performed by Raykar et al.[17] to situations where there are fewer experts, focusing solely on high-expertise tasks where worker resources may be scarce. The work performed by Raykar et al.[17] is better geared towards a much higher number of workers, as outlined by the authors. Even the resulting error rate for each worker calculated by EM is useful in itself and can be used to determine the cost of each worker, something particularly valuable in crowd-sourcing[33].
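For readers who prefer code, the following is a compact sketch of the Dawid & Skene iteration as given by equations (2)-(4), assuming one-hot worker answers. It is our own paraphrase rather than the implementation used in the experiments, and the array layout and function name are assumptions.

```python
import numpy as np

def dawid_skene_em(n, n_iter=100, tol=1e-6):
    """Basic Dawid & Skene EM over one-hot worker answers.

    n: (n_workers, n_samples, n_labels) array with n[w, s, g] = 1 if
       worker w gave label g to sample s (0 otherwise).
    Returns T: (n_samples, n_labels) posterior over the ground-truth labels.
    """
    W, S, G = n.shape
    # Initialise T with a (soft) majority vote over the raw answer counts.
    T = n.sum(axis=0).astype(float)
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: per-worker error rates (eq. 2) and label priors (eq. 3).
        pi = np.einsum('sj,wsg->wjg', T, n)               # expected counts
        pi /= pi.sum(axis=2, keepdims=True) + 1e-12       # rows sum to 1
        P = T.sum(axis=0) / S
        # E-step: posterior over the true label of each sample (eq. 4),
        # computed in log space for numerical stability.
        log_T = np.log(P + 1e-12) + np.einsum('wsg,wjg->sj', n, np.log(pi + 1e-12))
        log_T -= log_T.max(axis=1, keepdims=True)
        T_new = np.exp(log_T)
        T_new /= T_new.sum(axis=1, keepdims=True)
        if np.max(np.abs(T_new - T)) < tol:
            T = T_new
            break
        T = T_new
    return T
```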
For the data, two different sets were chosen, with their label distributions used to create the data sets for the various tests. With the base data set denoted D, samples of six target sizes, denoted S and ranging from 50 to 2,000 items, are generated for the experiments. Each item in S can be thought of as the target sample size that needs to be generated in order to test the impact of S on the various consensus methods. Depending on the label distributions, generated sample sizes can sometimes be slightly higher or lower than the target S to help fit the respective distribution.
The first data set is the Wisconsin Breast Cancer data set (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html), which includes 569 samples where the label is either malignant or benign. The data set was selected due to it being readily available and commonly used for binary classification in research. In this instance, we treat malignant as false and benign as true. The data associated with each label contains 39 different features, all of which are continuous, although it should be noted that the input data to each of the algorithms is the label choices of the workers, not these features.

The second data set is the 20 Newsgroups data set (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html), which includes 18,846 samples in total and 20 different labels. This data set is used here to generate the different label sets by taking subsets of the data based on the number of labels that are required for the experiment. The data set is one of few readily available with so many label choices, the distribution of labels is fairly balanced for each subset of selected labels, and it is popular in natural language processing applications of machine learning. The data associated with each label is in text format and, again, is not used as part of any of the ground-truth methods.

Each experiment takes a number from a set W of ten worker-set sizes ranging from 3 to 40, with each w ∈ W denoting the number of workers that are generated in a particular experiment. All generated workers are assigned an expertise score λ between two boundaries, lb < λ < ub, where lb and ub are the lower and upper bounds of the expertise, respectively. The final expertise score set for all desired w workers is denoted as Λ = {λ_1, λ_2, ..., λ_w}. For all higher expertise experiments, the bounds are set so that every generated worker gets more than half of the answers correct regardless of the number of labels in the set - providing a measure of what we consider a high-expertise set - with an upper limit below 1 so that every worker makes at least some minor errors. For lower expertise experiments, ub = 0.8, while the lower bound is found through a lower-bound model finder, which is discussed in depth below; 0.8 was chosen due to a similar expert generation upper boundary selected in Ipeirotis et al.[33]. This is done to produce more noise in the experiment and represent a lower-expertise set. Each worker is assigned a randomly selected expertise λ between the two boundaries. λ signifies how many answers we want to keep correct for an individual worker. As an example, if λ = 0.6 - and our ground-truth labels are T - we want 60% of a worker's answers to equal their equivalent value in T, while changing 40% of the answers so that they do not equal their equivalent in T.

For the lower-bound model finder, 100 different random forest models are trained with the entire base data set D_g ⊆ D for each label set size in G = {2, 3, 5, 7, 10, 15, 20}, together with L random labels, each chosen with equal probability from the G label choices. While a random forest model has been chosen here, any classification model will suffice, and its selection is simply due to it being a popular method that is widely used. The main goal is to produce an lb for expertise that can simulate a worker who is, at worst, performing slightly better than random in their label selection. The experiments presume that workers have good intentions to perform well and are not maliciously selecting incorrect answers, which is why setting lb to 0 does not make sense.
Please note, G = 2 corresponds to the Wisconsin Breast Cancer data set, while all G ≠ 2 are subsets of the 20 Newsgroups data set. Based on the selection of G, T denotes the ground-truth of D_g. For the Wisconsin Breast Cancer data set, no pre-processing was applied to the data, while for the 20 Newsgroups data, a TF-IDF vector representation of each record was used. As an example, for an experiment with the first 4 labels from 20 Newsgroups, all the data labelled g ∈ {0, 1, 2, 3} was gathered (therefore all items not labelled {0, 1, 2, 3} are not used) and vectorised. L random labels are created which match the size of T, with each item selected from the label choices in G. The data is split - 70% for training and 30% for testing. Training uses the 70% split of D and its associated L, while the remaining 30% of D is used to predict new answers L′, with L′ then compared against the corresponding 30% of T to calculate the weighted f1-score. The mean of all weighted f1-scores over the 100 repetitions is then computed and used as the lb expertise for the experiment. For each G label set experiment, this lower-bound model is computed once and then used across all permutations with that selection of G labels. The parameters of the random forest models are the defaults of the scikit-learn package (version 0.21.3) for Python (https://scikit-learn.org/stable/index.html). The lb expertise scores for each label set in G are displayed in Table 1.
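A sketch of this lower-bound procedure, under the assumption that the features and real ground-truth labels of the chosen G-label subset are already loaded, might look as follows; the function name and argument layout are our own rather than the authors' exact script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def lower_bound_expertise(X, y_true, n_labels, n_repeats=100, seed=0):
    """Estimate lb by training classifiers on random labels and scoring
    their predictions against the real ground-truth labels.

    X: feature matrix of the G-label subset (e.g. TF-IDF vectors);
    y_true: its real ground-truth labels T; n_labels: G.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for rep in range(n_repeats):
        # Random labels L drawn uniformly over the G label choices.
        L = rng.integers(0, n_labels, size=len(y_true))
        X_tr, X_te, L_tr, _, _, y_te = train_test_split(
            X, L, y_true, train_size=0.7, random_state=rep)
        clf = RandomForestClassifier(random_state=rep).fit(X_tr, L_tr)
        # Score the test predictions against the real labels.
        scores.append(f1_score(y_te, clf.predict(X_te), average='weighted'))
    return float(np.mean(scores))
```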
Table 1: Lower-bound expertise lb for each label set, determined by the average f1-score of 100 repetitions of random forest models on the base data.

G (labels in set)    Lower-bound expertise lb
2                    0.483
3                    0.327
5                    0.189
7                    0.139
10                   0.094
15                   0.063
20                   0.049

Table 2: The distribution of label choices for each D_g, which are used with S to generate the T ground-truth labels.

Label   G=2     G=3     G=5     G=7     G=10    G=15    G=20
0       0.373   0.289   0.17    0.12    0.083   0.055   0.042
1       0.627   0.353   0.207   0.146   0.101   0.067   0.052
2       -       0.358   0.21    0.148   0.102   0.068   0.052
3       -       -       0.209   0.147   0.102   0.067   0.052
4       -       -       0.204   0.145   0.1     0.066   0.051
5       -       -       -       0.148   0.102   0.068   0.052
6       -       -       -       0.146   0.101   0.067   0.051
7       -       -       -       -       0.103   0.068   0.053
8       -       -       -       -       0.103   0.068   0.053
9       -       -       -       -       0.103   0.068   0.053
10      -       -       -       -       -       0.068   0.053
11      -       -       -       -       -       0.068   0.053
12      -       -       -       -       -       0.067   0.052
13      -       -       -       -       -       0.068   0.053
14      -       -       -       -       -       0.067   0.052
15      -       -       -       -       -       -       0.053
16      -       -       -       -       -       -       0.048
17      -       -       -       -       -       -       0.05
18      -       -       -       -       -       -       0.041
19      -       -       -       -       -       -       0.033

Each generated sample of size S is built from the distribution of labels in D_g. For the 20 Newsgroups data, each D_g only includes the labels up to the point of G, choosing them sequentially. So, for example, when G = 3, D_g is reduced to include only the data labelled g ∈ {0, 1, 2}, and the distributions of this set are used to generate S and L. The distributions can be seen in Table 2. Please note that, due to the sampling and rounding, the size of a generated sample is not always exactly S. For example, in some instances it may include 251 labels instead of 250 so that the distribution of the original D_g can be best matched.

For each experiment, a label set G and a sample size S are chosen, with lb and ub determining Λ based on whether it is a lower or higher expertise experiment. S determines the number of T labels we want to produce based on the distributions of D_g. The worker sets W are tested on each of these experiments, with ten repetitions for each W. Each item in W signifies the number of workers we want to generate answers for. For example, W = 10 means we generate ten workers with different answers governed by their expertise scores λ_w, chosen between the boundaries of the experiment. Ten repetitions were chosen to provide a balance between obtaining a good range of scores and not being computationally expensive. Initial experiments were run with higher numbers of repetitions, but these did not add any value considering their added computational cost.
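As an illustration of the worker-generation step used in each repetition (the corruption procedure is described in detail in the next paragraph), a minimal sketch might look as follows; the function name and array layout are assumptions rather than the authors' code, and the example boundaries are the lower-expertise values quoted above.

```python
import numpy as np

def simulate_workers(T, n_workers, lb, ub, n_labels, rng=None):
    """Generate noisy worker answers from ground-truth labels T.

    T: (n_samples,) ground-truth labels generated from the D_g distribution.
    Each worker receives an expertise λ drawn uniformly from (lb, ub); a
    random 1 - λ fraction of their answers is flipped to a different label
    chosen uniformly at random. Returns (n_samples, n_workers) answers A
    and the per-worker expertise scores.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = np.asarray(T)
    n_samples = len(T)
    A = np.tile(T, (n_workers, 1)).T              # start from the truth
    lam = rng.uniform(lb, ub, size=n_workers)     # expertise per worker
    for w in range(n_workers):
        n_wrong = int(round((1 - lam[w]) * n_samples))
        wrong_idx = rng.choice(n_samples, size=n_wrong, replace=False)
        for s in wrong_idx:
            label = rng.integers(n_labels)
            while label == T[s]:                  # re-draw until it differs
                label = rng.integers(n_labels)
            A[s, w] = label
    return A, lam

# e.g. a lower-expertise set of 10 workers for a binary task (lb from Table 1):
# A, lam = simulate_workers(T, n_workers=10, lb=0.483, ub=0.8, n_labels=2)
```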
For each of the ten repetitions, a different accuracy score and set of worker-generated answers were created, producing a unique noisy version of T for each worker, denoted by A_sw. The noise in the labels is determined by the previously mentioned λ parameter, selected at random for each worker between the two expertise boundaries, with the items selected to be changed to incorrect also picked at random. To decide which label to assign to an item selected to be incorrect, each label in G is assigned an equal probability of 1/G, and a label is picked at random until A_sw ≠ T_s. With A_sw computed in an iteration, the matrix is used to infer ground-truth T′_m via each of the algorithms M: MV, CT and EM (the EM implementation is taken from https://github.com/dallascard/dawid_skene). All algorithms are therefore presented with the same generated worker answers in each repetition. The weighted f1-score comparing T′_m to T is then computed for each M. An ANOVA test compared the f1-score distributions created by the ten repetitions, with p-values recorded and tested against two significance thresholds.

As an example of a single experiment, consider the lower expertise boundaries for workers, with S = 500 and G = 5, generating a set of T_s labels based on the distribution of labels from D_g. Each w ∈ W is then looped through for ten repetitions, which can be thought of as creating 3 workers with different expertise 10 times, then creating 5 workers with different expertise 10 times, and so on. In a single repetition, the workers are created by assigning them an expertise score between lb and the lower-expertise upper bound of 0.8. From the ground-truth labels T, a noisy version for each worker is created based on that worker's assigned λ. The concatenated version of all workers' answers, A_sw, is then put through each of the three methods, taking the label with the highest value for each item in S to produce an inferred ground-truth T′_m. The results for each method, T′_m, are then compared to T and a weighted f1-score is calculated and recorded. This process is completed nine more times, with all f1-scores recorded for each iteration, before moving on to the next worker-set item in W.

For higher expertise experiments, only S = 500 was used. As previously mentioned, all higher expertise experiments set each worker's expertise so that more than half of their answers are correct. For all experiments, when W = 3, all methods had a larger range of f1-scores, with the smallest weighted f1-scores occurring when G = 3 ∧ W = 3; these were MV = 0.72, CT = 0.72 and EM = 0.74, although they were all considered outliers, as the mean f1-scores in this experiment instance were MV = 0.91, CT = 0.95 and EM = 0.92. For all experiments using the 20 Newsgroups data, beyond a small number of workers there was no f1-score below 0.99 for any method, apart from one instance for MV when W = 13 ∧ G = 3. For the experiments with the larger label sets, all results produced a perfect f1-score of 1.0 for each iteration once the worker set was large enough, with the smaller label sets also reaching perfect f1-scores at slightly larger worker sets. Figure 1 shows the results of selected experiments, displaying how more label choices resulted in quicker convergence to a perfect f1-score with fewer workers. There were no instances where CT outperformed EM to a significant degree, and vice-versa.

Selected results for the high expertise experiments where S = 500 are displayed in Fig 2: green cells indicate a significant difference in favour of EM, while red cells indicate instances where there was no significant difference.
Of the 70 permutations for the high expertise experiment, EM produced a significantly more favourable result than MV in five instances, with four of them being for G = 2 and the remaining one when G = 3. The experiment G = 2 ∧ W = 10 produced a result significant at the stricter threshold.

Selected results where S = 500 are displayed in Fig 3: green cells indicate a significant difference in favour of CT, while red cells indicate instances where there was no significant difference. Of the 70 permutations for the high expertise experiment, CT produced a significantly more favourable result than MV in three instances. For experiments where G = 2, the worker sets W = 10 and W = 18 produced significant results, while G = 7 ∧ W = 3 produced a single significant result.

For the lower expertise boundaries, where each worker is assigned an expertise of lb < λ < 0.8, the full set of S sample sizes is used, resulting in 420 permutations. Unlike the higher expertise experiments, no method converged to perfect f1-scores, although there continued to be a pattern in all experiments whereby the greater the value of W, the higher the average f1-score, while the range of scores for each method tended to get smaller.
Figure 1: Selected results for higher expertise, displaying the quick convergence to a near-perfect f1-score for all methods when more labels are included. (a) Higher expertise experiment with two label choices and 569 sample size from the Wisconsin Breast Cancer data set; note that CT has the same performance as MV when there are three workers. (b) Higher expertise experiment with seven label choices and 500 sample size from the 20 Newsgroups data set; compared to Fig 1(a), the results converge faster to a near-perfect f1-score with fewer workers. (c) Higher expertise experiment with fifteen label choices and 500 sample size from the 20 Newsgroups data set; while this also converges to a near-perfect f1-score with fewer workers than Fig 1(a), it should be noted that, no matter how many labels there are to choose from, having three workers will always produce a large variance.
In all experiments where W = 3, CT produced identical results to MV, with every f1-score in the ten repetitions being the same.

Among all 420 permutations, EM was significantly more favourable than MV in 59 (14.0%) instances. A large proportion (80%) of these significant results occurred at the smallest label sets, as the likelihood of EM giving a significant advantage over MV falls dramatically as G increases. EM also benefited from larger worker sets, as only four (6.78%) of the significant results occurred at the smaller worker sets, with the most occurring when W = 20, which produced ten significant improvements (seven of these ten were also at the smallest label sets). In addition, at W = 40, when MV tends to get close to an f1-score of 1.0, EM was still able to provide an advantage, although it should again be noted that six of the seven results occurred when G ≤ 3, with the remaining result when G = 5. A similar pattern can be seen for S = 250, where all seven of the significant results occurred at the smallest label sets; this was 17.5% of the 40 permutations where S = 250.
Figure 2: Results where EM significantly outperformed MV for the high expertise experiments, shown in green cells. Results with no significant difference have been omitted in the red cells.

Figure 3: Results where CT significantly outperformed MV, shown in green cells. Results with no significant difference have been omitted in the red cells.

Advantageous performance for EM tended to become more common as S grew, with both the S = 1000 and S = 2000 sample sizes producing 19 significant results: 47.5% of all permutations for their respective experiments. For the larger label sets, all three of the significant results were seen when S = 2000, with one of these when W = 18 and the other two when W = 20. Two of the experiments where EM outperformed MV are displayed in Fig 4a and Fig 4b, while one experiment where EM never outperformed MV is displayed in Fig 4c. In Fig 4a, all W ≥ 10 show a significant improvement for EM when compared to MV, with the larger worker sets also significant at the stricter threshold. In Fig 4b, EM is a significant improvement over MV for worker sets between 8 and 30. Finally, Fig 4c shows the G = 10 ∧ S = 2000 experiments, where EM never outperformed MV, as can be seen by the boxplots being of a similar size or bigger than their MV counterparts. Overall, there were 17 experiments significant at the stricter threshold, with 12 of these occurring at the larger worker sets with G = 2. Experiments where W = 40 ∧ G = 2 at the larger sample sizes produced four improvements significant at the stricter threshold.

For 55 (13.1%) of the 420 permutations, CT had a significant advantage over MV, a similar number to EM, although the instances of significance against MV tended to differ. All of the experiments that returned a significant result occurred at label set sizes up to G = 10, apart from one which appeared when G = 20. The incidence of significant experiments grows as more labels are added, peaking at G = 5 with 14 significant results, followed by a drop until G = 15, where there are no significant results; 58.9% of all significant results occurred when G = 5 ∨ G = 7. Occurrences of significant experiments tended to rise steadily as S grew up to S = 500, with S = 1000 and S = 2000 showing a similar number of significant occurrences to S = 500. Overall, there were six results significant at the stricter threshold.
Figure 4: Charts displaying two instances where EM performed well, and one where it did not offer any real improvement over MV. (a) Lower expertise experiment with two label choices and 2000 sample size from the Wisconsin Breast Cancer data set; the f1-score for EM when there are ten or more workers in a binary data set always outperforms MV. (b) Lower expertise experiment with three label choices and 1000 sample size from the 20 Newsgroups data set; with between 8 and 30 workers, EM outperforms MV. (c) Lower expertise experiment with ten label choices and 1998 sample size (S = 2000) from the 20 Newsgroups data set; here, EM did not manage to outperform MV, which was common for many of the lower expertise experiments with a large label choice size.
The pattern in G differs from EM, which saw a much steeper drop in significant results as G grew, as can be seen in Fig 5a. The pattern of significant occurrences across S and W is somewhat similar between the two methods when comparing f1-score distributions to those of MV - although for S, EM appears to benefit more from a larger sample size, as can be seen in Fig 5b.

Overall, 27 (6.4%) of the 420 experiments produced a result where using CT or EM was significantly more favourable than the other, with 14 of these for CT and 13 for EM. However, of the 14 for CT, only 6 were results where CT was also more advantageous than using MV. All 13 of the EM results that were significant against CT were also significant against MV, and all occurred when G = 2 at the larger worker sets. Five of the 14 results where CT was better than EM occurred when S = 50. Overall, in 1.4% of instances CT was the standout method to use and in 3.1% of instances EM was the standout method. In no experiments did MV come out as the clear winning method.
Figure 5: Charts displaying counts of significant experiments by number of labels in the set, sample size and number of workers in the set, when comparing the results of EM and CT against MV. (a) Instances where EM and CT f1-scores significantly outperformed MV with respect to the number of label choices; EM is much better suited to fewer label choices, while CT peaks when there are five label choices, and neither performs that much better with a large label set. (b) Instances where EM and CT significantly outperformed MV with respect to the sample size; CT and EM are more likely to produce an advantage over MV as the sample size grows. (c) Instances where EM and CT significantly outperformed MV with respect to the number of workers in the set; CT and EM have a similar pattern, and neither is strong in performance compared to MV when there are few workers.
Whereas CT and EM do offer advantages over a simple MV, this depends on the parameters of the labelling task. EM is more likely to produce better results if the label set is binary, whereas for CT it is harder to determine a general rule of thumb for when it is the outright best method to use, although it does seem effective with moderately sized label sets. Neither seems to give an advantage for large label sets, and the same is true for small worker sets, where a large variance in f1-score is seen for all methods. The experimental results show the relationship between the various parameters and these particular methods, and give a new understanding as to when they actually provide an advantage.

Overall, the results presented here should offer practitioners general guidance on when to utilise one of these methods rather than MV, depending on their choices of workers, expertise, labels and sample. Caution is, however, advised in that this is not a recommendation to base a labelling task around a particular method, or to simply choose one without consideration, but rather to see where there is an opportunity to use one once a labelling task has been completed. While previous research has looked at the optimal number of workers for ground-truth inference with multiple workers[22], the results suggest that the other task parameters can be just as important when attempting to maximise accuracy through methodology choice. Since label gathering can often be considered the most crucial aspect of any supervised machine learning project, the insight offered here provides practitioners with a considerable benefit when attempting to get the best possible data in an efficient manner.
If worker ability can be assured to be of high quality, then fewer workers will be required to reach a near-perfect ground-truth consensus, and the choice of method is not as crucial as it is with lower expertise workers. Even with only three workers, somewhat favourable labels are obtained when all can be assured to get at least 51% of answers correct when compared to the ground-truth. Although results get slightly better the more labels that are added, we often observe convergence to a perfect score with eight to ten workers, with anything beyond this not adding any real value. In situations where worker cost is high, our analysis indicates there is an advantage in pre-screening workers and gauging their ability, as a small quality check can remove noisy workers and focus on a team with more confidence in their ability. However, the cost of this screening and the hit rate of getting good workers need to be taken into consideration, and any screening task needs to be balanced enough that the expertise scores garnered for workers have a high degree of confidence. One interesting approach following any screening experiment would be to group workers into two separate groups - "strong" and "weak" - and compare their performance on a full task, recording the impact on the full task results.

With regards to lower expertise workers, we see that smaller worker numbers result in more variance in the final f1-score for all methods, as fewer than eight workers often means that the final results can become quite unstable. In addition, a high number of workers often results in convergence to a near-perfect f1-score for all methods, although even in these instances there are occasions with lower expertise workers where CT and EM can still provide an advantage over MV. Variance in the results does tend to reduce as the number of workers grows, although this reduction is not as sudden for the lower expertise worker sets as it is in the higher expertise experiments.

Both methods tended to produce their best results when there were ten or more workers, although incidences of better performance did not tend to grow in parallel with the number of workers, which could be attributed to other parameters having a bigger impact, or to the fact that more workers, on average, tended to produce more accurate results due to the crowd-averaging phenomenon. Whereas previous research has suggested that ten or eleven workers is the optimal number regardless of other factors[22], the results here demonstrate that employing slightly more workers than this can still offer some gains when the expertise of workers can drop below 50%. For smaller worker sets, often seen in specialist knowledge settings where there are fewer available resources[19], the results indicate the challenges they face, in that the final consensus results can vary to a large degree, and none of the methods used in these experiments offers a robust solution to inferring ground-truth.

For binary tasks, or even some tasks with three labels in the taxonomy, the choice of EM is clear, as it can often be seen to provide an advantage when this label parameter is met. If the label count is low, there can be reasonable assurance that EM will offer an advantage in inferring ground-truth, so it is recommended to consider it.
The method does not appear to scale as well to the number of labels in a set; this causes an issue for real-world applications, where large label taxonomies are common due to the complex problems practitioners attempt to solve, as the potential advantages are sparse outside of tasks with fewer than seven labels.

On the other hand, CT is a method that generally offers advantages over MV across a broader range of task parameters - although it can be hard to provide a general rule of thumb for when it is best used, unlike EM. Whereas it does offer some advantages at smaller label taxonomy sizes, most of its performance enhancements over MV are seen with slightly larger taxonomies where there are five to seven labels to choose from. One of the curious aspects of CT is seen in the lower expertise experiments for binary label tasks and three workers, as it produced exactly the same results as MV every time, suggesting that it mimics MV with the minimum number of workers (three) and labels (two) for a task, rendering it redundant. Recommending it as a way of improving results over both other methods is also tricky, as there were very few occasions when it was the standout method, and when it was, these tended to be sparsely scattered across the parameter experiments. It is clear that CT is a useful tool and can offer advantages in inferring ground-truth, but giving a definitive idea of when to use it to gain the above advantages is not entirely obvious.

Overall, we discovered that large label taxonomies tend to mean that the choice of method for ground-truth inference is redundant. This may be related to the per-class subset sample size, as it can be seen from the overall sample sizes that CT and EM both have some sensitivity to this. Keeping in mind that it is usually the goal to accrue as much data as possible, EM seems to be even more sensitive to the overall sample size, leading to the conclusion that a minimum sample size of 500-1000 can increase the possibility that a particular method will help boost ground-truth inference. This is particularly curious, as the example in Dawid & Skene[32] bases its work on an, albeit contrived, sample size of 45.

Considerations over sample size are particularly true for CT, where it is recommended to use thresholds on the media unit annotation score to remove noisy examples, which can help aid training data for a model[31]. Thresholds that remove too many labels could, in theory, be costly to CT, meaning the method is unlikely to offer substantial benefit over a standard MV when compared to the results of the experiments conducted here. Further experimentation with even larger sample sizes could prove beneficial in understanding whether the occurrence of significant improvements increases in likelihood with respect to the other parameters, or whether it tails off to a similar number for each sample set, as can be seen in the sample sets of 1000 and 2000 for both CT and EM.
This also raises questions about label distributions and minority classes, in particular how both of these methods handle them and whether some instances of beneficial performance could actually be explained by sub-samples of each label. As in much research, the two data sets used here tended to have far more desirable distributions than many real-life equivalents, and analysis of the impact of label imbalance could add to the knowledge acquired here in informing others on how to infer consensus in that scenario. Indeed, it could be theorised that, since both EM and CT produce statistics related to each label, comparing the same statistic between a minority class and a majority class is not entirely fair. Whereas these experiments have focused on overall performance, examining the performance of each method on each class, with controls on class distributions where minority classes are introduced, is a possible avenue for future research to assess how robust each method is when presented with these issues. Taking EM as an example, the error rate square matrix produced for each worker will likely be impacted by imbalanced label distributions, and any gains observed in the overall picture may be massively skewed by the dominant classes, whereas minority classes are often included due to their high importance.
Here, we have concluded that each method has its advantages and drawbacks. However, as a first choice we recommend:

1. EM is a better choice for smaller label sets, while CT is more favourable for medium-sized label sets. Neither method offers much advantage when the label set size is large.

2. Employing 10-20 workers will not only result in a good balance between overall accuracy and efficiency, but also increases the likelihood that CT or EM can produce more accurate ground-truth inference than MV.

3. None of the methods can provide much advantage when there are fewer than ten workers, where accuracy also tends to be less stable, so alternative methods should be sought to mitigate any potential errors in the label sets.

4. High average worker expertise means that fewer workers are needed to gain a better consensus score, although it also means that the methodology for inferring ground-truth is not as important.

5. Practitioners should aim to get each of their workers to provide an answer to as much of the sample as possible if they want to reap the potential advantages of CT or EM, as both benefit from a larger sample.

From the experiments, a loose rubric on these popular methods is now available to help not only guide data-gathering exercises, but also inform others of opportunities to enhance ground-truth inference when the parameters of their exercise align with those in the experiments of this paper. In this paper we have seen that CT and EM are only likely to be advantageous under certain parameter conditions, which was not previously understood. The results provide researchers with new insight into the parameter relationships and how they might impact accuracy on a task. This is particularly important for any supervised learning task, as the accuracy of the labels associated with the data is crucial to the success of any model, and every opportunity to increase this accuracy should be taken.

The experimental set-up in this paper can help guide future research in testing other methods to see the parameter conditions they work best under, as well as providing researchers with a framework to establish new methods that can attempt to resolve some of the shortcomings of the methods used in these experiments.
References

[1] Yuji Roh, Geon Heo, and Steven Euijong Whang. A survey on data collection for machine learning: A big data - AI integration perspective. ArXiv, abs/1811.03402, 2018.
[2] Olga Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhiheng Huang, A. Karpathy, A. Khosla, Michael S. Bernstein, A. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211-252, 2015.
[3] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254-263, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
[4] John Le, Andy Edmonds, Vaughn Hester, and Lukas Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In SIGIR 2010 Workshop, pages 21-26, 2010.
[5] Michael Giancola, Randy C. Paffenroth, and Jacob Whitehill. Permutation-invariant consensus over crowdsourced labels. In HCOMP, 2018.
[6] H. Halpin and R. Blanco. Machine-learning for spammer detection in crowd-sourcing. In HCOMP@AAAI, 2012.
[7] Vikas C. Raykar and S. Yu. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J. Mach. Learn. Res., 13:491-518, 2012.
[8] Srikanth Jagabathula, L. Subramanian, and Ashwin V. Venkataraman. Identifying unreliable and adversarial workers in crowdsourced labeling tasks. J. Mach. Learn. Res., 18:93:1-93:67, 2017.
[9] G. Wang, Tianyi Wang, H. Zheng, and B. Zhao. Man vs. machine: Practical adversarial detection of malicious crowdsourcing workers. In USENIX Security Symposium, 2014.
[10] Wei Lee, Chien-Wei Chang, Po-An Yang, Chi-Hsuan Huang, Ming-Kuang Wu, Chu-Cheng Hsieh, and K. Chuang. Effective quality assurance for data labels through crowdsourcing and domain expert collaboration. In EDBT, 2018.
[11] Lora Aroyo and Chris Welty. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. 2013.
[12] S. Benbadis. The tragedy of over-read EEGs and wrong diagnoses of epilepsy. Expert Review of Neurotherapeutics, 10:343-346, 2010.
[13] M. Graber. The incidence of diagnostic error in medicine. BMJ Quality and Safety, 22:ii21-ii27, 2013.
[14] J. Ehrich, E. Somekh, and M. Pettoello-Mantovani. The importance of expert opinion-based data: Lessons from the European Paediatric Association/Union of National European Paediatric Societies and Associations (EPA/UNEPSA) research on European child healthcare services. The Journal of Pediatrics, 195:310-311, 311.e1, 2018.
[15] D. Karimi, Haoran Dou, S. Warfield, and A. Gholipour. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical Image Analysis, 65:101759, 2020.
[16] Mohamad Dolatshah, M. Teoh, Jiannan Wang, and J. Pei. Cleaning crowdsourced labels using oracles for statistical classification. Proc. VLDB Endow., 12:376-389, 2018.
[17] Vikas C. Raykar, S. Yu, L. Zhao, G. Hermosillo, Charles Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.
[18] C. Hube, B. Fetahu, and U. Gadiraju. LimitBias! Measuring worker biases in the crowdsourced collection of subjective judgments (short paper). In SAD/CrowdBias@HCOMP, 2018.
[19] Hamed Valizadegan, Quang Nguyen, and M. Hauskrecht. Learning classification models from multiple experts. Journal of Biomedical Informatics, 46(6):1125-1135, 2013.
[20] C. Brodley and M. Friedl. Identifying mislabeled training data. J. Artif. Intell. Res., 11:131-167, 1999.
[21] Yudian Zheng, G. Li, Yuanbing Li, C. Shan, and Reynold Cheng. Truth inference in crowdsourcing: Is the problem solved? Proc. VLDB Endow., 10:541-552, 2017.
[22] Arthur Carvalho, S. Dimitrov, and K. Larson. How many crowdsourced workers should a requester hire? Annals of Mathematics and Artificial Intelligence, 78:45-72, 2016.
[23] P. Zhang and Z. Obradovic. Learning from inconsistent and unreliable annotators by a Gaussian mixture model and Bayesian information criterion. In ECML/PKDD, 2011.
[24] P. Welinder, S. Branson, Serge J. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
[25] Yan Yan, Rómer Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and Jennifer G. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In AISTATS, 2010.
[26] Indranil Balki, Afsaneh Amirabadi, Jacob Levman, Anne L. Martel, Ziga Emersic, Blaz Meden, Angel Garcia-Pedrero, Saul C. Ramirez, Dehan Kong, Alan R. Moody, and Pascal N. Tyrrell. Sample-size determination methodologies for machine learning in medical imaging research: A systematic review. Canadian Association of Radiologists Journal, 70(4):344-353, 2019. PMID: 31522841.
[27] James Surowiecki. The Wisdom of Crowds. Anchor, 2005.
[28] A. Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, and Chris Welty. CrowdTruth 2.0: Quality metrics for crowdsourcing with disagreement (short paper). In SAD/CrowdBias@HCOMP, 2018.
[29] Oana Inel, Khalid Khamkham, T. Cristea, A. Dumitrache, Arne Rutjes, J. v. d. Ploeg, Lukasz Romaszko, Lora Aroyo, and Robert-Jan Sips. CrowdTruth: Machine-human computation framework for harnessing disagreement in gathering annotated data. In International Semantic Web Conference, 2014.
[30] Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Mag., 36:15-24, 2015.
[31] A. Dumitrache, Oana Inel, Benjamin Timmermans, C. Martinez-Ortiz, Robert-Jan Sips, Lora Aroyo, and Chris Welty. Empirical methodology for crowdsourcing ground truth. ArXiv, abs/1809.08888, 2018.
[32] A. P. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics), 28:20-28, 1979.
[33] Panagiotis G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In HCOMP '10, 2010.
Supplement
Results from the experiments in this paper can be found at: https://github.com/Caspian-Ltd