Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems
Journal of Artificial Intelligence Research 1 (2016) 1-30 Submitted 6/16; published 9/16
Matteo Venanzi [email protected]
Microsoft, London, 2 Waterhouse Square, EC1N 2ST UK
John Guiver [email protected]
Pushmeet Kohli [email protected]
Microsoft Research, Cambridge, 21 Station Road CB1 2FB UK
Nicholas R. Jennings [email protected]
Imperial College, London, South Kensington SW7 2AZ UK
Abstract
Crowdsourcing systems commonly face the problem of aggregating multiple judgments provided by potentially unreliable workers. In addition, several aspects of the design of efficient crowdsourcing processes, such as defining workers' bonuses, fair prices and time limits of the tasks, involve knowledge of the likely duration of the task at hand. Bringing this together, in this work we introduce a new time-sensitive Bayesian aggregation method that simultaneously estimates a task's duration and obtains reliable aggregations of crowdsourced judgments. Our method, called BCCTime, builds on the key insight that the time taken by a worker to perform a task is an important indicator of the likely quality of the produced judgment. To capture this, BCCTime uses latent variables to represent the uncertainty about the workers' completion time, the tasks' duration and the workers' accuracy. To relate the quality of a judgment to the time a worker spends on a task, our model assumes that each task is completed within a latent time window within which all workers with a propensity to genuinely attempt the labelling task (i.e., no spammers) are expected to submit their judgments. In contrast, workers with a lower propensity to valid labelling, such as spammers, bots or lazy labellers, are assumed to perform tasks considerably faster or slower than the time required by normal workers. Specifically, we use efficient message-passing Bayesian inference to learn approximate posterior probabilities of (i) the confusion matrix of each worker, (ii) the propensity to valid labelling of each worker, (iii) the unbiased duration of each task and (iv) the true label of each task. Using two real-world public datasets for entity linking tasks, we show that BCCTime produces up to 11% more accurate classifications and up to 100% more informative estimates of a task's duration compared to state-of-the-art methods.
1. Introduction
Crowdsourcing has emerged as an effective way to acquire large amounts of data that enables the development of a variety of applications driven by machine learning, human computation and participatory sensing systems (Kamar, Hacker, & Horvitz, 2012; Bernstein, Little, Miller, Hartmann, Ackerman, Karger, Crowell, & Panovich, 2010; Zilli, Parson, Merrett, & Rogers, 2014). Services such as Amazon Mechanical Turk (AMT), oDesk and CrowdFlower have enabled a number of applications to hire pools of human workers to provide data for training image annotation (Whitehill, Ruvolo, Wu, Bergsma, & Movellan, 2009; Welinder, Branson, Belongie, & Perona, 2010), galaxy classification (Kamar et al., 2012) and information retrieval systems (Alonso, Rose, & Stewart, 2008). In such applications, a central problem is to deal with the diversity of accuracy and speed that workers exhibit when performing crowdsourcing tasks. As a result, due to the uncertainty over the reliability of individual crowd responses, many systems collect many judgments from different workers to achieve high confidence in the quality of their labels. However, this can incur a high cost either in time or money, particularly when the workers are paid per judgment, or when a delay in the completion of the entire crowdsourcing project is introduced when workers intentionally delay their submissions to follow their own work schedule. For example, in a typical crowdsourcing scenario, a requester must specify the number of requested assignments (i.e., individual responses from different workers), as well as the time limit for the completion of each assignment. He must also set the price to be paid for each response, which usually includes a participation fee and a bonus based on the quality of the submission and the actual effort required by the task.
However, it is a non-trivial problem to set a time limit that gives the workers sufficient time to perform the task correctly without leading to task starvation (i.e., no one working on the task after being assigned). Generally speaking, knowledge of the actual duration of each assignment (task instance) is useful to the requesters for various reasons. First, a task's duration can be used as a proxy to estimate its difficulty, as more difficult tasks usually take longer to complete (Faradani, Hartmann, & Ipeirotis, 2011). Second, this information is useful to set the time limit of a task and to reduce the overall time of task completion. Third, a task requester can use the task duration to pay fair bonuses to workers based on the difficulty of the task they complete. When seeking to estimate this information, however, it is important to consider that some workers might not perform a task immediately and might delay their submissions after accepting the task or, at the other extreme, might submit a poor annotation in rapid time (Kazai, 2011). As a result, common heuristic estimates of a task's duration (such as the workers' average or median completion time) that do not account for such aspects are likely to be inaccurate.

Given the above, there are a number of challenges to be addressed in the various steps of designing efficient crowdsourcing workflows. First, after all the judgments have been collected, the uncertainty about the unknown reliability of individual workers must be taken into account to compute the final labels. Such aggregated labels are often estimated in settings where the true answer of each task is never revealed, as this is the very quantity that the crowdsourcing process is trying to discover (Kamar et al., 2012). Second, when estimating a task's duration, the uncertainty over the completion time deriving from the private work schedule of a worker must be taken into account (?).
Third, these two challenges must be addressed simultaneously due to the interdependencies between the workers' reliability, the time required to complete each task, and the final labels estimated for such tasks.
5. A common guideline for task requesters is to consider $0.10 per minute to be the minimum wage for ethical crowdsourcing experiments.

In an attempt to address these challenges, there has been growing interest in developing algorithms and techniques to compute accurate labels while minimising the set of, possibly unreliable, crowd judgements (Sheng, Provost, & Ipeirotis, 2008). In more detail, simple solutions typically use heuristic methods such as majority voting or weighted majority voting (Tran-Thanh, Venanzi, Rogers, & Jennings, 2013). However, these methods do not consider the reliability of different workers and they treat all judgments as equally reliable. More sophisticated methods such as the one-coin model (Karger, Oh, & Shah, 2011), GLAD (Whitehill et al., 2009), CUBAM (Welinder et al., 2010), DS (Dawid & Skene, 1979) and the Bayesian Classifier Combination (BCC) (Kim & Ghahramani, 2012) use probabilistic models that do take reliabilities into account, as well as the potential labelling biases of the workers, e.g., the tendency for a worker to consistently over or underrate items. In particular, DS represents the worker's skills based on a confusion matrix expressing the reliability of a worker for each possible class of objects. BCC works similarly to DS, but it also considers the uncertainty over the confusion matrices and aggregated labels using a principled Bayesian learning framework. This representational power has enabled BCC to be successfully applied to a number of crowdsourcing applications including galaxy classification (Simpson, Roberts, Psorakis, & Smith, 2013), disaster response (Ramchurn, Huynh, Ikuno, Flann, Wu, Moreau, Jennings, Fischer, Jiang, Rodden, et al., 2015) and sentiment analysis (Simpson, Venanzi, Reece, Kohli, Guiver, Roberts, & Jennings, 2015).
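For concreteness, the majority-voting heuristic mentioned above can be sketched in a few lines (a minimal illustration with an assumed dict-based data layout, not any of the cited implementations):

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate per-task judgments by taking the most frequent label.

    judgments: dict mapping a task id to the list of labels submitted by
    different workers. Every judgment is treated as equally reliable,
    which is exactly the weakness noted above; ties are broken
    arbitrarily by Counter ordering.
    """
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in judgments.items()}

votes = {"t1": [1, 0, 1, 1], "t2": [0, 0, 1]}
labels = majority_vote(votes)  # {'t1': 1, 't2': 0}
```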
More recently, Venanzi, Guiver, Kazai, Kohli, and Shokouhi (2014) proposed a community-based extension of BCC (i.e., CBCC) to improve predictions by leveraging groups of workers with similar confusion matrices. Similarly, Simpson et al. (2015) combined BCC with language modelling techniques for automated text sentiment analysis using crowd judgments. This degree of applicability and performance of BCC-based methods is a promising point of departure for developing new data aggregation methods for crowdsourcing systems. However, none of the existing methods can reason about the workers' completion time to learn the duration of a task outsourced to the crowd. Moreover, all these methods can only learn their probabilistic models from the information contained in the judgment set. Unfortunately, this strategy is challenged by datasets that can be arbitrarily sparse, i.e., the workers only provide judgments for a small subset of tasks, and therefore the judgments only provide weak evidence of the accuracy of a worker. In such contexts, it is our hypothesis that a wider set of features must be leveraged to learn more reliable crowdsourcing models. In this work, we focus on the time it takes a worker to complete a task, considered as a key indicator of the quality of his work. Importantly, the information about the workers' completion time is made available by all the most popular crowdsourcing platforms, including AMT, the Microsoft Universal Human Relevance System (UHRS) and CrowdFlower. Therefore, we seek to efficiently combine these features into a data aggregation algorithm that can be naturally integrated with the output data produced by these platforms. In more detail, we present a novel time-sensitive data aggregation method that simultaneously estimates the tasks' duration and obtains reliable aggregations of crowdsourced judgments.
The characteristic of time-sensitivity of our method relates to its ability to jointly reason about the workers' completion time together with the judgments in the data aggregation process. In detail, our method is an extension of BCC, which we term BCCTime. Specifically, it incorporates a newly developed time model that enables the method to leverage observations of the time spent by a worker on a task to best inform the inference of the final labels. As in BCC, we use confusion matrices to represent the labelling accuracy of individual workers. To model the granularity in the workers' time profiles, we use latent variables to represent the propensity of each worker to submit valid judgments. Further, to model the uncertainty of the duration of each task, we use latent thresholds to define the time interval within which the task is expected to be completed by all the workers with a high propensity to valid labelling. Then, using Bayesian message-passing inference, our method simultaneously infers the posterior probabilities of (i) the confusion matrix of each worker, (ii) the propensity to valid labelling of each worker, (iii) the true label of each task and (iv) the upper and lower bounds of the duration of each task. In particular, the latter represent a reliable estimate of the likely duration of a task, obtained by automatically filtering out all the contributions of the workers with a low propensity to valid labelling. We demonstrate the efficacy of our method using two commonly-used public datasets that relate to an important Natural Language Processing (NLP) application of crowdsourcing entity linking tasks. In these datasets, our method achieves up to 11% more accurate classifications compared to seven state-of-the-art methods.
Further, we show that our tasks' duration estimates are up to 100% more informative than the common heuristics that do not consider the workers' completion time as correlated to the quality of their judgments.

Against this background, we make the following contributions to the state of the art.

• Through an analysis of two real-world datasets for crowdsourcing entity-linking tasks, we show the existence of different types of task-specific quality-time trends, e.g., increasing, decreasing or invariant trends, between the quality of the judgments and the time spent by the workers to produce them. We also re-confirm existing results showing that the workers who submit judgments too quickly or too slowly over the entire task set typically provide lower quality judgments.

• We develop BCCTime: the first time-sensitive Bayesian aggregation model that leverages observations of a worker's completion time to simultaneously aggregate crowd judgments and infer the duration of each task as well as the reliability of each worker.

• We show that BCCTime outperforms seven of the most competitive state-of-the-art data aggregation methods for crowdsourcing, including BCC, CBCC, one-coin and majority voting, by providing up to 11% more accurate classifications and up to 100% more informative estimates of the task's duration.

The rest of the paper unfolds as follows. Section 2 describes our notation and the preliminaries of the Bayesian aggregation of crowd judgments. Section 3 details our time analysis of real-world datasets. Then, Section 4 formally introduces BCCTime and details its probabilistic inference. Section 5 presents its evaluation against the state of the art. Section 6 summarises the rest of the related work in the areas of data aggregation and time analysis of crowd-generated content, and Section 7 concludes.
2. Preliminaries
Consider a crowd of $K$ workers labelling $N$ objects into $C$ possible classes; all our symbols are listed in Table 1. Assume that $k$ submits a judgment $c_i^{(k)} \in \{1, \ldots, C\}$ for classifying an object $i$. Let $t_i$ be the unobserved true label of $i$. Then, suppose that $\tau_i^{(k)} \in \mathbb{R}^+$ is the time taken by $k$ to produce $c_i^{(k)}$. Let $J = \{c_i^{(k)} \mid \forall i = 1, \ldots, N, \forall k = 1, \ldots, K\}$ and $T = \{\tau_i^{(k)} \mid \forall i = 1, \ldots, N, \forall k = 1, \ldots, K\}$ be the sets containing all the judgments and the time spent by the workers, respectively.

Table 1: List of symbols.

$N$: Number of tasks
$K$: Number of workers
$C$: Number of true label values
$T$: Set of observed workers' completion times
$J$: Set of observed judgments
$t_i$: True label of the task $i$
$t$: Vector of all $t_i$
$c_i^{(k)}$: Judgment of $k$ for task $i$
$\tau_i^{(k)}$: Time spent by $k$ for judging the task $i$
$\pi^{(k)}$: Confusion matrix of $k$
$p$: Class proportions of all the tasks
$\psi_k$: Propensity of $k$ for making valid labelling attempts
$s$: Labelling probabilities of a general low-propensity worker
$\psi$: Vector of $\psi_k$, $\forall k = 1, \ldots, K$
$v_i^{(k)}$: Boolean variable signalling if $c_i^{(k)}$ is a valid labelling attempt
$\sigma_i$: Lower-bound threshold of the duration of task $i$
$\lambda_i$: Upper-bound threshold of the duration of task $i$
$\sigma_0$: Mean hyperparameter of the Gaussian prior over $\sigma_i$
$\gamma$: Precision hyperparameter of the Gaussian prior over $\sigma_i$
$\lambda_0$: Mean hyperparameter of the Gaussian prior over $\lambda_i$
$\delta$: Precision hyperparameter of the Gaussian prior over $\lambda_i$
$\alpha$: True count hyperparameter of the Beta prior over $\psi_k$
$\beta$: False count hyperparameter of the Beta prior over $\psi_k$
$s_0$: Hyperparameter of the Dirichlet prior over $s$
$p_0$: Hyperparameter of the Dirichlet prior over $p$
$\pi_0^{(k)}$: Hyperparameter of the Dirichlet prior over $\pi^{(k)}$

We now introduce the key features of the BCC model that are relevant to our method. First introduced by Kim and Ghahramani (2012), BCC is a method that combines multiple judgments produced by independent classifiers (i.e., crowd workers) with unknown accuracy. Specifically, this model assumes that, for each task $i$, $t_i$ is drawn from a categorical distribution with parameters $p$:

$t_i \mid p \sim \mathrm{Cat}(t_i \mid p)$    (1)

where $p$ denotes the class proportions for all the objects.
Then, a worker's accuracy is represented through a confusion matrix $\pi^{(k)}$ comprising the labelling probabilities of $k$ for each possible true label value. Specifically, each row of the matrix, $\pi_c^{(k)} = \{\pi_{c,1}^{(k)}, \ldots, \pi_{c,C}^{(k)}\}$, is the vector where $\pi_{c,j}^{(k)}$ is the probability of $k$ producing the judgment $j$ for an object of class $c$. Importantly, this confusion matrix expresses both the accuracy (diagonal values) and the biases (off-diagonal values) of a worker. It can recognise workers who are particularly accurate (inaccurate) or have a bias for a specific class of objects. In fact, accurate (inaccurate) workers are represented through high (low) probabilities on the diagonal of the confusion matrix, whilst workers with a bias towards a particular class will have high probabilities in the corresponding column of the matrix. For example, in the galaxy zoo domain in which the workers classify images of celestial galaxies, the confusion matrices can detect workers who have low accuracy in classifying spiral galaxies or those who systematically classify every object as elliptical galaxies (Simpson et al., 2013).

To relate the worker's confusion matrix to the quality of a judgment, BCC assumes that $c_i^{(k)}$ is drawn from a categorical distribution with parameters corresponding to the $t_i$-th row of $\pi^{(k)}$:

$c_i^{(k)} \mid \pi^{(k)}, t_i \sim \mathrm{Cat}\big(c_i^{(k)} \mid \pi_{t_i}^{(k)}\big)$    (2)

This is equivalent to having a categorical mixture model over $c_i^{(k)}$ with $t_i$ as the mixture parameter and $\pi_c$ as the parameter of the $c$-th categorical component.
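To make the generative story of Eqs. (1)-(2) concrete, the following sketch forward-samples true labels and judgments in NumPy (a minimal illustration; the example class proportions and confusion matrices are our assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bcc(p, confusions, n_tasks):
    """Forward-sample the BCC generative model of Eqs. (1)-(2).

    p:          class proportions, shape (C,)
    confusions: per-worker confusion matrices, shape (K, C, C); row c of
                worker k holds the Cat parameters for judgments when the
                true label is c.
    Returns true labels t (n_tasks,) and judgments (K, n_tasks).
    """
    C = len(p)
    K = confusions.shape[0]
    t = rng.choice(C, size=n_tasks, p=p)          # t_i ~ Cat(p)
    judgments = np.empty((K, n_tasks), dtype=int)
    for k in range(K):
        for i in range(n_tasks):
            # c_i^(k) ~ Cat(pi^(k)_{t_i}): pick the row selected by t_i
            judgments[k, i] = rng.choice(C, p=confusions[k, t[i]])
    return t, judgments

# Two hypothetical workers: one accurate, one biased towards class 0.
p = np.array([0.7, 0.3])
pi = np.array([[[0.9, 0.1], [0.1, 0.9]],     # accurate worker
               [[0.95, 0.05], [0.8, 0.2]]])  # worker biased to class 0
t, J = sample_bcc(p, pi, n_tasks=5)
```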
Then, assuming that the judgments are independent and identically distributed (i.i.d.), the joint likelihood can be expressed as:

$p(\mathcal{C}, t \mid \pi, p) = \prod_{i=1}^{N} \mathrm{Cat}(t_i \mid p) \prod_{k=1}^{K} \mathrm{Cat}\big(c_i^{(k)} \mid \pi_{t_i}^{(k)}\big)$

Using conjugate Dirichlet prior distributions for the parameters $p$ and $\pi$ and applying Bayes' rule, the joint posterior distribution can be derived as:

$p(\pi, p \mid \mathcal{C}, t) \propto \mathrm{Dir}(p \mid p_0) \prod_{i=1}^{N} \Big\{ \mathrm{Cat}(t_i \mid p) \prod_{k=1}^{K} \mathrm{Cat}\big(c_i^{(k)} \mid \pi_{t_i}^{(k)}\big) \, \mathrm{Dir}\big(\pi_{t_i}^{(k)} \mid \pi_{t_i,0}^{(k)}\big) \Big\}$    (3)

From this expression, it is possible to derive the predictive posterior distributions of each unobserved (latent) variable using standard integration rules for Bayesian inference (Bishop, 2006). Unfortunately, the exact derivation of these posterior distributions is intractable for BCC due to the non-conjugate form of the model (Kim & Ghahramani, 2012). However, it has been shown that, particularly for BCC models, it is possible to compute efficient approximations of these distributions using standard techniques such as Gibbs sampling (Kim & Ghahramani, 2012), variational Bayes (Simpson, 2014) and Expectation Propagation (Venanzi et al., 2014). Building on this, several extensions of BCC have been proposed for various crowdsourcing domains (Venanzi et al., 2014; Simpson et al., 2015, 2013). In particular, CBCC applies community-based techniques to represent groups of workers with similar confusion matrices in the classifier combination process (Venanzi et al., 2014). This mechanism enables the model to transfer learning of a worker's reliability through the communities and so improve the quality of the inference.

However, a drawback of all these BCC-based models is that they do not learn the task's duration, nor do they consider any extra features other than the workers' judgments.
As a result, they perform the full learning of the confusion matrices and task labels using only the judgments produced by the workers. But, as mentioned earlier, this strategy is challenged by sparse datasets where each worker only labels a few tasks. This is the case, for instance, in the CrowdFlower dataset used in the 2013 CrowdScale Shared Task challenge, where the sentiment of 98,980 tweets was classified by 1,960 workers over five sentiment classes. In this dataset, 30% of the workers judged only 15 tweets, i.e., 0.015% of the total samples, and there is a long tail of workers with fewer than 3 judgments.
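As a rough illustration of the Gibbs-sampling approximation mentioned above, the following sketch alternates conjugate updates for the class proportions, the confusion matrices and the true labels; missing judgments are marked with -1 to reflect the sparsity just discussed. This is a minimal sketch of the technique with symmetric Dirichlet hyperparameters of our choosing, not the implementation used in the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_bcc(J, C, n_iter=80, burn=40, p0=1.0, pi0=1.0):
    """Minimal Gibbs sampler for the BCC posterior of Eq. (3).

    J: judgment matrix of shape (K, N); J[k, i] = -1 if worker k did not
       judge task i (sparsity), else a label in {0..C-1}.
    Returns the approximate marginal posterior over true labels (N, C).
    """
    K, N = J.shape
    t = rng.integers(C, size=N)                 # random initialisation
    label_probs = np.zeros((N, C))
    for it in range(n_iter):
        # Sample p | t (Dirichlet-categorical conjugacy).
        p = rng.dirichlet(p0 + np.bincount(t, minlength=C))
        # Sample each worker's confusion matrix rows | t, J.
        pi = np.empty((K, C, C))
        for k in range(K):
            for c in range(C):
                seen = (t == c) & (J[k] >= 0)
                counts = np.bincount(J[k, seen], minlength=C)
                pi[k, c] = rng.dirichlet(pi0 + counts)
        # Sample t_i | p, pi, J for each task.
        for i in range(N):
            logp = np.log(p)
            for k in range(K):
                if J[k, i] >= 0:
                    logp = logp + np.log(pi[k, :, J[k, i]])
            prob = np.exp(logp - logp.max())
            prob /= prob.sum()
            t[i] = rng.choice(C, p=prob)
            if it >= burn:
                label_probs[i] += prob
    return label_probs / (n_iter - burn)
```

Note that with symmetric priors the posterior is invariant to label permutations, so a practical implementation would break this symmetry, e.g., with informative diagonal priors on the confusion matrices.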
3. Analysis of Workers’ Time Spent on Judgments
Having discussed the basic concepts of non-time-based data aggregation, we now turn to the analysis of the relationship between the time that workers spend on the task and the quality of the judgments they produce. In contrast to previous works in this area (Demartini, Difallah, & Cudré-Mauroux, 2012; Wang, Faridani, & Ipeirotis, 2011), we extend our analysis of quality-time responses both to specific task instances and to the entire task set. By so doing, we provide key insights to inform the design of our time-sensitive aggregation model. To this end, we consider two public datasets generated from a widely used NLP application of crowdsourcing entity linking tasks.

ZenCrowd - India (ZC-IN): This dataset contains a set of links between the names of entities extracted from news articles and uniform resource identifiers (URIs) describing the entity in Freebase and DBpedia (Demartini et al., 2012). The dataset was collected using AMT, with each worker being asked to classify whether a single URI was either irrelevant (0) or relevant (1) to a single entity. It contains the timestamps of the acceptance and the submission of each judgment. Moreover, gold standard labels were collected from expert editors for all the tasks. No information was released regarding the restrictions on the worker pool, although all workers are known to be living in India, and each worker was paid $0.01 per judgment. A total of 11,205 judgements were collected from a small pool of 25 workers, giving this dataset a moderately high number of judgements per worker, as detailed in Table 2. In particular, Figure 1a shows that the vast majority of tasks received 5 judgements, while Figure 1c shows a skewed distribution of gold labels, in which 78% of links between entities and URIs were classified by workers as irrelevant (0). As such, it is worth noting that any binary classifier with a bias towards the unrelated classification will correctly classify the majority of tasks and thus receive a high accuracy.
Therefore, as we will detail in Section 5, it is important to select accuracy metrics that evaluate the classifier across the whole spectrum of possible discriminant thresholds.

ZenCrowd - USA (ZC-US):
This dataset was also provided by Demartini et al. (2012) and contains judgements for the same set of tasks as ZC-IN, although the judgements were collected from AMT workers in the US. The same payment of $0.01 per judgement was used. However, a larger pool of 74 workers was involved, and as such a lower number of

Table 2: Crowdsourcing datasets for entity linking tasks.
Dataset   Judgements   Workers   Tasks   Labels   Judgement   Judgements   Judgements
                                                  accuracy    per task     per worker
ZC-IN     11205        25        2040    2        0.678       5.493        448.200
ZC-US     12190        74        2040    2        0.770       5.975        164.730
WS-AMT    6000         110       300     5        0.704       20.000       54.545

judgements were collected from each worker, as shown in Table 2. Furthermore, Figure 1b shows a similar distribution of judgements per task as the India dataset, although slightly fewer tasks received 5 judgements, with most of the remaining tasks receiving 3-4 judgements or 9-11 judgements. The judgement accuracy of the US dataset is higher than that of the India dataset, despite an identical crowdsourcing system and reward mechanism being used.

[Figure 1: Histograms of the number of judgments per task for ZC-IN (a) and ZC-US (b) – WS-AMT is not shown because the tasks received exactly 20 judgments – and the number of tasks per gold label for ZC (c) and WS-AMT (d).]

Weather Sentiment - AMT (WS-AMT):
The Weather Sentiment dataset was provided by CrowdFlower for the 2013 Crowdsourcing at Scale shared task challenge. It includes 300 tweets with 1,720 judgements from 461 workers and has been used in several
experimental evaluations of crowdsourcing models (Simpson et al., 2015; Venanzi et al., 2014; ?). In detail, the workers were asked to classify the sentiment of tweets with respect to the weather into the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3) and can't tell (4). As a result, this dataset pertains to a multi-class classification problem. However, the original dataset used in the shared task challenge did not contain any time information about the collected judgments. Therefore, a new dataset (WS-AMT) was recollected for the same tasks as in the CrowdFlower shared task dataset using the AMT platform, acquiring exactly 20 judgements and recording the elapsed time for each judgment (?). As a result, WS-AMT contains 6,000 judgements from 110 workers, as shown in Table 2. No restrictions were placed on the worker pool and each worker was paid $0.03 per judgement. Furthermore, Figure 1d shows that, as per the original dataset, the most common gold label is unrelated, while only five tasks were assigned the gold label can't tell.

[Figure 2: Histograms of the precision and recall binned by the time spent by the US workers (a, b) and Indian workers (c, d) in the ZenCrowd datasets.]

We wish to analyse the distribution of the workers' completion time and the judgments' accuracy. To do so, we focus on the two datasets with binary labels, ZC-US and ZC-IN. In fact, the binary nature of these two datasets allows us to analyse accuracy at a higher level of
[Figure 3: Histograms of the recall for six entity linking tasks with positive gold standard labels and at least ten judgments in the ZenCrowd datasets. They show the different trends of recall-time curves for various tasks: (a) time-increasing, ZC-US, entity "SouthernAvenue", link freebase.com/m/03hkhgs; (b) time-increasing, ZC-IN, entity "American", link freebase.com/united_states; (c) time-decreasing, ZC-US, entity "European", link dbpedia.org/page/European; (d) time-decreasing, ZC-IN, entity "Swiss", link dbpedia.org/page/Switzerland; (e) time-constant, ZC-US, entity "Switzerland", link dbpedia.org/page/Switzerland; (f) time-constant, ZC-US, entity "GMT", link dbpedia.org/page/Greenwich_Mean_Time.]

detail, i.e., in terms of the precision and recall of the workers' judgments and the time spent to produce them. Specifically, Figure 2 shows the cumulative distribution of the precision and the recall of the set of judgments selected by a specific time threshold (x-axis) with respect to the gold standard labels. Here, the precision is the fraction of true positive classifications over all the returned positive classifications (true positives + false positives) and the recall is the number of true positive classifications divided by the number of all the positive samples. Similarly to Demartini et al. (2012), we find that the accuracy is lower at the extremes of the time distributions. In ZC-US, both the precision and recall are higher for the subset of judgments that were produced in more than 80 seconds and less than 1500 seconds. In
ZC-IN, the precision and recall are higher for judgments produced in more than 80 seconds and less than 600 seconds.

In addition, Figure 3 shows the distribution of the recall and execution time for a sample set of six positive task instances (i.e., entities with positive gold standard labels) with at least ten judgments. For example, Figure 3b shows the time distribution of the judgments for the URI freebase.com/united_states associated with the entity "American". In these graphs, some samples have an increasing quality-time curve, i.e., workers spending more time produce better judgments (Figure 3a and Figure 3b). Other samples have a decreasing quality-time curve, i.e., workers spending more time produce worse judgments (Figure 3c and Figure 3d). Finally, the last two samples have an approximately constant quality-time curve, i.e., the workers' quality is invariant to the time spent (Figure 3e and Figure 3f). It can also be seen that these trends naturally correlate with the difficulty of each task instance. For instance, the URI freebase.com/m/03hkhgs linked to the entity "Southern Avenue" is more difficult to judge than the URI dbpedia.org/page/Switzerland linked to the entity "Switzerland". In fact, "Southern Avenue" is more ambiguous as an entity name, which may lead the worker to open the URI and check its content to be able to issue a correct judgment. Instead, the relevance of the second entity "Switzerland" can be judged more easily through visual inspection of the URI. In addition, each task has a specific time interval that includes the subset of judgments with the highest precision.

[Figure 4: The Pearson's correlation coefficient (a) and the p-value (b) of the linear correlation between the workers' completion time and the judgment accuracy for the 13 entity linking tasks with positive gold standard labels and more than ten judgments in the ZenCrowd datasets.]
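The cumulative precision and recall curves of Figure 2 follow directly from the definitions above; a minimal helper, with an assumed flat data layout of parallel lists rather than the paper's actual data format, might look like:

```python
def precision_recall_at(judgments, gold, times, max_time):
    """Precision and recall of the subset of binary judgments whose
    completion time is at most max_time seconds (as in Figure 2).

    judgments, gold, times: parallel lists; gold holds the expert labels.
    """
    kept = [(j, g) for j, g, s in zip(judgments, gold, times) if s <= max_time]
    tp = sum(1 for j, g in kept if j == 1 and g == 1)  # true positives
    fp = sum(1 for j, g in kept if j == 1 and g == 0)  # false positives
    fn = sum(1 for j, g in kept if j == 0 and g == 1)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `max_time` over increasing thresholds yields the cumulative curves binned by time spent.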
For example, in ZC-IN, the judgments with the highest precision for the URI dbpedia.org/page/Switzerland and the entity "Switzerland" were submitted between 5 sec. and 20 sec. (Figure 3d). Instead, in ZC-US, the best judgments for the URI dbpedia.org/page/European linked to the entity "European" were submitted in the interval of 2 sec. and 16 sec. (Figure 3c). As a result, it is clear that each task instance has a specific quality-time profile that relates to the difficulty of labelling that instance.

To better analyse these trends, Figure 4 shows the Pearson's correlation coefficient ($\rho$), i.e., a standard measure of the degree of linear correlation between two variables, for all the 13 entities with positive links and more than ten judgments across the two datasets. The time spent by the worker is not always (linearly) correlated with the quality of the judgment across all the task instances. Some tasks have a significantly positive correlation (i.e., task index = 6, 8, 13), while others exhibit a significantly negative correlation (Figure 4b).
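The per-task correlation analysis of Figure 4 reduces to computing Pearson's rho between completion times and judgment correctness. A minimal sketch with hypothetical data, and without the p-value computation reported in the figure (which would need e.g. scipy.stats.pearsonr):

```python
import numpy as np

def quality_time_correlation(times, correct):
    """Pearson's rho between completion time and judgment correctness
    (correct is 0/1 per judgment) for one task instance, as in Figure 4.
    """
    return float(np.corrcoef(times, correct)[0, 1])

# A hypothetical "time-increasing" task: slower judgments are correct.
rho = quality_time_correlation([3, 5, 8, 20, 40], [0, 0, 1, 1, 1])
```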
4. The BCCTime Model
Based on the above results of the time analysis of workers' judgments, we observed that different types of quality-time trends occur for specific task instances. However, the standard BCC, as well as all the other existing aggregation models that do not consider this information, is unable to perform inference over the likely duration of a task. To rectify this, there is a need to extend BCC to include these trends in the aggregation of crowd judgments. To this end, the model must be flexible enough to identify workers who, in addition to having imperfect skills, may also not have the intention to make a valid attempt to complete a task. This further increases the uncertainty about data reliability. In this section, we describe our Bayesian Classifier Combination model with Time (BCCTime). In particular, we describe the three components of the model concerning (i) the representation of the unknown workers' propensity to valid labelling, (ii) the reliability of workers' judgments and (iii) the uncertainty in the workers' completion time, followed by the details of its probabilistic inference.
Given the uncertainty about the intention of a worker to submit valid judgments, we introduce the latent variable $\psi_k \in [0, 1]$ representing the propensity of worker $k$ towards making a valid labelling attempt for any given task. In this way, the model is able to naturally explain the unreliability of a worker based not only on her imperfect skills but also on her attitude towards approaching a task correctly. In particular, $\psi_k$ close to one means that the worker has a tendency to exert her best effort to provide valid judgments, even though her judgments might still be noisy as a consequence of her imperfect skills. In contrast, $\psi_k$ close to zero means that the worker tends not to provide valid judgments for her tasks, i.e., she behaves similarly to a spammer. Specifically, only the workers with a high propensity to valid labelling will provide inputs that are meaningful for estimating the task's true label and the task's duration. To capture this, we define a per-judgment boolean variable $v_i^{(k)} \in \{0, 1\}$, with $v_i^{(k)} = 1$ meaning that $k$ has made a valid labelling attempt when submitting $c_i^{(k)}$ and $v_i^{(k)} = 0$ meaning that $c_i^{(k)}$ is an invalid annotation. In this setting, the number of valid labelling attempts made by the worker derives from her propensity to valid labelling. Thus, we model $v_i^{(k)}$ as a random draw from a Bernoulli distribution parametrised by $\psi_k$:

$$ v_i^{(k)} \sim \mathrm{Bernoulli}(\psi_k) \qquad (4) $$

That is, workers with a high propensity to valid labelling are more likely to make valid labelling attempts, whilst workers with a low propensity are more likely to submit spam annotations.

Here we describe the part of the model concerned with the generative process of crowd judgments from the confusion matrix and the propensity of the workers. Intuitively, only those judgments associated with valid labelling attempts should be considered to estimate the final labels.
This means that each judgment may be generated from two different processes depending on whether or not it comes from a valid labelling attempt. To capture this in the generative model of BCCTime, a mixture model is used to switch between these two cases conditioned on $v_i^{(k)}$. For the first case of a valid labelling attempt, i.e., $v_i^{(k)} = 1$, the judgment is generated through the worker's confusion matrix as per the standard BCC model. Therefore, we assume that $c_i^{(k)}$ is generated from the same model described for BCC (Eq. 2), including $v_i^{(k)}$ in the conditioning variables. Formally:

$$ c_i^{(k)} \,\big|\, \pi^{(k)}, t_i, v_i^{(k)} = 1 \;\sim\; \mathrm{Cat}\big(c_i^{(k)} \,\big|\, \pi^{(k)}_{t_i}\big) \qquad (5) $$

For the second case of a judgment produced from an invalid labelling attempt, i.e., $v_i^{(k)} = 0$, it is natural to assume that the judgment does not contribute to the estimation of the true label. Formally, this assumption can be represented through a general random vote model in which $c_i^{(k)}$ is drawn from a categorical distribution with a vector parameter $\mathbf{s}$:

$$ c_i^{(k)} \,\big|\, \mathbf{s}, v_i^{(k)} = 0 \;\sim\; \mathrm{Cat}\big(c_i^{(k)} \,\big|\, \mathbf{s}\big) \qquad (6) $$

Here $\mathbf{s}$ is the vector of the labelling probabilities of a general worker with a low propensity to make valid labelling attempts. Notice that the equation above does not depend on $t_i$, which means that all the judgments coming from invalid labelling attempts are treated as noisy responses uncorrelated to $t_i$.

As shown in Section 3, the duration of a task may be defined as the interval in which workers are more likely to submit high-quality judgments. However, due to the dependency of the duration on the task's characteristics, the requirement is that such an interval must be non-constant across the tasks. To model this, we define a lower-bound threshold, $\sigma_i$, and an upper-bound threshold, $\lambda_i$, for the time interval representing the duration of task $i$. Both these per-task thresholds are latent variables that must be learnt at training time.
Then, the tasks with a lower or higher variability in their duration can be represented based on the values of their time thresholds. In this setting, all the valid labelling attempts made by the workers are expected to be completed within the task's duration interval delimited by these thresholds. Formally, we represent the probability of $\tau_i^{(k)}$ being greater than $\sigma_i$ using the standard greaterThan probabilistic factor introduced by Herbrich, Minka, and Graepel (2007) for the TrueSkill Bayesian ranking model:

$$ \mathbb{I}\big(\tau_i^{(k)} > \sigma_i \,\big|\, v_i^{(k)} = 1\big) \qquad (7) $$

This factor defines a non-conjugate relationship over $\sigma_i$, such that the posterior distribution of $\sigma_i$ is not in the same form as its prior distribution and therefore needs to be approximated. We do this via moment matching with a Gaussian distribution $\hat{p}(\sigma_i)$, by matching the precision and the precision-adjusted mean (i.e., the mean multiplied by the precision) to the posterior distribution $p(\sigma_i)$, as shown in Table 1 of Herbrich et al. (2007). In a similar way, we model the probability of $\tau_i^{(k)}$ being smaller than $\lambda_i$ as:

$$ \mathbb{I}\big(\lambda_i > \tau_i^{(k)} \,\big|\, v_i^{(k)} = 1\big) \qquad (8) $$

Drawing all this together, upon observing a set of i.i.d. pairs of judgments and workers' completion times contained in $\mathbf{J}$ and $\mathbf{T}$ respectively, we can express the joint likelihood of BCCTime as:

$$ p(\mathbf{J}, \mathbf{T}, \mathbf{t} \,|\, \boldsymbol{\pi}, \mathbf{p}, \mathbf{s}, \boldsymbol{\psi}) = \prod_{i=1}^{N} \mathrm{Cat}(t_i \,|\, \mathbf{p}) \Bigg\{ \prod_{k=1}^{K} \Big( \mathbb{I}(\tau_i^{(k)} > \sigma_i)\, \mathbb{I}(\lambda_i > \tau_i^{(k)})\, \mathrm{Cat}\big(c_i^{(k)} \,|\, \pi^{(k)}_{t_i}\big) \Big)^{\psi_k}\, \mathrm{Cat}\big(c_i^{(k)} \,|\, \mathbf{s}\big)^{(1 - \psi_k)} \Bigg\} \qquad (9) $$

The factor graph of BCCTime is illustrated in Figure 5. Specifically, the two shaded variables $c_i^{(k)}$ and $\tau_i^{(k)}$ are the observed inputs, while all the unobserved random variables are unshaded. The graph uses the gate notation (dashed box) introduced by Minka and Winn (2008) to represent the two mixture models of BCCTime.
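As a concrete illustration of this approximation (our sketch, not the authors' Infer.NET code), consider the greaterThan factor of Eq. (7) in isolation: with $\tau_i^{(k)}$ observed and a Gaussian belief on $\sigma_i$, the exact posterior over $\sigma_i$ is a Gaussian truncated above at the observed time, and its first two moments give the matching Gaussian:

```python
# Sketch: moment matching for the greaterThan factor I(tau > sigma_i).
from math import erf, exp, pi, sqrt

def pdf(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def match_threshold_posterior(mu, var, tau):
    """Moments of a N(mu, var) belief on sigma_i truncated to sigma_i < tau
    (the constraint implied by observing tau > sigma_i)."""
    s = sqrt(var)
    beta = (tau - mu) / s
    r = pdf(beta) / cdf(beta)
    new_mean = mu - s * r                       # mass shifts below tau
    new_var = var * (1.0 - beta * r - r * r)    # truncation shrinks the variance
    return new_mean, new_var
```

Matching the mean and variance is equivalent, up to reparametrisation, to matching the precision and precision-adjusted mean used in the text.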
Specifically, the outer gate represents the workers' judgments (see Section 4.2) and completion times (see Section 4.3) that are generated from either BCC or the random vote model, using $v_i^{(k)}$ as the gating variable. The inner gate is the mixture model generating the workers' judgments from the rows of the confusion matrix, using $t_i$ as the gating variable.

To perform Bayesian inference over all the unknown quantities, we must provide prior distributions for the latent parameters of BCCTime. Following the structure of the model,
Figure 5: The factor graph of BCCTime.

we can select conjugate distributions for all such parameters to enable a more tractable inference of their posterior probabilities. Therefore, the prior of $\mathbf{p}$ is Dirichlet distributed with hyperparameter $\mathbf{p}_0$:

$$ \text{(true label prior)} \quad \mathbf{p} \sim \mathrm{Dir}(\mathbf{p} \,|\, \mathbf{p}_0) \qquad (10) $$

The priors of $\mathbf{s}$ and $\pi^{(k)}_c$ are also Dirichlet distributed, with hyperparameters $\mathbf{s}_0$ and $\pi^{(k)}_{c,0}$ respectively:

$$ \text{(spammer label prior)} \quad \mathbf{s} \sim \mathrm{Dir}(\mathbf{s} \,|\, \mathbf{s}_0) \qquad (11) $$

$$ \text{(confusion matrix prior)} \quad \pi^{(k)}_c \sim \mathrm{Dir}\big(\pi^{(k)}_c \,\big|\, \pi^{(k)}_{c,0}\big) \qquad (12) $$

Then, $\psi_k$ has a Beta prior with true count $\alpha_0$ and false count $\beta_0$:

$$ \text{(worker's propensity prior)} \quad \psi_k \sim \mathrm{Beta}(\psi_k \,|\, \alpha_0, \beta_0) \qquad (13) $$

The two time thresholds $\sigma_i$ and $\lambda_i$ have Gaussian priors with means $\sigma_0$ and $\lambda_0$ and precisions $\gamma_0$ and $\delta_0$ respectively:

$$ \text{(lower-bound task duration threshold prior)} \quad \sigma_i \sim \mathcal{N}(\sigma_i \,|\, \sigma_0, \gamma_0) \qquad (14) $$

$$ \text{(upper-bound task duration threshold prior)} \quad \lambda_i \sim \mathcal{N}(\lambda_i \,|\, \lambda_0, \delta_0) \qquad (15) $$

Collecting all the hyperparameters in the set $\Theta = \{\mathbf{p}_0, \mathbf{s}_0, \alpha_0, \beta_0, \sigma_0, \gamma_0, \lambda_0, \delta_0\}$, we find by applying Bayes' theorem that the joint posterior distribution is proportional to:

$$ p(\boldsymbol{\pi}, \mathbf{p}, \mathbf{s}, \mathbf{t}, \boldsymbol{\psi} \,|\, \mathbf{J}, \mathbf{T}, \Theta) \;\propto\; \mathrm{Dir}(\mathbf{s} \,|\, \mathbf{s}_0)\, \mathrm{Dir}(\mathbf{p} \,|\, \mathbf{p}_0) \prod_{i=1}^{N} \Bigg\{ \mathrm{Cat}(t_i \,|\, \mathbf{p})\, \mathcal{N}(\sigma_i \,|\, \sigma_0, \gamma_0)\, \mathcal{N}(\lambda_i \,|\, \lambda_0, \delta_0) \prod_{k=1}^{K} \Big( \mathbb{I}(\tau_i^{(k)} > \sigma_i)\, \mathbb{I}(\lambda_i > \tau_i^{(k)})\, \mathrm{Cat}\big(c_i^{(k)} \,|\, \pi^{(k)}_{t_i}\big)\, \mathrm{Dir}\big(\pi^{(k)}_{t_i} \,\big|\, \pi^{(k)}_{t_i,0}\big) \Big)^{\psi_k} \mathrm{Cat}\big(c_i^{(k)} \,|\, \mathbf{s}\big)^{(1-\psi_k)}\, \mathrm{Beta}(\psi_k \,|\, \alpha_0, \beta_0) \Bigg\} \qquad (16) $$

From this expression, we can compute the marginal posterior distribution of each latent variable by integrating out all the remaining variables. Unfortunately, these integrations are intractable due to the non-conjugate form of our model. However, we can still compute approximations of such posterior distributions using standard techniques from the family of approximate Bayesian inference methods (Minka, 2001).
In particular, we use the well-known EP algorithm (Minka, 2001), which has been shown to provide good-quality approximations for BCC models (Venanzi et al., 2014). This method leverages a factorised distribution of the joint probability to approximate the marginal posterior distributions through an iterative message-passing scheme implemented on the factor graph. Specifically, we use the EP implementation provided by Infer.NET (Minka, Winn, Guiver, & Knowles, 2014), which is a standard framework for running Bayesian inference in probabilistic models. Using Infer.NET, we are able to train BCCTime on our largest dataset of 12,190 judgments within seconds, using approximately 80MB of RAM on a standard laptop.
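For intuition about the model that EP is fitting, the full generative process of Eqs. (4)-(16) can be sketched as a stdlib-only ancestral sampler. This is our illustration, not the authors' implementation: the hyperparameter values are invented; the Gaussian precisions are replaced by assumed standard deviations; the upper bound is forced above the lower bound for simplicity (the model instead uses an independent Gaussian prior); and, since the model constrains completion times only through the indicator factors, valid times are drawn uniformly inside the task's window while invalid ones get an outlier-prone exponential draw:

```python
import random

def dirichlet(alphas):
    """Dirichlet draw via normalised Gamma variates (no numpy needed)."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    z = sum(g)
    return [x / z for x in g]

def sample_bcctime(N=5, K=3, C=2, seed=1):
    """Ancestral sampler for a toy BCCTime-style generative process."""
    random.seed(seed)
    p = dirichlet([1.0] * C)                     # true-label prior, Eq. (10)
    s = dirichlet([1.0] * C)                     # spammer labels, Eq. (11)
    # Per-worker confusion matrices with a heavier diagonal, Eq. (12)
    pi = [[dirichlet([5.0 if c == j else 1.0 for j in range(C)])
           for c in range(C)] for _ in range(K)]
    psi = [random.betavariate(7.0, 3.0) for _ in range(K)]   # Eq. (13)
    data = []
    for i in range(N):
        t_i = random.choices(range(C), weights=p)[0]         # true label
        sigma_i = random.gauss(10.0, 3.0)                    # lower bound, Eq. (14)
        lam_i = sigma_i + abs(random.gauss(40.0, 10.0))      # upper bound, Eq. (15)
        for k in range(K):
            v = random.random() < psi[k]                     # valid attempt, Eq. (4)
            if v:    # judgment from the confusion-matrix row, Eq. (5)
                c = random.choices(range(C), weights=pi[k][t_i])[0]
                tau = random.uniform(sigma_i, lam_i)         # inside the window
            else:    # random-vote spam judgment, Eq. (6)
                c = random.choices(range(C), weights=s)[0]
                tau = random.expovariate(1.0 / 120.0)        # outlier-prone time
            data.append((i, k, c, tau, v))
    return data
```

Inference then inverts this process: given only the `(i, k, c, tau)` observations, BCCTime recovers posteriors over `t_i`, `psi`, `pi`, `sigma_i` and `lam_i`.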
5. Experimental Evaluation
Having described our model, we test its performance in terms of classification accuracy and ability to learn the tasks' duration in real crowdsourcing experiments. Using the datasets described in Section 3, we conduct experiments in the following experimental setup.
We consider a set of benchmarks consisting of three popular baselines (Majority voting, Vote distribution and Random) and three state-of-the-art aggregation methods (One coin, BCC and CBCC) that are commonly employed in crowdsourcing applications. In more detail:

• One coin: This method represents the accuracy of a worker with a single reliability parameter (or worker's coin), assuming that the worker returns the correct answer with the probability specified by the coin and the incorrect answer with the inverse probability. As a result, this method is only applicable to binary datasets. Crucially, this model represents the core mechanism of several existing methods, including (Whitehill et al., 2009; Demartini et al., 2012; Liu, Peng, & Ihler, 2012; Karger et al., 2011; Li, Zhao, & Fuxman, 2014).

10. Alternative inference methods such as Gibbs sampling or Variational Bayes can be trivially applied to our model in the Infer.NET framework.

• BCC:
This is the closest benchmark to our method and was described in Section 2. It learns the confusion matrices and the aggregated labels without considering the worker's completion time as an input feature. It has been used in several crowdsourcing contexts, including galaxy classification (Simpson et al., 2013), image annotation (Kim & Ghahramani, 2012) and disaster response (Ramchurn et al., 2015).

• BCCPropensity:
This is equivalent to BCCTime where only the workers' propensity is learnt. This benchmark is used to assess the contribution of inferring only the workers' propensity, versus learning it jointly with the tasks' time thresholds, to the quality of the final labels. Note that BCCPropensity is easy to obtain from BCCTime by fixing the time thresholds to static observations, with $\sigma_i = 0$ and $\lambda_i$ set to the maximum possible value.

• CBCC:
An extension of BCC that learns communities of workers with similar confusion matrices, as described in Section 2. Given a judgment set, CBCC is able to learn the confusion matrix of each community and each worker, as well as the task labels. This method has also been used in a number of crowdsourcing applications, including web search evaluation and sentiment analysis (Venanzi et al., 2014). In our experiments, we ran CBCC with the number of worker types set to two communities, in order to infer the two groups of more reliable and less reliable workers; similar results were observed for higher numbers of communities.

• Majority Voting:
This is a simple yet very popular algorithm that estimates the aggregated label as the one that receives the most votes (Littlestone & Warmuth, 1989; Tran-Thanh et al., 2013). It assigns a point mass to the label with the highest consensus among a set of judgments. Thus, the algorithm does not represent its uncertainty around a classification, and it considers all judgments as coming from reliable workers.

• Vote Distribution:
This method estimates the true label based on the empirical probabilities of each class observed in the judgment set (Simpson et al., 2015). Specifically, it assigns the probability of a label as the fraction of judgments corresponding to that label.

• Random:
This is a baseline method that assigns random class labels to all the tasks, i.e., it assigns uniform probabilities to all the labels.

Note that the alternative variant of BCCTime that captures only the time spent is redundant. In fact, when the workers' propensity is not modelled together with the time spent, the workers' accuracy is captured only by their confusion matrices. This means that the model is equivalent to BCC, which is already included in the benchmarks. All these benchmarks were also implemented in Infer.NET and trained using the EP algorithm.

11. In particular, we refer to One coin as the unconstrained version of ZenCrowd (Demartini et al., 2012) without the two unicity and SameAs constraints defined in the original method. This makes this version more suitable for a fair comparison with the other methods.

In our experiments, we set the hyperparameters of BCCTime to reproduce the typical situation in which the task requester has no prior knowledge of the true labels and the labelling probabilities of the workers, and only a basic prior knowledge about the accuracy of the workers representing that, a priori, they are assumed to be better than random annotators (Kim & Ghahramani, 2012). Therefore, the workers' confusion matrices are initialised with a slightly higher value on the diagonal (0.6) and lower values on the rest of the matrix, meaning that a priori the workers are assumed to be better than random. The Dirichlet priors for $\mathbf{p}$ and $\mathbf{s}$ are set uninformatively with uniform counts. The Gaussian priors for the tasks' time thresholds are set with means $\sigma_0 = 10$ and $\lambda_0 = 50$ and equal small precisions $\gamma_0 = \delta_0$, meaning that a priori each entity linking task is expected to be completed within 10 to 50 seconds. Furthermore, we initialise the Beta prior of $\psi_k$ as a function of the number of tasks, with $\alpha_0 = 0.7N$ and $\beta_0 = 0.3N$, to represent the fact that a priori a worker is considered reliable if she makes valid labelling attempts for 70% of her tasks. Importantly, given the shape of the distribution of the workers' completion times observed in the datasets (see Figure 2), we apply a logarithmic transformation to $\tau_i^{(k)}$ in order to obtain a more uniform distribution of the workers' completion times in the training data. Finally, the priors of all the benchmarks were set equivalently to those of BCCTime.
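For reference, the two voting baselines admit a direct stdlib implementation (a sketch, distinct from the Infer.NET implementations used in the experiments):

```python
# Sketch: the Majority voting and Vote distribution baselines.
from collections import Counter

def majority_vote(judgments):
    """Return the label with the most votes (a point mass on the winner)."""
    counts = Counter(judgments)
    return counts.most_common(1)[0][0]

def vote_distribution(judgments, classes):
    """Return the empirical class probabilities observed in the judgment set."""
    counts = Counter(judgments)
    n = len(judgments)
    return {c: counts.get(c, 0) / n for c in classes}

votes = ["A", "B", "A", "A", "C"]
label = majority_vote(votes)
probs = vote_distribution(votes, ["A", "B", "C"])
```

Neither baseline models worker reliability, which is precisely the weakness the confusion-matrix methods address.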
We evaluate the classification accuracy of the tested methods as measured by the Area Under the ROC Curve (AUC) for ZC-US and ZC-IN, and by the average recall for WS-AMT. In particular, the former is a standard accuracy metric that evaluates the performance of binary classifiers over a range of discriminant thresholds applied to their predictive class probabilities (Hanley & McNeil, 1982), and is well suited to the two binary ZenCrowd datasets. The latter is the recall averaged over the class categories, which is the main metric used to score the probabilistic methods that competed in the 2013 CrowdFlower shared task challenge on a dataset equivalent to WS-AMT (see Section 3.1).
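Both metrics are easy to state precisely in code. The sketch below (ours) computes the AUC via the pairwise Mann-Whitney formulation, which equals the area under the ROC curve, and the average recall as the macro-average over classes:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_recall(preds, truths, classes):
    """Recall averaged over the class categories (macro-average)."""
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(truths) if t == c]
        per_class.append(sum(preds[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)
```

The rank-based AUC avoids having to enumerate discriminant thresholds explicitly, and the macro-average recall weights each class equally regardless of its frequency.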
Table 3 reports the AUC of the tested methods on the ZenCrowd datasets. Specifically, it shows that BCCTime and BCCPropensity have the highest accuracy in both datasets: their AUC is 11% higher in ZC-IN and 8% higher in ZC-US, respectively, compared to the other methods. Among the two, BCCTime is the best method, with an improvement of 13% in ZC-IN and 1% in ZC-US. Similarly, Table 4 reports the average recall of the methods on WS-AMT, showing that BCCTime has the highest average recall, which is 2% higher than the second best benchmark (Vote distribution) and 4% higher than BCCPropensity. This means that the inference of the time thresholds, which already provides valuable information about the tasks extracted from the judgments, also adds an extra quality improvement to the aggregated labels on top of the modelling of the workers' propensities. This is an important observation because it proves that the information of workers' completion time can be effectively used for data aggregation. Altogether, this information allows the model to correctly filter unreliable judgments and consequently provide more accurate classifications.

12. It should be noted, however, that in cases where a different type of knowledge is available about the workers, this information can be plugged into our method by selecting the appropriate prior distributions.

Table 3: The AUC of the tested methods measured on the ZenCrowd datasets. The highest AUC in each dataset is highlighted in bold.

Dataset:            ZC-US    ZC-IN
Majority vote       0.3820   0.3862
Vote distribution   0.2101   0.3080
One coin            0.7204   0.6263
Random              0.5000   0.5000
BCC                 0.6418   0.5407
CBCC                0.6730   0.5544
BCCPropensity       0.7740   0.6177
BCCTime

Table 4: The average recall of the tested methods measured on WS-AMT. The highest average recall is highlighted in bold.

Dataset:            WS-AMT
Majority vote       0.727
Vote distribution   0.728
One coin            N/A
Random              0.183
BCC                 0.705
CBCC                0.711
BCCPropensity       0.703
BCCTime

Figure 6 shows the ROC curves of the methods for the two ZenCrowd (binary) datasets, namely the plot of the true positive rate against the false positive rate obtained for different discriminant thresholds. The graph shows that the true positive rate of BCCTime is generally higher than that of the benchmarks at the same false positive rate. In detail, Majority vote and Vote distribution perform worse than Random on these datasets: they are clearly penalised by the presence of less reliable workers, as they treat all workers as equally reliable. Interestingly, One coin performs better than BCC and CBCC, meaning that the confusion matrix is better approximated by a single (one-coin) parameter for these two datasets. Also, looking at the percentages of the workers' propensities inferred by BCCTime reported in Table 5, we found that 93.2% of the workers in ZC-US, 60% of the workers in ZC-IN and 97.3% of the workers in WS-AMT have a propensity greater than 0.5. This means that, in ZC-US and WS-AMT, only a few workers were identified as suspected spammers, while the majority of them were estimated as more reliable with different propensity values. In ZC-IN, the percentage of suspected spammers is higher, and this is also reflected in the lower accuracy of the judgments with respect to the gold standard labels.

Figure 6: The ROC curves of the aggregation methods for ZC-US (a) and ZC-IN (b).

Table 5: The propensity of workers learnt by BCCTime in each dataset.

Dataset:   % high-propensity workers ($p(\psi_k) > 0.5$)   % low-propensity workers ($p(\psi_k) \leq 0.5$)
ZC-US      93.2                                            6.8
ZC-IN      60.0                                            40.0
WS-AMT     97.3                                            2.7

Figure 7 shows the mean value of the inferred upper-bound time threshold $\lambda_i$ (blue cross points) and the workers' maximum completion time (green asterisked points) for each task of the three datasets. Looking at the raw data in the ZenCrowd datasets, the average maximum time spent by the US workers is higher (approx. 1.7 minutes) than that of the Indian workers (approx. 1 minute). It can also be seen that in both datasets there is a significant portion of outliers that reach up to 50 minutes. However, as discussed in Section 3, we know that many of these entity linking tasks are fairly simple: some of them can easily be solved through visual inspection of the candidate URI. This implies that a normal worker who completes the task in a single session (i.e., with no interruptions) should not take such a long time to issue her judgment. Interestingly, BCCTime efficiently removes these outliers and recovers more realistic estimates of the maximum duration of an entity linking task. In fact, its estimated upper-bound time thresholds lie within a smaller time band, i.e., around 10 seconds in ZC-US and 40 seconds in ZC-IN. Similar results are also observed in WS-AMT, where the average observed maximum time is significantly higher than the average inferred maximum time, thus suggesting that the BCCTime estimates are also more realistic in this dataset. In addition, Figure 7 shows the same plot for the average duration as estimated by BCCTime (i.e., $\mathbb{E}[\lambda_i] - \mathbb{E}[\sigma_i]$, $\forall i$) and the average worker's completion time for each task. The graphs show that the BCCTime estimates are similar across the micro-tasks of the three datasets, i.e., between 3 and 5 seconds, while the same estimates obtained from the workers' completion time data are much higher: 53 seconds for ZC-US, 45 seconds for ZC-IN and 80 seconds for WS-AMT. Again, this is due to the presence of outliers in the original data that significantly bias the empirical average times towards high values.
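A toy numerical example (hypothetical values, not taken from the datasets) shows how a single long-session outlier inflates the empirical mean completion time, whereas restricting attention to an assumed inferred window $[\sigma_i, \lambda_i]$ recovers a realistic estimate:

```python
# Hypothetical completion times in seconds; one 50-minute outlier.
times = [6, 7, 8, 9, 10, 11, 12, 3000]

naive_mean = sum(times) / len(times)

# Keep only times inside an assumed inferred window [sigma_i, lambda_i] = [2, 40].
window = [t for t in times if 2 <= t <= 40]
windowed_mean = sum(window) / len(window)
```

The naive mean is dominated by the outlier, while the windowed mean reflects the typical task duration; this is the effect the latent thresholds produce automatically during inference.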
Moreover, measuring the variability in the two sets of estimates, the BCCTime estimates have a much smaller standard deviation that is up to 100% lower than that of the empirical averages. This means that our estimates are more informative when compared to the normal average times obtained from the raw workers' completion time data.

Figure 7: The plot of the inferred (+) and observed (*) maximum time spent on the tasks in ZC-US (a), ZC-IN (c) and WS-AMT (e), and the average time spent on the tasks in ZC-US (b), ZC-IN (d) and WS-AMT (f).

To evaluate the performance of the methods against data sparsity, Figure 8 shows the accuracy measured over sub-samples of judgments in each dataset. In more detail, One coin is more accurate over sparse judgments in ZC-IN and ZC-US, while in WS-AMT there is no clear winner, since all the methods except Random have a similar average recall when trained on sparse judgments. This shows that BCCTime in its current form does not necessarily outperform the other methods with sparse data. This can be explained by the fact that the extra latent variables (i.e., the workers' propensity and the time thresholds) used to improve the quality of the final labels also require a larger set of judgments to be accurately learnt. However, to address this issue, it is possible to draw from community-based models (e.g., CBCC) to design a hierarchical extension of BCCTime over, for example, the workers' confusion matrices, and so improve its robustness on sparse data. Here, for simplicity, BCCTime is presented based on a simpler instance of the Bayesian classifier combination framework (i.e., the BCC model), and its community-based version is considered a straightforward extension.

Figure 8: The AUC in ZC-US (a) and ZC-IN (b) and the average recall in WS-AMT (c) of the methods trained over increasingly large sub-sets of judgments.
6. Related Work
Here we review the rest of the previous work relating to aggregation models and time analysis in crowdsourcing contexts, extending the background of the methods already considered in our experimental evaluation. In recent years, a large body of literature has focussed on the development of smart data aggregation methods to aid requesters in combining judgments from multiple workers. In general, existing methods vary by assumptions and complexity in modelling the different aspects of labelling noise. The interested reader may refer to the survey by Sheshadri and Lease (2013), as well as to the summary in Table 6, which lists the most popular methods and their comparison with our approach.

In particular, some of these methods are able to handle both binary classification problems, i.e., when workers have to vote on objects between two possible classes, and multi-class classification problems, i.e., when workers have to vote on objects among more than two classes. Among these, many approaches use the one coin model introduced in our benchmarks. In more detail, this model represents the worker's reliability with a single parameter defined within the range [0,
1] (0 = unreliable worker, 1 = reliable worker) (Karger et al., 2011; Liu et al., 2012; Demartini et al., 2012; Li et al., 2014). Specifically, Karger et al. (2011) combine this model with a budget-limited task allocation framework and provide strong theoretical guarantees on the asymptotic optimality of the inference of the workers' reliability and of the worker-task matching. Liu et al. (2012) use a more general variational inference model that reduces to Karger et al.'s method, as well as to other algorithms, under special conditions. Other methods use a two coin model that represents the bias of a worker towards the positive labelling class (sensitivity) and towards the negative class (specificity) (Raykar, Yu, Zhao, Valadez, Florin, Bogoni, & Moy, 2010; Rodrigues, Pereira, & Ribeiro, 2014; Bragg & Weld, 2013). These quantities may then be inferred using logistic regression, as in (Raykar et al., 2010), or maximum-a-posteriori approaches, as in (Bragg & Weld, 2013). Alternatively, Rodrigues et al. (2014) use the two coin model embedded in a Gaussian process classification framework to compute the predictive probabilities of the aggregated labels and the workers' reliability using EP. Along the same lines, other models reason about the difficulty of a task, which affects the quality of a judgment, to improve the reliability of the aggregated labels (Whitehill et al., 2009; Bachrach et al., 2012; Kajino & Kashima, 2012). In this area, Whitehill et al. (2009) use a logistic regression model to incorporate the task's difficulty, together with the expertise of the worker, for labelling images. In contrast, Bachrach et al. (2012) use the difference between these two quantities to quantify the advantage that the worker may have in classifying the object, within a joint difficulty-ability-response model.
In a similar setting, Kajino and Kashima (2012) exploit a convex formulation of this problem to improve the efficiency of inferring these quantities through a numerical optimisation method. Additional factors, such as the worker's motivation or propensity for a particular task, are taken into account in more sophisticated models introduced by (Welinder et al., 2010; Yan, Rosales, Fung, Schmidt, Valadez, Bogoni, Moy, & Dy, 2010; Bi, Wang, Kwok, & Tu, 2014). More recently, Nushi et al. devised a method that leverages the fact that the error rates of the workers are directly affected by the access path they follow, where the access path represents several contextual features of the task (e.g., task design, information sources and task composition). However, unlike our work, none of these methods learns the confusion matrix of each worker. As a result, they do not represent reliability, i.e., the accuracy and the potential biases of a worker, with a single data structure.

Alternative models that do learn the confusion matrices of the workers have been presented, among others, by (Dawid & Skene, 1979; Zhou, Basu, Mao, & Platt, 2012; Kim & Ghahramani, 2012; Venanzi et al., 2014). In particular, Dawid and Skene (1979) introduced the first confusion matrix-based model, in which the confusion matrices are inferred using expectation-maximisation in an unsupervised manner. Zhou et al. (2012) then extended this work to include a task-specific latent matrix representing the confusability of a task as perceived by the workers. However, neither of these methods considers the uncertainty over the worker's reliability and the other parameters of their models. For example, when only one label is obtained from a worker, these methods may infer that the worker is perfectly reliable or totally incompetent when, in reality, the worker is neither.
To overcome this limitation, other methods, such as BCC and CBCC, capture the uncertainty in the worker's expertise and the true labels using a Bayesian learning framework. These two methods were extensively discussed earlier (see Sections 2 and 5) and are included as benchmarks in our experiments. Similarly to CBCC, other methods leverage groups of workers with equivalent reliability to improve the quality of the aggregated labels with limited data (Li et al., 2014; Bi et al., 2014; Kajino & Kashima, 2012; Yan et al., 2010). However, as already noted, none of these methods uses any extra information other than the workers' judgments to learn their probabilistic models. As a result, unlike our approach, they cannot take full advantage of the time information provided by the crowdsourcing platform to improve the quality of their inference results.

Now we turn to the problem of time analysis in crowd-generated content. Recent work introduced a metric for measuring the effort required to complete a crowdsourced task based on the area under the error-time curve (ETA). As such, this metric supports the idea of considering time as an important factor in a crowdsourcing effort. In this regard, a closely related work on the analysis of the ZenCrowd datasets (see Section 3) was presented by Difallah, Demartini, and Cudré-Mauroux (2012). Their work showed that workers who complete their tasks too fast or too slow are typically less accurate than the others. These findings were also confirmed in our work. In addition, however, we extended their analysis by showing that the judgment's quality is correlated to the time spent by the workers in different ways for specific task instances. This is the intuition that our method exploits to efficiently combine the workers' completion time features in the data aggregation process.
Furthermore, earlier work by Wang et al. (2011) introduced a method that predicts the duration of a task from a number of available features (including the task's price, the creation time and the number of assignments) using a survival analysis model. However, their method does not deal with aggregating labels, nor with learning the accuracy of the workers, as we do in our approach.
7. Conclusions
We presented and evaluated BCCTime, a new time-sensitive aggregation method that simultaneously merges crowd labels and estimates the duration of individual task instances using principled Bayesian inference. The key innovation of our method is to leverage an extended set of features comprising the workers' completion times and the judgment set. When appropriately correlated, these features become important indicators of the reliability of a worker and, in turn, allow us to estimate the final labels, the tasks' durations and the workers' reliability more accurately. Specifically, we introduced a new representation of the accuracy profile of a worker consisting of both the worker's confusion matrix, which accounts for the worker's labelling probabilities in each class, and the worker's propensity to valid labelling, which represents the worker's intention to meaningfully participate in the labelling process. Furthermore, we used latent variables to represent the duration of each task, using pairs of latent thresholds to capture the time interval in which the best judgments for that task are likely to be submitted by honest workers. In this way, the model can deal with differences in the time length of each task instance arising from the different types of correlation between the quality of the received judgments and the time spent by the workers. In fact, such task-specific correlations have been observed in our experimental analysis of crowdsourced datasets, in which various task instances showed different types of quality-time trends. Thus, the main idea behind BCCTime is to model these trends in the aggregation of crowd judgments to make more reliable inferences about all the quantities of interest.

Table: Comparison of existing methods for computing aggregated labels from crowdsourced judgments.

Method                              | binary class | multi class | worker accuracy | worker confusion matrix | task difficulty | task duration | worker type or propensity
Majority voting                     | ✓ | ✓ | - | - | - | - | -
DS (Dawid & Skene, 1979)            | ✓ | ✓ | ✓ | ✓ | - | - | -
GLAD (Whitehill et al., 2009)       | ✓ | - | ✓ | - | ✓ | - | -
RY (Raykar et al., 2010)            | ✓ | - | ✓ | - | - | - | -
CUBAM (Welinder et al., 2010)       | ✓ | - | ✓ | - | - | - | ✓
YU (Yan et al., 2010)               | ✓ | ✓ | ✓ | - | - | - | -
LDA (Wang et al., 2011)             | - | - | - | - | - | ✓ | -
KJ (Kajino et al., 2012)            | ✓ | - | ✓ | - | - | - | ✓
ZenCrowd (Demartini et al., 2012)   | ✓ | - | ✓ | - | - | - | -
DARE (Bachrach et al., 2012)        | ✓ | ✓ | ✓ | - | ✓ | - | -
MinMaxEntropy (Zhou et al., 2012)   | ✓ | ✓ | ✓ | ✓ | - | - | -
BCC (Kim & Ghahramani, 2012)        | ✓ | ✓ | ✓ | ✓ | - | - | -
MSS (Qi et al.)                     | ✓ | ✓ | ✓ | - | - | - | ✓
MLNB (Bragg et al., 2013)           | ✓ | ✓ | ✓ | - | - | - | -
BM (Bi et al., 2014)                | ✓ | - | ✓ | - | ✓ | - | -
GP (Rodrigues et al., 2014)         | ✓ | - | ✓ | - | - | - | -
LU (Liu et al., 2012)               | ✓ | - | ✓ | - | - | - | -
WM (Li et al., 2014)                | ✓ | ✓ | ✓ | - | - | - | ✓
CBCC (Venanzi et al., 2014)         | ✓ | ✓ | ✓ | ✓ | - | - | ✓
APM (Nushi et al.)                  | ✓ | ✓ | ✓ | - | - | - | -
BCCTime (proposed method)           | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
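To make this modelling assumption concrete, the following is a minimal generative sketch in Python (all names are hypothetical illustrations, not the authors' Infer.NET implementation): a judgment is drawn from the worker's confusion-matrix row only when the worker both has a genuine propensity to valid labelling and submits within the task's latent time window; otherwise it is drawn uniformly at random, as a spammer, bot or lazy labeller would.

```python
import random

def simulate_judgment(true_label, confusion_matrix, completion_time,
                      window, propensity, rng):
    """Sketch of the time-gated generative assumption (hypothetical)."""
    num_classes = len(confusion_matrix)
    lower, upper = window  # latent time window for honest workers
    on_time = lower <= completion_time <= upper
    # The propensity to valid labelling is the probability that the
    # worker genuinely attempts the task rather than spamming.
    valid = on_time and rng.random() < propensity
    if valid:
        # Draw from the confusion-matrix row of the task's true label.
        return rng.choices(range(num_classes),
                           weights=confusion_matrix[true_label])[0]
    # Off-time or low-propensity workers effectively answer at random.
    return rng.randrange(num_classes)

rng = random.Random(0)
# A reliable worker (90% accurate) answering well within the window:
matrix = [[0.90, 0.05, 0.05],
          [0.05, 0.90, 0.05],
          [0.05, 0.05, 0.90]]
labels = [simulate_judgment(1, matrix, 42.0, (10.0, 120.0), 0.95, rng)
          for _ in range(100)]
```

Inference in BCCTime runs in the opposite direction: given the observed labels and completion times, it recovers posteriors over the confusion matrices, propensities, time windows and true labels by message passing.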
Through an extensive experimental validation on real-world datasets, we showed that BCCTime produces significantly more accurate classifications, and that its estimates of the tasks' durations are considerably more informative than common heuristics computed from the raw workers' completion time data.

Against this background, this work has several implications for the design of reliable crowdsourcing systems. Firstly, the process of designing a task can exploit the unbiased task duration estimated by BCCTime. As we have shown, this information is a valid proxy for the difficulty of a task and therefore supports a number of decision-making problems, such as pricing more difficult tasks fairly and defining fair bonuses for honest workers. Secondly, the worker's propensity to valid labelling uncovers an additional dimension of worker reliability that enables us to score workers' attitudes towards correctly approaching a given task. This information is useful for selecting different task designs, or more engaging tasks, for workers who systematically approach a task incorrectly. Thirdly, our method uses only features that are readily available in common crowdsourcing systems, which allows for a faster take-up of this technology in real applications.

Building on these advances, several aspects of our current model indicate promising directions for further improvement. For example, we could introduce time-dependencies into the accuracy profile of a worker to capture the fact that workers typically improve their skills over time by performing a sequence of tasks. It would then be possible to exploit these temporal dynamics to potentially improve the quality of the final labels. In addition, some crowdsourcing settings involve continuous-valued judgments that are currently not supported by our method.
To deal with these cases, a number of non-trivial extensions to our generative model and, in turn, a new treatment of its probabilistic inference are required.

References
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. In
ACM SIGIR Forum, Vol. 42, pp. 9–15, New York, NY, USA. ACM.

Bachrach, Y., Graepel, T., Minka, T., & Guiver, J. (2012). How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 1183–1190.

Bernstein, M., Little, G., Miller, R., Hartmann, B., Ackerman, M., Karger, D., Crowell, D., & Panovich, K. (2010). Soylent: a word processor with a crowd inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, pp. 313–322. ACM.

Bi, W., Wang, L., Kwok, J. T., & Tu, Z. (2014). Learning to predict from crowdsourced data. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence (UAI).

Bishop, C. (2006). Pattern Recognition and Machine Learning, Vol. 4. Springer New York.

Bragg, J., & Weld, D. S. (2013). Crowdsourcing multi-label classification for taxonomy creation. In First AAAI Conference on Human Computation and Crowdsourcing.

Dawid, A., & Skene, A. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28.

Demartini, G., Difallah, D. E., & Cudré-Mauroux, P. (2012). ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In
Proceedings of the 21st International Conference on World Wide Web (WWW), pp. 469–478.

Difallah, D. E., Demartini, G., & Cudré-Mauroux, P. (2012). Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms. In CrowdSearch, pp. 26–30.

Faradani, S., Hartmann, B., & Ipeirotis, P. G. (2011). What's the right price? Pricing tasks for finishing on time. In Human Computation, Vol. WS-11-11 of AAAI Workshops, pp. 26–31. AAAI.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.

Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill(TM): A Bayesian skill rating system. In Advances in Neural Information Processing Systems (NIPS), pp. 569–576. MIT Press.

Kajino, H., & Kashima, H. (2012). Convex formulations of learning from crowds. Transactions of the Japanese Society for Artificial Intelligence, 133–142.

Kamar, E., Hacker, S., & Horvitz, E. (2012). Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 467–474.

Kamar, E., Kapoor, A., & Horvitz, E. (2015). Identifying and accounting for task-dependent bias in crowdsourcing. In
Third AAAI Conference on Human Computation and Crowdsourcing.

Karger, D., Oh, S., & Shah, D. (2011). Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems (NIPS), pp. 1953–1961. MIT Press.

Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. In Advances in Information Retrieval, pp. 165–176. Springer.

Kim, H., & Ghahramani, Z. (2012). Bayesian classifier combination. In International Conference on Artificial Intelligence and Statistics, pp. 619–627.

Li, H., Zhao, B., & Fuxman, A. (2014). The wisdom of minority: discovering and targeting the right group of workers for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web (WWW), pp. 165–176.

Littlestone, N., & Warmuth, M. K. (1989). The weighted majority algorithm. In , pp. 256–261. IEEE.

Liu, Q., Peng, J., & Ihler, A. (2012). Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems (NIPS), pp. 692–700. MIT Press.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 362–369.

Minka, T., & Winn, J. (2008). Gates. In Advances in Neural Information Processing Systems (NIPS), pp. 1073–1080. MIT Press.

Minka, T., Winn, J., Guiver, J., & Knowles, D. (2014). Infer.NET 2.6. Microsoft Research Cambridge.

Minka, T. P. (2001). A Family of Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Massachusetts Institute of Technology.

Ramchurn, S. D., Huynh, T. D., Ikuno, Y., Flann, J., Wu, F., Moreau, L., Jennings, N. R., Fischer, J. E., Jiang, W., Rodden, T., et al. (2015). HAC-ER: a disaster response system based on human-agent collectives. In , pp. 533–541.

Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds.
The Journal of Machine Learning Research, 1297–1322.

Rodrigues, F., Pereira, F., & Ribeiro, B. (2014). Gaussian process classification and active learning with multiple annotators. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 433–441.

Sheng, V., Provost, F., & Ipeirotis, P. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 614–622. ACM.

Sheshadri, A., & Lease, M. (2013). SQUARE: A benchmark for research on computing crowd consensus. In Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pp. 156–164.

Simpson, E., Roberts, S., Psorakis, I., & Smith, A. (2013). Dynamic Bayesian combination of multiple imperfect classifiers. In Decision Making and Imperfection, pp. 1–35. Springer.

Simpson, E., Venanzi, M., Reece, S., Kohli, P., Guiver, J., Roberts, S., & Jennings, N. R. (2015). Language understanding in the wild: Combining crowdsourcing and machine learning. In , pp. 992–1002. ACM.

Simpson, E. (2014). Combined Decision Making with Multiple Agents. Ph.D. thesis, University of Oxford.

Tran-Thanh, L., Venanzi, M., Rogers, A., & Jennings, N. R. (2013). Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In The 12th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 901–908.

Venanzi, M., Guiver, J., Kazai, G., Kohli, P., & Shokouhi, M. (2014). Community-based Bayesian aggregation models for crowdsourcing. In , pp. 155–164. ACM.

Wang, J., Faridani, S., & Ipeirotis, P. (2011). Estimating the completion time of crowdsourced tasks using survival analysis models. In
Crowdsourcing for Search and Data Mining (CSDM), Vol. 31, pp. 31–34.

Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems (NIPS), Vol. 10, pp. 2424–2432. MIT Press.

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., & Movellan, J. R. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems (NIPS), Vol. 22, pp. 2035–2043. MIT Press.

Yan, Y., Rosales, R., Fung, G., Schmidt, M., Valadez, G. H., Bogoni, L., Moy, L., & Dy, J. (2010). Modeling annotator expertise: Learning when everybody knows a bit of something. In International Conference on Artificial Intelligence and Statistics, pp. 932–939.

Zhou, D., Basu, S., Mao, Y., & Platt, J. (2012). Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems (NIPS), pp. 2195–2203. MIT Press.

Zilli, D., Parson, O., Merrett, G. V., & Rogers, A. (2014). A hidden Markov model-based acoustic cicada detector for crowdsourced smartphone biodiversity monitoring. Journal of Artificial Intelligence Research, 805–827.