Fundamental Limits of Budget-Fidelity Trade-off in Label Crowdsourcing
Farshad Lahouti
Electrical Engineering Department, California Institute of Technology [email protected]
Babak Hassibi
Electrical Engineering Department, California Institute of Technology [email protected]
Abstract
Digital crowdsourcing (CS) is a modern approach to perform certain large projects using small contributions of a large crowd. In CS, a taskmaster typically breaks down the project into small batches of tasks and assigns them to so-called workers with imperfect skill levels. The crowdsourcer then collects and analyzes the results for inference and serving the purpose of the project. In this work, the CS problem, as a human-in-the-loop computation problem, is modeled and analyzed in an information theoretic rate-distortion framework. The purpose is to identify the ultimate fidelity that one can achieve by any form of query from the crowd and any decoding (inference) algorithm with a given budget. The results are established by a joint source channel (de)coding scheme, which represents the query scheme and inference, over parallel noisy channels, which model workers with imperfect skill levels. We also present and analyze a query scheme dubbed k-ary incidence coding and study optimized query pricing in this setting.

1 Introduction

Digital crowdsourcing (CS) is a modern approach to perform certain large projects using small contributions of a large crowd. Crowdsourcing is usually used when the tasks involved better suit humans rather than machines, or in situations where they require some form of human participation. As such, crowdsourcing is categorized as a form of human-based computation or human-in-the-loop computation system. This article examines the fundamental performance limits of crowdsourcing and sheds light on the design of optimized crowdsourcing systems. Crowdsourcing is used in many machine learning projects for labeling of large sets of unlabeled data, and Amazon Mechanical Turk (AMT) serves as a popular platform to this end. Crowdsourcing is also useful in very subjective matters, such as the rating of different goods and services, as is now widely popular in online rating platforms and applications such as Yelp.
Another example is if we wish to classify a large number of images as suitable or unsuitable for children. In so-called citizen research projects, a large number of (often human deployed or operated) sensors contribute to accomplish a wide array of crowdsensing objectives, e.g., [2] and [3].

In crowdsourcing, a taskmaster typically breaks down the project into small batches of tasks, recruits so-called workers, and assigns them the tasks accordingly. The crowdsourcer then collects and analyzes the results collectively to address the purpose of the project. The workers' pay is often low or non-existent. In cases such as labeling, the work is typically tedious, and hence the workers usually handle only a small batch of work in a given project. The workers are often non-specialists, and as such there may be errors in their completion of assigned tasks. Due to the nature of task assignment by the taskmaster, the workers and their skill levels are typically unknown a priori. In the case of rating systems, such as Yelp, there is no pay for regular reviewers but only non-monetary personal incentives; however, there are illegal reviewers who are paid to write fake reviews. In many cases of crowdsourcing, the ground truth is not known at all. The transitory or fleeting characteristic of workers, their unknown and imperfect skill levels, and their possible motivations for spamming make the design of crowdsourcing projects and the analysis of the obtained results particularly challenging. Researchers have studied the optimized design of crowdsourcing systems within the setting described for enhanced reliability. Most research reported so far is devoted to the optimized design and analysis of aggregation and inference schemes, possibly using redundant task assignment.

Figure 1: (a) An information theoretic crowdsourcing model; (b) 3IC for N = 2: valid responses; (c) invalid responses
In AMT-type crowdsourcing, two popular approaches for aggregation are majority voting and the Dawid and Skene (DS) algorithm [5]. The former sets the estimate based on what the majority of the crowd agrees on, and is provably suboptimal [8]. Majority voting is susceptible to error when there are spammers in the crowd, as it weighs the opinion of everybody in the crowd the same. The DS algorithm, within a probabilistic framework, aims at joint estimation of the workers' skill levels and a reliable label based on the data collected from the crowd. The scheme runs as an expectation maximization (EM) formulation in an iterative manner. More recent research with similar EM formulations and a variety of probabilistic models is reported in [9, 12, 10]. In [8], a label inference algorithm for CS is presented that runs iteratively over a bipartite graph. In [1], the CS problem is posed as a so-called bandit survey problem, for which the trade-offs of cost and reliability in the context of worker selection are studied. Schemes for identifying workers with low skill levels are studied in, e.g., [6]. In [13], an analysis of the DS algorithm and an improved inference algorithm are presented. Another class of works on crowdsourcing for clustering relies on convex optimization formulations of inferring clusters within probabilistic graphical models, e.g., [11] and the references therein.

In this work, a crowdsourcing problem is modeled and analyzed in an information theoretic setting. The purpose is to seek ultimate performance bounds, in terms of the CS budget (or equivalently the number of queries per item) and a CS fidelity, that one can achieve by any form of query from the workers and any inference algorithm. Two particular scenarios of interest include the case where the workers' skill levels are unknown both to the taskmaster and the crowdsourcer, and the case where the skill levels are perfectly estimated during inference by the crowdsourcer.
Within the presented framework, we also investigate a class of query schemes dubbed k-ary incidence coding and analyze its performance. At the end, we comment on an associated query pricing strategy.

2 Crowdsourcing System Model

In this Section, we present a communication system model for crowdsourcing. The model, as depicted in Figure 1a, then enables the analysis of the fundamental performance limits of crowdsourcing.

2.1 Data Set: Source
Consider a dataset X = {X_1, ..., X_L} composed of L items, e.g., images. In practice, there is a certain function B(X) ∈ B(X) of the items that is of interest in crowdsourcing and is here considered as the source. The value of this function is to be determined by the crowd for the given dataset. In the case of crowdsourced clustering, B(X_i) = B_j ∈ B(X) = {B_1, ..., B_N} indicates the bin or cluster to which the item X_i ideally belongs. We have B(X_1, ..., X_n) = B(X^n) = (B(X_1), ..., B(X_n)). The number of clusters, |B(X)| = N, may or may not be known a priori.

2.2 Crowd: Channels

The crowd is modeled by a set of parallel noisy channels in which each channel C_i, i = 1, ..., W, represents the i-th worker. The channel input is a query that is designed based on the source. The channel output is what a user perceives or responds to a query. The output may or may not be the correct answer to the query, depending on the skill level of the worker; the noisy channel is meant to model possible errors by the worker.

A suitable model for C_i is a discrete channel model. The channels may be assumed independent, on the basis that different individuals have different knowledge sets. Related probabilistic models representing the possible error in completion of a task by a worker are reviewed in [8]. Formally, a channel (worker) is represented by a probability distribution P(v|u), u ∈ U, v ∈ V, where U is the set of possible responses to a query and V is the set of choices offered to the worker in responding to a query. For the example of images suitable for children, in general we may consider a shade of possible responses to the query, U, including the extremes of totally suitable and totally unsuitable; as possible choices offered to the worker to answer the query, V, we may consider the two options of suitable and unsuitable.
As described below, in this work we consider two channel models representing possibly erroneous responses of the workers: an M-ary symmetric channel (MSC) model and a spammer-hammer channel (SHC) model.

An MSC model with parameter ε is a symmetric discrete memoryless channel without feedback [4], with input u ∈ U and output v ∈ V (|U| = |V| = M), that is characterized by the following transition probability:

P(v|u) = { ε/(M−1),  v ≠ u;  1 − ε,  v = u. }   (1)

If we consider a sequence of channel inputs u^n = (u_1, ..., u_n) and the corresponding output sequence v^n, we have P(v^n|u^n) = ∏_{i=1}^n P(v_i|u_i), which holds because of the memoryless and no-feedback assumptions. In the case of clustering and MSC, the probability of misclassifying any input from a given cluster into another cluster depends only on the worker and not on the corresponding clusters.

In the spammer-hammer channel model with probability q of being a hammer (SHC(q)), a spammer randomly and uniformly chooses a valid response to a query, and a hammer perfectly answers the query [8]. The corresponding discrete memoryless channel model without feedback, with input u ∈ U, output v ∈ V, and state C ∈ {S, H}, P(C = H) = q, is described as follows:

P(v|u, C) = { 0,  C = H and v ≠ u;  1,  C = H and v = u;  1/|V|,  C = S, }   (2)

where C ∈ {S, H} indicates whether the worker (channel) is a spammer or a hammer. In the case of our current interest, |U| = |V| = M, and we have P(v^n|u^n, C^n) = ∏_{i=1}^n P(v_i|u_i, C_i).

In the sequel, we consider the following two scenarios: when the workers' skill levels are unknown (SL-UK) and when they are perfectly known by the crowdsourcer (SL-CS). In both cases, we assume that the skill levels are not known at the taskmaster (transmitter). The presented framework can also accommodate other more general scenarios of interest.
For example, the feedforward link in Figure 1a could be used to model a channel whose state is affected by the input, e.g., the difficulty of questions. These extensions remain for future studies.

2.3 Query Scheme and Inference: Coding

In the system model presented in Figure 1a, encoding shows the way the queries are posed. A basic query is to ask the worker for the value of B(X). In the example of crowdsourcing for labeling images that are suitable for children, the query is "This image suits children; true or false?" The decoder or the crowdsourcer collects the responses of workers to the queries and attempts to infer the right label (cluster) for each of the images, even though the collected responses could in general be incomplete or erroneous.

In the case of crowdsourcing for labeling a large set of dog images with their breeds, a query may be formed by showing two pictures at once and inquiring whether they are from the same breed [11]. The queries in fact are posed as showing the elements of a binary incidence matrix, A, whose rows and columns correspond to X. In this case, A(X_i, X_j) = 1 indicates that the two are members of the same cluster (breed) and A(X_i, X_j) = 0 indicates otherwise. The matrix is symmetric and its diagonal is 1. We refer to this query scheme as Binary Incidence Coding. If we show three pictures at once and ask the user to classify them (put the pictures in similar or distinct bins), it is as if we ask about three elements of the same matrix, i.e., A(X_1, X_2), A(X_1, X_3) and A(X_2, X_3) (Ternary Incidence Coding). In general, if we show k pictures as a single query, it is equivalent to inquiring about C(k, 2) (choose 2 out of k) entries of the matrix (k-ary Incidence Coding or kIC).
As we elaborate below, out of the 2^{C(k,2)} possibilities, a number of the choices remain invalid, and this provides an error correction capability for kIC.

Figures 1b and 1c show the graphical representation of 3IC and the choices a worker would have in clustering with this code. The nodes denote the items and the edges indicate whether they are in the same cluster. In 3IC, if X_1 and X_2 are in the same cluster as X_3, then all three of them are in the same cluster. It is straightforward to see that in 3IC and for N = 2, we only have four valid responses (Figure 1b) to a query, as opposed to 2^{C(3,2)} = 8. The first item in Figure 1c is invalid because there are only two clusters (N = 2); in case we do not know the number of clusters, or N ≥ 3, this would remain a valid response. In this setting, the encoded signal u can be one of the four valid symbols in the set U; and similarly, what the workers may select, v (the decoded signal over the channel), is from the set V, where U = V. As such, since in kIC the obviously erroneous answers are removed from the choices a worker can make in responding to a query, one expects an improved overall CS performance, i.e., an error correction capability for kIC. In Section 4, we study the performance of this code in greater detail. Note that in clustering with kIC (k ≥ 2) as described above, the code identifies clusters up to their specific labelings.

While we presented kIC as a concrete example, there may be many other forms of query or coding schemes. Formally, the code is composed of encoder and decoder mappings:

C : B(X)^n → U^{Σ_{i=1}^W m_i},    C′ : V^{Σ_{i=1}^W m_i} → B̂(X)^n,   (3)

where n is the block size or number of items in each encoding (we assume n | L), and m_i is the number of uses of channel C_i, or queries that worker i, 1 ≤ i ≤ W, handles. In many practical cases of interest, we have B̂(X) = B(X), and we may have n = L. The rate of the code is R = Σ_{i=1}^W m_i / n queries per item.
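The valid-response counting above can be reproduced by brute force. The following sketch (ours, not from the paper) enumerates all 2^{C(k,2)} assignments of the pairwise incidence entries for k items and keeps those that are consistent with some partition into at most N clusters:

```python
from itertools import combinations, product

def valid_kic_responses(k, n_clusters=None):
    """Enumerate assignments of the C(k,2) incidence entries that are
    consistent with a partition of k items (into at most n_clusters
    clusters, if given). Returns the list of valid assignments."""
    pairs = list(combinations(range(k), 2))
    valid = []
    for bits in product([0, 1], repeat=len(pairs)):
        same = dict(zip(pairs, bits))
        # Group items implied by the "same cluster" edges (union-find style).
        parent = list(range(k))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for (i, j), b in same.items():
            if b:
                parent[find(i)] = find(j)
        # Consistency: every pair's bit must match the implied grouping
        # (this enforces transitivity, as in Figure 1).
        ok = all(b == (find(i) == find(j)) for (i, j), b in same.items())
        if ok and n_clusters is not None:
            ok = len({find(x) for x in range(k)}) <= n_clusters
        if ok:
            valid.append(bits)
    return valid

# k = 3, N = 2: four valid responses out of 2^C(3,2) = 8, as in Figure 1b.
print(len(valid_kic_responses(3, n_clusters=2)))  # 4
```

Without the cluster-count cap, the count is the Bell number of k (5 for k = 3), matching the remark that the first item of Figure 1c becomes valid when N ≥ 3 or N is unknown.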
In this setting, C′(C(B(X^n))) = B̂(X^n). Depending on the availability of feedback from the decoder to the encoder, the code can adapt for optimized performance. The feedback could provide the encoder with the results of prior queries. We here focus on non-adaptive codes in (3), which are static and remain unchanged over the period of the crowdsourcing project. We will elaborate on code design in Section 3.

Depending on the type of code in use and the objectives of crowdsourcing, one may design different decoding schemes. For instance, in the simple case of directly querying workers about the function B(X_i), with multiple queries for each item, popular approaches are majority voting and the EM-style decoding due to Dawid and Skene [5], which attempts to jointly estimate the workers' skill levels and decode B(X). In the case of clustering with incidence coding, an inference scheme based on convex optimization is presented in [11].

The rate of the code is proportional to the CS budget, and we use the rate as a proxy for budget throughout this analysis. However, since different types of query have different costs, both financially (in crowdsourcing platforms) and in terms of the time or effort it takes the user to process them, one needs to be careful in comparing the results of different coding schemes. We shall elaborate on this issue for the case of kIC in Appendix E.

2.4 Distortion and the Design Problem
In the framework of Figure 1a, we are interested in designing the CS code, i.e., the query and inference schemes, such that with a given budget a certain CS fidelity is optimized. We consider the fidelity as an average distortion with respect to the source (dataset). For a distance function d(B(x), B̂(x)), for which d(B(x^n), B̂(x^n)) = (1/n) Σ_{i=1}^n d(B(x_i), B̂(x_i)), the average distortion is

D(B(X), B̂(X)) = E d(B(X^n), B̂(X^n)) = Σ_{X^n} P(B(X^n)) P(B̂(X^n)|B(X^n)) d(B(X^n), B̂(X^n)),   (4)

where P(B(X^n)) = ∏_{i=1}^n P(B(X_i)) for iid B(X). The design problem is therefore one of CS fidelity-query budget optimization (or distortion-rate, D(R), optimization) and may be expressed as follows:

D*(R_t) = min_{C, C′, R ≤ R_t} D(B(X), B̂(X)),   (5)

where R_t is a target rate or query budget. The optimization is with respect to the coding and decoding schemes, the type of feedback (if applicable), and the query assignment and rate allocation. The optimum solution to the above problem is referred to as the distortion-rate function, D*(R_t) (or CS fidelity-query budget function). A basic distance function, for the case where B(X) is discrete, is the indicator of B(X) ≠ B̂(X), i.e., the Hamming distance. In this case, the average distortion D(B(X), B̂(X)) reflects the average probability of error. As such, the D(R) optimization problem may be rewritten as follows:

D*(R_t) = min_{C, C′, R ≤ R_t} P(E : B̂(X) ≠ B(X)).   (6)

In case of crowdsourcing for clustering, this quantifies the performance in terms of the overall probability of error in clustering. For other crowdsourcing problems, we may consider other distortion functions. Equivalently, we may consider minimizing the rate subject to a constrained distortion in crowdsourcing.
The R(D) problem is expressed as follows:

R*(D_t) = min_{C, C′ : D(B(X), B̂(X)) ≤ D_t} R = min_{C, C′ : D(B(X), B̂(X)) ≤ D_t} Σ_{i=1}^W m_i / n,   (7)

where D_t is a target distortion or average probability of error. The optimum solution to the above problem is referred to as the rate-distortion function, R*(D_t) (CS query budget-fidelity function). In case the taskmaster does not know the skill levels of the workers, different workers, regardless of their skill levels, would receive the same number of queries (m_i = m′, ∀i), and the code design involves designing the query and inference schemes.

In the CS budget-fidelity optimization problem in (5), the code providing the optimized solution needs to balance two opposing design criteria to meet the target CS fidelity: on one hand, the design aims at efficiency of the query, making as small a number of queries as possible; on the other hand, the code needs to take into account the imperfection of worker responses and incorporate sufficient redundancy. In the information theory (coding theory) realm, the former corresponds to source coding (compression) and the latter corresponds to channel coding (error control coding); coding that serves both purposes is a joint source channel code.

3 Fundamental Limits

In this Section, we first present a brief overview of joint source channel coding and related results in information theory. Next, we present the CS budget-fidelity function in the two cases of SL-UK and SL-CS described in Section 2.2.
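For intuition about the distortion in (6), the following simulation sketch (ours; the parameter values are illustrative assumptions) estimates the average Hamming distortion when each item's label is queried directly R times from MSC(ε) workers and decoded by majority vote:

```python
import random
from collections import Counter

def simulate_majority_vote(n_items=2000, n_labels=2, eps=0.3,
                           queries_per_item=5, seed=0):
    """Empirical average Hamming distortion (error rate) when each item's
    label is asked directly from MSC(eps) workers R times and the answers
    are decoded by majority vote (ties broken arbitrarily)."""
    rng = random.Random(seed)
    labels = list(range(n_labels))
    errors = 0
    for _ in range(n_items):
        truth = rng.choice(labels)
        votes = []
        for _ in range(queries_per_item):
            if rng.random() < eps:
                # Worker errs: uniform over the M-1 wrong labels, as in (1).
                votes.append(rng.choice([l for l in labels if l != truth]))
            else:
                votes.append(truth)
        estimate = Counter(votes).most_common(1)[0][0]
        errors += (estimate != truth)
    return errors / n_items

# For eps = 0.3 and R = 5 the binomial calculation gives about 0.163.
print(simulate_majority_vote())
```

This is exactly the repetition-coding baseline that the bounds of Section 3 benchmark against: rate R = 5 queries per item buys a distortion well below the single-query error ε.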
Consider the communication of a random source Z from a finite alphabet Z over a discrete memoryless channel. The source is first processed by an encoder C, whose output is communicated over the channel. The channel output is processed by a decoder C′, which reconstructs the source as Ẑ ∈ Ẑ, and we often have Z = Ẑ.

From a rate-distortion theory perspective, we first consider the case where the channel is error free. The source is iid with probability mass function P(Z) and, based on Shannon's source coding theorem, is characterized by a rate-distortion function

R*(D_t) = min_{C, C′ : D(Z, Ẑ) ≤ D_t} I(Z; Ẑ),   (8)

where I(·; ·) indicates the mutual information between two random variables. The source coding is defined by the following two mappings:

C : Z^n → {1, ..., 2^{nR}},    C′ : {1, ..., 2^{nR}} → Ẑ^n.   (9)

The average distortion is defined in (4), and D_t is the target performance. The optimization in source coding with distortion is with respect to the source coding or compression scheme, described probabilistically in information theory as P(Ẑ|Z). The proof of the source coding theorem follows in two steps: in the first step, we prove that any rate R ≥ R*(D_t) is achievable, in the sense that there exists a family (as a function of n) of codes {C_n, C′_n} for which, as n grows to infinity, the resulting average distortion satisfies the desired constraint.
In the second step, or the converse, we prove that any code with rate R < R*(D_t) results in an average distortion that violates the desired constraint. This establishes the described rate-distortion function as the fundamental limit for lossy compression of a source with a desired maximum average distortion.

From the perspective of Shannon's channel coding theorem, we consider the source as an iid uniform source and the channel as a discrete memoryless channel characterized by P(V|U), where U ∈ U is the channel input and V ∈ V is the channel output. The channel coding is defined by the following two mappings:

C : {1, ..., |Z|} → U^n,    C′ : V^n → {1, ..., |Z|}.   (10)

The theorem establishes the capacity of the channel as C = max_{P(U)} I(U; V) and states that for a rate R, there exists a channel code that provides reliable communication over the noisy channel if and only if R ≤ C. Again the proof follows in two steps: first, we establish achievability, i.e., we show that for any rate R ≤ C, there exists a family of codes (as a function of length n) for which the average probability of error P(Ẑ ≠ Z) goes to zero as n grows to infinity. Next, we prove the converse, i.e., we show that for any rate R > C, the probability of error is always greater than zero and approaches one exponentially fast as R goes beyond C. This establishes the described capacity as the fundamental limit for transmission of an iid uniform source over a discrete memoryless channel.

For the problem of our interest, i.e., the transmission of an iid source (not necessarily uniform) over a discrete memoryless channel, the joint source channel coding theorem, aka the source-channel separation theorem, is instrumental. The theorem states that in this setting a code exists that can facilitate reconstruction of the source with distortion D(Z, Ẑ) ≤ D_t if and only if R*(D_t) < C. For completeness, we reproduce the theorem from [4] below.

Theorem 1
Let Z be a finite alphabet iid source which is encoded as a sequence of n input symbols U^n of a discrete memoryless channel with capacity C. The output of the channel V^n is mapped onto the reconstruction alphabet Ẑ^n = C′(V^n). Let D(Z^n, Ẑ^n) be the average distortion achieved by this joint source and channel coding scheme. Then distortion D_t is achievable if and only if C > R*(D_t).

The proof follows the similar two-step approach described above and assumes large block length (n → ∞). The result is important from a communication theoretic perspective, as a concatenation of a source code, which removes the redundancy and produces an iid uniform output at a rate R > R*(D_t), and a channel code, which communicates this reliably over the noisy channel at a rate R < C, can achieve the same fundamental limit.
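For a concrete instance of the quantities in Theorem 1, in the binary case with Hamming distortion both sides are explicit: R*(D) = H_b(p) − H_b(D) for a Bernoulli(p) source, and C = 1 − H_b(ε) for a binary symmetric channel. A small sketch (ours) checks the achievability condition C > R*(D_t) at one channel use per source symbol:

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_binary(p, d):
    """R*(D) of a Bernoulli(p) source under Hamming distortion."""
    if d >= min(p, 1 - p):
        return 0.0  # the target distortion is reachable at rate zero
    return h_b(p) - h_b(d)

def bsc_capacity(eps):
    return 1.0 - h_b(eps)

def achievable(p, eps, d_t):
    """Separation-theorem condition at one channel use per source symbol."""
    return bsc_capacity(eps) > rate_distortion_binary(p, d_t)

# Uniform bits over a BSC(0.1): C ~= 0.531; D_t = 0.15 needs R* ~= 0.39.
print(achievable(0.5, 0.1, 0.15))  # True
```

Tightening the target to D_t = 0.01 pushes R* above C, and the same check returns False, illustrating the "only if" direction of the theorem.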
We here consider crowdsourcing within the presented framework and derive basic information theoretic bounds. Following Section 2.1, we examine the case where a large dataset X (L → ∞) and a function of interest B(X), with associated probability mass function P(B(X)), are available. We consider the MSC worker pool model described in Section 2.2, where the skill levels of the workers are drawn from a discrete set E = {ε_1, ε_2, ..., ε_{W′}} with probability P(ε), ε ∈ E. The number of workers in each skill level class is assumed large. We here study the two scenarios of SL-UK and SL-CS.

At any given instance, a query is posed to a random worker with a random skill level within the set E. We assume there is no feedback available from the decoder (non-adaptive coding) and the queries do not influence the channel probabilities (no feedforward). Extensions remain for future work. The following theorem identifies the information theoretic minimum number of queries per item to perform at least as well as a target fidelity in case the skill levels are not known (SL-UK). The bound is oblivious to the type of code used and serves as an ultimate performance bound.

Theorem 2
In crowdsourcing for a large dataset of an N-ary discrete source B(X) ∼ P(B(X)) with Hamming distortion, when a large number of unknown workers with skill levels ε ∈ E, ε ∼ P(ε), from an MSC population participate (SL-UK), the minimum number of queries per item to obtain an overall error probability of at most ε̂ is given by

R_min = { (H(B(X)) − H_N(ε̂)) / (log M − H_M(E(ε))),   ε̂ ≤ min{1 − p_max, 1 − 1/N};   0,   otherwise, }   (11)

in which H_N(ε) ≜ H(1 − ε, ε/(N−1), ..., ε/(N−1)) and p_max = max_{B(X) ∈ B(X)} P(B(X)). For any rate R above this threshold, there exists a query scheme (code) that achieves the desired fidelity ε̂, and for any rate below the threshold no such scheme exists.

The proof is provided in Appendix A. Another interesting scenario is when the crowdsourcer attempts to estimate the worker skill levels from the data it has collected, as part of the inference. In case this estimation is done perfectly, the next theorem identifies the corresponding fundamental limit on the crowdsourcing rate. The proof is provided in Appendix B.
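The bound in (11) is easy to evaluate numerically. The sketch below (ours; the helper names are our own) also computes the analogous SL-CS denominator of Theorem 3 for comparison: since H_M is concave in ε, H_M(E(ε)) ≥ E(H_M(ε)), so the SL-UK rate requirement is never smaller than the SL-CS one.

```python
import math

def h_list(probs):
    """Entropy (bits) of a pmf given as a list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def h_n(eps, n):
    """H_N(eps) = H(1-eps, eps/(N-1), ..., eps/(N-1))."""
    return h_list([1 - eps] + [eps / (n - 1)] * (n - 1))

def r_min_sl_uk(p_source, eps_skills, p_skills, eps_hat, m):
    """Theorem 2: min queries/item, skill levels unknown (SL-UK)."""
    n = len(p_source)
    if eps_hat > min(1 - max(p_source), 1 - 1 / n):
        return 0.0
    mean_eps = sum(e * p for e, p in zip(eps_skills, p_skills))
    capacity = math.log2(m) - h_n(mean_eps, m)  # capacity of the *average worker*
    return (h_list(p_source) - h_n(eps_hat, n)) / capacity

def r_min_sl_cs(p_source, eps_skills, p_skills, eps_hat, m):
    """Theorem 3: skill levels known to the crowdsourcer (SL-CS)."""
    n = len(p_source)
    if eps_hat > min(1 - max(p_source), 1 - 1 / n):
        return 0.0
    mean_cap = math.log2(m) - sum(p * h_n(e, m)
                                  for e, p in zip(eps_skills, p_skills))
    return (h_list(p_source) - h_n(eps_hat, n)) / mean_cap

# Binary labels; workers are either quite good (eps=0.05) or noisy (eps=0.45).
args = ([0.5, 0.5], [0.05, 0.45], [0.5, 0.5], 0.01, 2)
print(r_min_sl_uk(*args), r_min_sl_cs(*args))
```

For this skill mix the SL-UK bound is roughly twice the SL-CS bound, quantifying the value of knowing who the good workers are.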
Theorem 3
In crowdsourcing for a large dataset of an N-ary discrete source B(X) ∼ P(B(X)) with Hamming distortion, when a large number of workers with skill levels ε ∈ E, ε ∼ P(ε), known to the crowdsourcer (SL-CS), from an MSC population participate, the minimum number of queries per item to obtain an overall error probability of at most ε̂ is given by

R_min = { (H(B(X)) − H_N(ε̂)) / (log M − E(H_M(ε))),   ε̂ ≤ min{1 − p_max, 1 − 1/N};   0,   otherwise. }   (12)

For any rate R above this threshold, there exists a query scheme (code) that achieves the desired fidelity ε̂, and for any rate below the threshold no such scheme exists.

Comparing the results in Theorems 2 and 3, the following interesting observation can be made: in case the worker skill levels are unknown, the CS system provides an overall work quality (capacity) of an average worker, whereas when the skill levels are known at the crowdsourcer, the system provides an overall work quality that pertains to the average of the work qualities of the workers.

4 k-ary Incidence Coding

In this Section, we examine the performance of the k-ary incidence coding introduced in Section 2.3. The k-ary incidence code poses a query as a set of k ≥ 2 items and asks the workers to identify those with the same label. In the sequel, we begin with deriving a lower bound on the performance of kIC with a spammer-hammer worker pool. We then present numerical results along with the information theoretic lower bounds presented in the previous Section.

kIC with SHC Worker Pool
We consider kIC for crowdsourcing in the following setting. The items X in the dataset are iid with N = 2. There is no feedback from the decoder to the task manager (encoder), i.e., the code is non-adaptive. Since the task manager has no knowledge of the workers' skill levels, it queries the workers at the same fixed rate of R queries per item. To compose a query, the items are drawn uniformly at random from the dataset. We assume that the workers are drawn from the SHC(q) model elaborated in Section 2.2. The purpose is to obtain a lower bound on the performance assuming an oracle decoder that can perfectly identify the workers' skill levels (here, a spammer or a hammer) and perform an optimal decoding. Specifically, we consider the following:

min_{C′, C : kIC} P(E : B̂(X) ≠ B(X)),   (13)

where the minimization is with respect to the choice of a decoder for a given kIC code. We note that the code length is governed by how the decoder operates and often could be as long as the dataset. As evident in (2), in the SHC model, the channel error rate (worker reliability) is explicitly influenced by the code and the parameter k. In the model of Figure 1a, this implies that a certain static feedforward exists in this setting. We first present a lemma, which is used later to establish Theorem 4 on kIC performance. The proofs are respectively provided in Appendix C and Appendix D.
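The oracle argument can be illustrated with a quick Monte Carlo sketch (ours). It simplifies the spammer behavior to a fair-coin guess on an item's binary label whenever no hammer processes the item; this is an assumption for illustration, not the exact kIC spammer response model of the lemma below:

```python
import random

def oracle_error_rate(q=0.3, queries_per_item=4, n_items=20000, seed=1):
    """Fraction of mislabeled items under an oracle decoder that trusts any
    hammer: an item errs only if all workers who see it are spammers AND
    the (simplified) spammer consensus guesses its binary label wrong."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_items):
        all_spammers = all(rng.random() > q for _ in range(queries_per_item))
        if all_spammers and rng.random() < 0.5:
            errors += 1
    return errors / n_items

# Matches (1 - q)^{R'} * 1/2 in expectation: here 0.7^4 / 2 ~= 0.12.
print(oracle_error_rate())
```

The empirical rate concentrates around (1 − q)^{R′} times the spammer error probability, which is exactly the structure of the bound in Theorem 4 (with the fair coin standing in for ε̄_S).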
Lemma 1
In crowdsourcing for binary labeling (N = 2) of a uniformly distributed dataset, with kIC and an SHC worker pool, the probability of error in labeling of an item by a spammer (C = S) is given by

ε̄_S = P(E : B̂(X) ≠ B(X) | C = S) = { (1/(k 2^{k−1})) Σ_{i=0}^{⌊(k−1)/2⌋} i C(k, i),   k odd;   (1/(k 2^{k−1})) [ Σ_{i=0}^{⌊(k−1)/2⌋} i C(k, i) + (k/4) C(k, k/2) ],   k even. }

Theorem 4
Assuming crowdsourcing using a non-adaptive kIC over a uniformly distributed dataset (k ≥ 2), if the number of queries per item, R, satisfies R < ln(ε̂ / ε̄_S) / (k ln(1 − q)), then no decoder can achieve an average probability of labeling error less than ε̂ for any L under the SHC(q) worker model.

To interpret and use the result in Theorem 4, we consider the following points: (i) The theorem presents a necessary condition, i.e., the minimum rate (budget) requirement identified here for kIC with a given fidelity is a lower bound. This is due to the fact that we are considering an oracle CS decoder that can perfectly identify the workers' skill levels and correctly label the item if the item is labeled by at least one hammer out of the R′ times it is processed by the workers. (ii) In the current setting, where the taskmaster does not know the workers' skill levels, each item is included in exactly R′ ∈ Z⁺ k-ary queries; due to the structure of the code, R′ = kR. (iii) As discussed in Appendix E, Theorem 4 can also be used to establish an approximate rule of thumb for pricing. Specifically, considering two query schemes k_1IC and k_2IC, the query price π is to be set such that π(k_1)/π(k_2) ≈ k_1/k_2.

To obtain an information theoretic benchmark, the next corollary specializes Theorem 3 to the setting of interest in this Section.
Corollary 1
In crowdsourcing for binary labeling of a uniformly distributed dataset with an SHC(q) worker pool known to the crowdsourcer (SL-CS), and with M choices in responding to a query, the minimum rate for any coding scheme to obtain a probability of error of at most ε̂ is

R_min = { (1 − H_b(ε̂)) / (q log M),   0 ≤ ε̂ ≤ 0.5;   0,   otherwise, }  queries per item.   (14)

Figure 2 shows the information theoretic limit of Corollary 1 and the bound obtained in Theorem 4. For rates (budgets) greater than the former bound, there exists a code which provides crowdsourcing with the desired fidelity; for rates below this bound no such code exists. The coding theoretic lower bounds for kIC depend on k, q and the fidelity, and improve as k and q grow. The kIC bound for k = 1 is equivalent to the analysis leading to Lemma 1 of [8].

Figure 2: kIC performance bound and the information theoretic limit

References

[1] I. Abraham, O. Alonso, V. Kandylas, and A. Slivkins. Adaptive crowdsourcing algorithms for the bandit survey problem. In , 2013.
[2] Audubon. History of the christmas bird count, 2015. URL http://birds.audubon.org.
[3] Caltech. Community seismic network project, 2016. URL http://csn.caltech.edu.
[4] T. M. Cover and J. A. Thomas.
Elements of Information Theory. Wiley, New Jersey, USA, 2006.
[5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20–28, 1979.
[6] O. Dekel and O. Shamir. Vox populi: Collecting high-quality labels from a crowd. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, June 2009.
[7] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge University Press, New York, USA, 2011.
[8] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 61(1):1–24, 2014.
[9] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. J. Machine Learn. Res., 99(1):1297–1322, 2010.
[10] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labeling of venus images. Adv. Neural Inform. Processing Systems, pages 1085–1092, 1995.
[11] R. Korlakai Vinayak, S. Oymak, and B. Hassibi. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, pages 2996–3004, 2014.
[12] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: optimal integration of labels from labelers of unknown expertise. Adv. Neural Inform. Processing Systems, 22(1):2035–2043, 2009.
[13] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, pages 1260–1268, 2014.

Appendices
A Proof of Theorem 2
The parallel MSC channels representing workers with random skill levels may be viewed as a channel with random state [7]. The proof relies on the optimality of separate source and channel coding [7] (in the sense of Shannon) in the large dataset size and worker pool regime, which holds for a discrete memoryless source and a discrete memoryless channel with random state. Here, we need a lossy source coder to bring the rate to $R(D(\hat{B}(X), B(X)) = \hat{\epsilon})$. If the source is memoryless, then we have

$$R(D(\hat{B}(X), B(X)) = \hat{\epsilon}) = \min I(B(X); \hat{B}(X)) = H(P(B(X))) - H(D = \hat{\epsilon}), \quad (15)$$

where $H$ denotes the entropy and the minimization is with respect to the choice of a mapping (source coding scheme) $P(\hat{B}(X) \mid B(X))$ (disregarding possible channel or worker errors). With Hamming distance, the second term amounts to $H_N(\hat{\epsilon})$, when $\hat{\epsilon} \leq \min\{1 - p_{\max}, 1 - \frac{1}{N}\}$. If $\hat{\epsilon} \geq 1 - p_{\max}$, then we can achieve the desired average distortion with rate zero and a decoder that simply outputs $\arg\max P(B(X))$. Similarly, if $\hat{\epsilon} \geq 1 - \frac{1}{N}$, then we can achieve the desired average distortion with rate zero and a decoder that outputs a $B(X)$ uniformly at random. We need the channel to be able to provide reliable communication at such a rate. In case the skill levels are unknown to both the taskmaster and the crowdsourcer, we have

$$nR(D = \hat{\epsilon}) \leq \sum_{i=1}^{W} m_i C_{\text{SL-UK}}, \quad (16)$$

and

$$C_{\text{SL-UK}} = \max_{P(u)} I(U; V) = \log M - H_M(E(\epsilon)), \quad (17)$$

where the maximum occurs when $P(u) = 1/M$ and $E(\cdot)$ denotes the expectation operator. Note that since the taskmaster does not know the skill levels, it can only communicate the queries to workers at an identical rate. Combining (15), (16) and (17), the proof is complete.
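The quantities in (15)-(17) are straightforward to evaluate numerically. The sketch below is our own illustration (function names are not from the paper) for the binary-label case: it computes the source rate $1 - H_b(\hat{\epsilon})$ and the capacity $C_{\text{SL-UK}} = \log_2 M - H_M(E(\epsilon))$, taking $H_M(\epsilon) = H_b(\epsilon) + \epsilon \log_2(M-1)$, the standard conditional entropy of an $M$-ary symmetric channel.

```python
import math


def entropy(probs):
    """Shannon entropy H(P) in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def h_m(eps, M):
    """M-ary symmetric-channel conditional entropy:
    H_M(eps) = H_b(eps) + eps * log2(M - 1)."""
    if eps <= 0:
        return 0.0
    return entropy([1 - eps, eps]) + eps * math.log2(M - 1)


def rate_distortion_binary(eps_hat):
    """R(D = eps_hat) = 1 - H_b(eps_hat) for a uniform binary source
    under Hamming distortion (Eq. (15) with N = 2)."""
    return max(0.0, 1.0 - entropy([eps_hat, 1 - eps_hat]))


def capacity_sl_uk(mean_eps, M):
    """C_{SL-UK} = log2(M) - H_M(E[eps]) per Eq. (17)."""
    return math.log2(M) - h_m(mean_eps, M)
```

For a perfectly skilled pool ($E(\epsilon) = 0$) the capacity is the full $\log_2 M$ bits per query, while the required source rate shrinks to zero as the distortion target $\hat{\epsilon}$ approaches $1/2$, matching the rate-zero regime discussed above.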
Note that the optimality of the separate source and channel coding scheme invoked here relies on proving an achievable scheme and a converse, which is the basis for the latter part of the theorem statement. This can be done in line with what is presented in Sections 3.9 and 7.4 of [7] and is omitted for brevity.

B Proof of Theorem 3
The proof follows the same steps as that of Theorem 2. However, the capacity when the skill levels are known at the crowdsourcer is given by

$$C_{\text{SL-CS}} = \max_{P(u)} I(U; V \mid \epsilon) = \log M - E(H_M(\epsilon)), \quad (18)$$

where the maximum occurs when $P(u) = 1/M$. Note again that since the taskmaster does not know the skill levels, it can only communicate the queries to workers at an identical rate.

C Proof of Lemma 1
The probability of error, when the query is posed to a spammer in an SHC, can be quantified as follows:

$$\begin{aligned} P(E \mid C = S) &= P(E : \hat{B}(X) \neq B(X) \mid C = S) \\ &= \sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} P(u, v \mid C = S) P(E \mid U = u, V = v, C = S) \\ &= \sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} P(u) P(v \mid u, C = S) P(E \mid U = u, V = v, C = S) \\ &= \frac{1}{M^2} \sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} P(E \mid U = u, V = v). \end{aligned} \quad (19)$$

The last equality follows in part because of the uniform distribution of the dataset ($P(B(X)) \sim$ uniform) and the fact that $|\mathcal{U}| = |\mathcal{V}| = M$.

Each $k$IC symbol can be represented by a vector of length $k$, whose elements are in the set $\{0, 1, \ldots, N-1\}$ (here $N = 2$). As such, for $k$IC with arbitrary $k \geq 2$ and $N = 2$, the number of valid responses to a query can be counted as

$$k' = \begin{cases} \sum_{i=0}^{\lfloor (k-1)/2 \rfloor} \binom{k}{i}, & k \text{ odd} \\ \sum_{i=0}^{\lfloor (k-1)/2 \rfloor} \binom{k}{i} + \frac{1}{2} \binom{k}{k/2}, & k \text{ even} \end{cases} \quad (20)$$

or alternatively $k' = 2^{k-1}$. This is obtained by noting that for $k \geq 2$, $k$IC does not recover specific labelings (e.g., in clustering it does put items in their associated bins but does not label the bins). In the same direction, each query of $k$IC that is posed to a worker is equivalent to the transmission of a symbol over a $k'$-ary discrete channel ($M = k'$), or alternatively a codeword of $k$ bits over an equivalent binary discrete channel. As such, any linear combination of two codewords over the corresponding field creates another codeword (up to labeling). As a result,

$$\sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} P(E \mid U = u, V = v) = \frac{k'}{k} \sum_{u' \in \mathcal{U}} w_H(u'),$$

where $w_H$ is the Hamming weight. Noting (20) and setting $M = k'$ in (19), the proof is complete.

D Proof of Theorem 4
Under the SHC model, we consider an oracle decoder that only makes a mistake on an item if the item is only assigned to spammers. Formally, the average error probability at the decoder, when in all $R'$ queries per item (codeword) only spammers are selected, is given by

$$\begin{aligned} P(E : \hat{B}(X) \neq B(X)) &= \sum_{C_1, \ldots, C_{R'}} P(E : \hat{B}(X) \neq B(X) \mid C_1, \ldots, C_{R'}) P(C_1, \ldots, C_{R'}) \\ &= P(E \mid C_1 = S, \ldots, C_{R'} = S) \times (1-q)^{R'} \\ &= P(E \mid C = S) \times (1-q)^{R'}. \end{aligned}$$

The last equality follows since, when the decoder observes $R'$ instances of spammers, it does not have any more information than when observing a single spammer. In $k$IC, the number of queries per item $X$, $R$, is given by $R = R'/k$. As such, we have the following result for an arbitrary decoder:

$$\hat{\epsilon} = P(E : \hat{B}(X) \neq B(X) \mid C = S) \times (1-q)^{kR}. \quad (21)$$

Using Lemma 1, the desired result is obtained. Note that, as evident in the lemma, $\bar{\epsilon}_S = P(E \mid C = S)$ is also a function of $k$.

E Pricing Strategy
Using Equation (21) in the proof of Theorem 4, one can draw some insights into the design of a CS system based on $k$IC. Below we consider this in the context of pricing queries. Specifically, in crowdsourcing with a $k$IC query scheme and a spammer-hammer worker distribution with hammer probability $q$, we consider two scenarios: (i) the case where the price of a query per item is fixed at $\pi$ and does not change with $k$, and (ii) the case where the query price is a function of $k$, $\pi(k)$. In the former case, one can readily verify that since $P(E \mid C = S)$ in Lemma 1 grows very slowly with $k$, any increase in $k$ directly translates to a smaller rate $R$, and hence a smaller crowdsourcing cost $\pi n R$, for a given crowdsourcing fidelity $\hat{\epsilon}$. In the latter case, the analysis sheds light on the appropriate pricing range. Specifically, for a given crowdsourcing budget and two query schemes $k_1$IC and $k_2$IC with rates $R_1$ and $R_2$, we have $\pi(k_1) R_1 = \pi(k_2) R_2$; and for a given $\hat{\epsilon}$, from (21) we have $k_1 R_1 \approx k_2 R_2$. This indicates that the prices in the two scenarios compare as $\pi(k_1)/\pi(k_2) \approx k_1/k_2$. This sets a threshold (range) for a crowdsourcing pricing strategy: if we are to use $k_2$IC as opposed to $k_1$IC ($k_2 > k_1$), then we should pay at most $\pi(k_2) \lesssim \pi(k_1) \, k_2/k_1$. Note that due to the said nature of $P(E \mid C = S)$, this approximation is more accurate for larger values of $k_1$ and $k_2$.
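As a minimal numerical sketch of the pricing rule above (our own illustration; the function names and values are hypothetical), the following combines the equal-fidelity constraint $k_1 R_1 \approx k_2 R_2$ implied by (21) with the equal-budget constraint $\pi(k_1) R_1 = \pi(k_2) R_2$:

```python
def oracle_error(eps_bar_s, q, k, R):
    """Eq. (21): average error of the oracle decoder,
    eps_hat = eps_bar_s * (1 - q)^(k * R), under SHC(q)."""
    return eps_bar_s * (1.0 - q) ** (k * R)


def price_threshold(pi_k1, R1, k1, k2):
    """Given equal fidelity (k1 * R1 = k2 * R2, ignoring the slow growth
    of eps_bar_s with k) and equal budget (pi(k1)*R1 = pi(k2)*R2),
    return the new rate R2 and the maximum admissible price pi(k2)."""
    R2 = k1 * R1 / k2            # larger k -> fewer queries per item
    pi_k2 = pi_k1 * R1 / R2      # equals pi_k1 * k2 / k1
    return R2, pi_k2
```

For instance, moving from $k_1 = 2$ to $k_2 = 4$ halves the required rate under the same fidelity (the exponent $kR$ in (21) is unchanged), so a 4IC query may be priced up to roughly twice a 2IC query for the same overall budget.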