Equal Affection or Random Selection: the Quality of Subjective Feedback from a Group Perspective
Jiale Chen, Yuqing Kong, Yuxuan Lu
The Center on Frontiers of Computing Studies, Computer Science Dept., Peking University
{jiale_chen, yuqing.kong, yx_lu}@pku.edu.cn
Abstract
In the setting where a group of agents is asked a single subjective multi-choice question (e.g. which one do you prefer, cat or dog?), we are interested in evaluating the quality of the collected feedback. However, the collected statistics are not sufficient to reflect how informative the feedback is, since fully informative feedback (equal affection for the choices) and fully uninformative feedback (random selection) produce the same uniform statistics.

Here we distinguish the above two scenarios by additionally asking for respondents' predictions about others' choices. We assume that informative respondents' predictions strongly depend on their own choices while uninformative respondents' do not. With this assumption, we propose a new definition of uninformative feedback and correspondingly design a family of evaluation metrics, called f-variety, for group-level feedback which can 1) distinguish informative feedback from uninformative feedback (separation) even if their statistics are both uniform and 2) decrease as the ratio of uninformative respondents increases (monotonicity). We validate our approach both theoretically and numerically. Moreover, we conduct two real-world case studies, about 1) comparisons of athletes and 2) comparisons of stand-up comedians, to show the superiority of our approach.

1 Introduction

Many areas need subjective data collected by survey methods. For example, voting pollsters need to elicit subjective opinions from potential voters [12, 2]. Product companies want to elicit purchase intentions from potential customers [11]. Meanwhile, it is always a concern that the feedback quality may not be guaranteed due to the respondents' lack of expertise or effort [11]. In such cases, researchers have developed multiple approaches (e.g. attention tests, different non-flat payment schemes) [8, 13, 14, 10, 5] to encourage high-quality subjective feedback.
However, despite the existence of these elicitation approaches, there is no systematic way to evaluate the quality of subjective feedback such that we can compare those elicitation approaches in practice.

A key challenge here is that we cannot verify each individual's answer since it is subjective. From a group perspective, the collected statistics are not sufficient to reflect how informative they are. For example, given a multi-choice question (e.g. Which one do you prefer, Panda Express or Chick-fil-A?), unbalanced statistics (e.g. 80% Chick-fil-A, 20% Panda Express) are informative, since uninformative respondents' statistics should be uniform (50%/50% for a binary choice). However, the opposite is not true. That is, uniform statistics may not be uninformative. Consider the following example.

• Which one do you prefer? dog or cat?
• Which one do you prefer? realism or liberalism?

When we ask people the above two questions, we may receive uniform statistics (50%/50%) for both of them. However, it is possible that in the first question the respondents fully understand the question and half of them prefer cats, while in the second question the respondents do not understand the question and randomly select one choice. To distinguish "equal affection" from "random selection", we need 1) to collect additional information besides the choices and 2) a more refined concept of uninformativeness.

To address this problem, we additionally probe the respondents' predictions about other people by asking them "What percentage of people prefer dogs?". With this additional query, we can compare the dog lovers' predictions with the cat lovers' predictions (and likewise for realism/liberalism "lovers"). Since people are usually attracted to belief systems that are consistent with their preferences, we can boldly assume that dog lovers have very different predictions about other people's preferences than cat lovers.
However, if respondents understand the meaning of neither realism nor liberalism, they will not form strong opinions about other people's preferences.

Inspired by this, we utilize the additional statistics about people's predictions and propose a more refined concept of uninformativeness by adding a condition that describes the relationship between a respondent's choice and prediction. In our new definition, a respondent's feedback is uninformative if and only if

• Uniform choice: she picks the choice uniformly at random;
•
Independence: her choice and prediction are independent.

With the above definition, "random selection" is still uninformative while "equal affection" is not. Moreover, we show that this new definition satisfies two natural properties:

•
Stability: a mixed group of uninformative feedback is still uninformative;
•
Additive property: a mixed group of uninformative feedback and informative feedback is informative.

We also provide a corresponding family of non-negative evaluation metrics, f-variety, such that with our new definition, f-variety

• Separation: separates informative and uninformative feedback by assigning approximately zero value only to uninformative feedback;
•
Monotonicity: decreases as the ratio of uninformative feedback increases.

f-variety is defined as the f-divergence between the joint distribution over choice-prediction pairs and the corresponding uninformative one, which has uniform choices and the same marginal distribution over predictions. Intuitively, f-variety represents the amount of information contained in group-level feedback. To give a taste of f-variety, we take a special f-divergence, total variation distance (Tvd), as an example and visualize the corresponding Tvd-variety in Figure 1.
Figure 1:
Tvd-variety in binary case:
In the binary case where the choices are {+, −}, we draw the joint distribution over choice-prediction pairs. Specifically, the blue region is the distribution of "−" people's predictions about what percentage of people will choose "+", multiplied by the ratio of "−" people. The area of the blue region is thus the ratio of "−" people. We plot the red region analogously. The area of the shaded region, which is the difference between the blue region and the red region, is proportional to the Tvd-variety. Tvd-variety is always non-negative. The above figure shows an example of "equal affection". In the "random selection" case, since choices are uniform and predictions are independent of choices, the blue region will be the same as the red region. This leads to a zero Tvd-variety.

In addition to theoretical validation (Section 2), we perform multiple numerical experiments (Section 3) to validate the robustness of f-variety when we only have access to samples rather than the joint distribution over choice-prediction pairs. We also perform two real-world case studies about comparisons of athletes and comparisons of stand-up comedians (Section 4). For evaluation, we also collect side information about our respondents as a reference (e.g. we ask about their knowledge of these two topics in advance). We compare f-variety with a baseline, which only measures the uniformity of the aggregated choice statistics. The results show that, compared to the baseline, f-variety is more consistent with the reference, which shows the superiority of f-variety. In situations where we cannot obtain high-quality side information (e.g. polls, surveying purchase intention, comparing different payment schemes for subjective surveys), we can use f-variety as an evaluation metric.

1.1 Related work

We use the choice-prediction framework for data collection, i.e., we ask for respondents' choices and their predictions of others' choices.
The choice-prediction framework has applications in different fields. Firstly, it provides more accurate data. Psychological research suggests that peer-predictions (predictions about others' behavior) are a more accurate predictor of individuals' future behavior than self-predictions (predictions about oneself) [3]. Research on political voting has also shown that predictions of others' choices can achieve higher accuracy in election predictions [12, 2]. We consider a different problem and focus on evaluating the collected data without any ground truth.

Secondly, the choice-prediction framework is used to elicit truthful opinions in information elicitation. Bayesian Truth Serum (BTS) combines respondents' choices and predictions and then creates incentives for truthfulness when eliciting answers to subjective questions [8, 13], and it is directly used in solving crowd wisdom questions [9]. There are also further works that adopt the same framework while avoiding BTS's assumption of infinitely many participants [14, 10, 5]. These works all focus on designing truthful incentives for individuals and assume that people who have the same choices also have the same predictions, i.e., the common prior assumption. In contrast to the above works, we focus on designing an evaluation metric after collecting feedback from a group of people and do not need the strong common prior assumption.

Thirdly, the choice-prediction framework can also be used to measure the expertise of individuals [11], which is closely related to our work. Radas and Prelec [11] assume that people with more accurate predictions are more informative and use this assumption to measure the expertise of each individual. This assumption may not be valid when experts are not familiar with the other people who answer the question. Our work measures the expertise of a group of people and does not need such an assumption.

Our work uses f-divergence as an important ingredient in designing the new metric, f-variety.
Previous works have used f-divergence to measure the amount of information in individuals' answers to subjective questions [6, 7]. These works aim to elicit individuals' truthful opinions, while our work uses f-divergence to measure group-level informativeness, which leads to a totally different metric.

In this section, we formally introduce our model and state our definition of an uninformative distribution. Given the definition, we propose a family of metrics, f-variety, to measure the amount of information contained in distributions. We provide a theoretical validation for both our definition and our metrics.

A group of agents is asked to answer a multi-choice question (e.g. which one do you prefer, realism or liberalism?) and also predict other people's choices (e.g. what percentage of people prefer realism?). Given the question, we assume that each agent receives a pair of choice and prediction (c, p) drawn independently from a distribution D. We do not assume that all agents are homogeneous. That is, Alice's choice-prediction pair can have a different distribution from Bob's. For a group of agents, we care about the average distribution over their choice-prediction pairs.

Non-experts who have no clue about the question's meaning will pick a choice uniformly at random. We could thus define a distribution with uniform choices as an uninformative distribution. However, as in our motivating example, experts' feedback can also be uniform (e.g. 50% of experts prefer realism). Hence, we refine the previous definition by additionally requiring that a non-expert's choice is independent of her prediction. Formally, we require that every non-expert's choice-prediction pair is drawn from an uninformative distribution, which is defined as follows:

Definition 2.1 (Uninformative Dist U ⊗ P). A distribution D over choice and prediction C, P is uninformative if and only if:

• Uniform choice: the marginal distribution of the choice is uniform, i.e.
Pr_D[C = c] = 1/N_C, where N_C is the number of choices;
• Independence: choice and prediction are independent, i.e. Pr_D[C = c, P = p] = Pr_D[C = c] Pr_D[P = p].

Given random variables X, Y, we use X ⊗ Y to represent the independent joint distribution which is the product of X's and Y's distributions. We use U to denote a random choice whose distribution is uniform. An uninformative distribution D can therefore be represented by U ⊗ P, where P's distribution is D's marginal distribution over the predictions.

We use four different distributions in Figure 2 to explain our definition of uninformative distribution. The definition U ⊗ P is not only consistent with our intuition but also has multiple desirable natural properties. The first property, stability, is that a mixed group of non-experts is still uninformative. In the previous example, mixing a group of respondents whose average distribution is dist 3 with another group whose average distribution is dist 4 will not make them informative. The second property, the additive property, means that adding experts to a group of non-experts makes the whole group informative. The initial definition, which defines a distribution with uniform choices as an uninformative distribution, satisfies both stability and the additive property naturally. We show that our refined definition still satisfies the two properties while allowing a more refined concept of non-experts.

Proposition 2.2 (Properties of U ⊗ P).
• Stability (0 + 0 = 0): the average distribution over a mixed group of non-experts' choice-prediction pairs is uninformative;
• Additive property (0 + ≠0 = ≠0): the average distribution over a mixed group of experts' and non-experts' choice-prediction pairs is informative.

Proof.
Given two groups of non-experts whose average distributions are U ⊗ P_1 and U ⊗ P_2 respectively, the mixed average distribution will be α U ⊗ P_1 + (1 − α) U ⊗ P_2 = U ⊗ (α P_1 + (1 − α) P_2), since α Pr[U = u] Pr[P_1 = p] + (1 − α) Pr[U = u] Pr[P_2 = p] = Pr[U = u] (α Pr[P_1 = p] + (1 − α) Pr[P_2 = p]). Thus, the mixed average distribution is still uninformative.

Given a group of non-experts U ⊗ P_1 and a group of experts CP, if the average distribution over the experts has a non-uniform marginal distribution over the choices, then the mixed version must have a non-uniform marginal distribution over the choices as well, and thus be informative. Therefore, we only need to consider the situation where the average distribution

[Figure 2 panels: (a) Dist 1 (informative): non-uniform choices, independent choice-prediction pairs; (b) Dist 2 (informative): uniform choices, dependent choice-prediction pairs; (c) Dist 3 (uninformative): uniform choices, independent choice-prediction pairs; (d) Dist 4 (uninformative): uniform choices, independent choice-prediction pairs.]
Figure 2:
Examples of (un)informative distributions in binary case:
In distribution 1, choices and predictions are independent and the marginal distribution over choices is non-uniform, thus the blue region and the red region are proportional. In distribution 2, predictions depend on choices and choices are uniform. In distributions 3 and 4, predictions and choices are independent and choices are uniform, thus the blue region and the red region are the same. Under our definition, distributions 1 and 2 are informative while distributions 3 and 4 are uninformative.

over the experts is
UP, i.e., uniform over choices but with possibly dependent predictions. In this case, we prove the result by contradiction. Assume that the mixed version has an uninformative average distribution. Then there exist a random variable P_mix and α > 0 such that α Pr[U = u] Pr[P_1 = p] + (1 − α) Pr[U = u, P = p] = Pr[U = u] Pr[P_mix = p], where Pr[P_mix = p] = α Pr[P_1 = p] + (1 − α) Pr[P = p]. This implies that Pr[U = u, P = p] = Pr[U = u] Pr[P = p], which contradicts the fact that UP is informative, i.e., not equal to U ⊗ P.

Given the definition of uninformative distribution, it is natural to ask for a metric of the informativeness of a distribution. At a high level, this metric should always be non-negative, assign zero value to uninformative distributions, and assign strictly positive value to informative distributions. Moreover, we want the metric to satisfy information-monotonicity as well: mixing non-experts into a group of experts should decrease the amount of information contained in the group's feedback.

We propose the following metric family, f-variety, which satisfies all the desired properties. The idea is to measure the amount of information contained in a distribution D by measuring its "distance" to a corresponding uninformative distribution. To measure the "distance", we use f-divergence D_f : ∆_Σ × ∆_Σ → R, a non-symmetric measure of the difference between distributions p, q ∈ ∆_Σ, defined as

D_f(p, q) = ∑_{σ ∈ Σ} p(σ) f(q(σ)/p(σ)),

where f(⋅) is a convex function with f(1) = 0. Two commonly used f-divergences are the KL divergence D_KL(p, q) = ∑_σ p(σ) log(p(σ)/q(σ)), obtained by choosing f(x) = − log(x), and the total variation distance D_tvd(p, q) = (1/2) ∑_σ |p(σ) − q(σ)|, obtained by choosing f(x) = (1/2)|x − 1|.

Definition 2.3 (f-variety).
For any distribution D over choice and prediction, we define the f-variety of D as

V_f(D) := D_f(CP, U ⊗ P),

where CP represents the distribution D and U ⊗ P represents the uninformative distribution which has the same marginal distribution over predictions as D.

f-variety vs. f-mutual information. The definition of f-variety is very similar to the definition of f-mutual information, D_f(CP, C ⊗ P). In the concept of mutual information, the uninformative joint distribution is the distribution over two independent random variables. Thus, mutual information measures the information of a joint distribution CP by measuring the distance between CP and C ⊗ P. A natural question is whether, in our setting, we can extend the definition of uninformative distribution to C ⊗ P and use the f-mutual information between the choice and the prediction to measure informativeness. The answer is no, since C ⊗ P does not satisfy the stability property 0 + 0 = 0. For example, both dist 1 and dist 4 in Figure 2 have independent choice and prediction; however, a mixture of them does not. Without the stability property, monotonicity can never be satisfied, since adding non-experts could then increase informativeness.

We introduce a special f-variety, Tvd-variety. This special measure has a nice visualization in the binary case (see Figure 1).

Example 2.4 (Tvd-variety). Given D, we use the vector q to represent the marginal distribution over predictions, and the vector q_c to represent the distribution over predictions conditioned on the agent receiving choice c. Then

V_tvd(D) = D_tvd(CP, U ⊗ P) = (1/2) ∑_{c,p} |Pr[C = c, P = p] − (1/N_C) Pr[P = p]| = (1/2) ∑_c |Pr[C = c] q_c − (1/N_C) q|_1,

where N_C is the number of choices. In the binary case,

V_tvd(D) = (1/2) ∑_{c ∈ {+,−}} |Pr[C = c] q_c − (1/2)(Pr[C = +] q_+ + Pr[C = −] q_−)|_1 = (1/2) |Pr[C = +] q_+ − Pr[C = −] q_−|_1.

Thus, in the binary case, Tvd-variety is half of the area of the symmetric difference of the red and blue regions (see Figure 1).
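As a sanity check of the computation above, Tvd-variety of a discretized joint distribution takes only a few lines of code. The sketch below (Python with NumPy; the function name and the two toy distributions are our own illustrative choices, not from the paper) builds the uninformative counterpart U ⊗ P, which shares the prediction marginal but has uniform, independent choices, and takes half the L1 distance:

```python
import numpy as np

def tvd_variety(joint):
    """Tvd-variety of a joint distribution over (choice, prediction).

    joint: array of shape (num_choices, num_prediction_bins) summing to 1.
    The uninformative counterpart keeps the prediction marginal but makes
    choices uniform and independent of the predictions.
    """
    joint = np.asarray(joint, dtype=float)
    n_c = joint.shape[0]
    pred_marginal = joint.sum(axis=0)                 # marginal over predictions
    uninformative = np.tile(pred_marginal / n_c, (n_c, 1))
    return 0.5 * np.abs(joint - uninformative).sum()

# "Random selection": uniform choices, prediction independent of choice.
random_selection = np.outer([0.5, 0.5], [0.2, 0.6, 0.2])
# "Equal affection": uniform choices, but predictions depend on the choice.
equal_affection = 0.5 * np.array([[0.6, 0.3, 0.1],
                                  [0.1, 0.3, 0.6]])

print(tvd_variety(random_selection))  # 0.0
print(tvd_variety(equal_affection))   # ≈ 0.25
```

Both toy distributions have exactly uniform choice statistics, yet only the second scores positive, which is precisely the separation between "equal affection" and "random selection".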
Here we formally state and prove the properties of the general f-variety.

Theorem 2.5 (Properties of f-variety). f-variety V_f satisfies:

• Separation (V_f(0) = 0, V_f(≠0) > 0): for any D, V_f(D) ≥ 0, and for any uninformative D_0, V_f(D_0) = 0;
• Monotonicity (V_f(x + 0) < V_f(x)): for any D and any uninformative D_0, ∀ 0 < α < 1, V_f((1 − α) D + α D_0) ≤ (1 − α) V_f(D).

Proof.
The separation property follows directly from the definitions of f-variety and uninformative distribution. To prove monotonicity, we use the joint convexity of f-divergence.

Lemma 2.6 (Joint Convexity [1]). For any 0 ≤ λ ≤ 1 and any p_1, p_2, q_1, q_2 ∈ ∆_Σ,

D_f(λ p_1 + (1 − λ) p_2, λ q_1 + (1 − λ) q_2) ≤ λ D_f(p_1, q_1) + (1 − λ) D_f(p_2, q_2).

With the above lemma,

V_f((1 − α) D + α D_0) = D_f((1 − α) CP + α U ⊗ P_0, U ⊗ ((1 − α) P + α P_0)) = D_f((1 − α) CP + α U ⊗ P_0, (1 − α) U ⊗ P + α U ⊗ P_0) ≤ (1 − α) D_f(CP, U ⊗ P) + α D_f(U ⊗ P_0, U ⊗ P_0) = (1 − α) V_f(D).

The above theorem implies that if we can estimate the average distribution of a group of agents perfectly and use it to calculate f-variety, then f-variety separates experts and non-experts and satisfies information-monotonicity perfectly: non-experts' f-variety will be zero, experts' f-variety will be a positive number, and adding non-experts to an existing group will decrease the f-variety.

However, we cannot obtain a perfect estimate of the distribution in practice, since we only have a finite number of samples. In practice, when we ask for the additional prediction, we provide the respondents 11 discrete options {0%, 10%, ..., 100%}, use the empirical histogram to estimate the distribution, and calculate f-variety from it. We provide several numerical experiments to show the robustness of this empirical estimation method in Section 3.

Figure 3:
Empirical histogram:
The subgraph on the left is the true underlying distribution. In practice, we use histograms over finitely many prediction options to estimate the joint distribution, as shown on the right.
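In code, the empirical-histogram estimate is straightforward. The sketch below (Python/NumPy; the 11 options and binary choices follow the setup described above, but the function name and the simulated "random selection" data are our own illustrative assumptions) bins reported predictions from {0%, 10%, ..., 100%} and evaluates the empirical Tvd-variety. It also illustrates the finite-sample effect: even purely uninformative samples give a small positive value, which is why separation holds only approximately in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_tvd_variety(choices, predictions, n_choices=2, n_bins=11):
    """Estimate Tvd-variety from raw survey answers.

    choices: array of choice indices in {0, ..., n_choices - 1}.
    predictions: array of reported options in {0, 10, ..., 100},
    mapped to histogram bins 0..10.
    """
    joint = np.zeros((n_choices, n_bins))
    for c, p in zip(choices, predictions):
        joint[int(c), int(p) // 10] += 1
    joint /= joint.sum()
    # Uninformative counterpart: uniform choices, same prediction marginal.
    uninformative = np.tile(joint.sum(axis=0) / n_choices, (n_choices, 1))
    return 0.5 * np.abs(joint - uninformative).sum()

# "Random selection": choices and predictions drawn independently.
n = 1000
choices = rng.integers(0, 2, size=n)
predictions = 10 * rng.integers(0, 11, size=n)
print(empirical_tvd_variety(choices, predictions))  # small but positive
```

With 1000 samples the value is close to, but not exactly, zero; the experiments in Section 3 quantify how this sampling noise shrinks as the sample size grows.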
In this section, we generate multiple choice-prediction pairs of experts/non-experts and mix them at different ratios. Ideally, the f-variety will decrease as the ratio of non-experts increases (monotonicity) and vanish when there are only non-experts (separation). We test the robustness of the empirical distribution's f-variety by checking whether it satisfies the monotonicity and separation properties.

To generate the synthetic data, we first determine the underlying distributions of the experts and non-experts. We then generate the choice-prediction pairs according to the underlying distributions for experts and non-experts, for different sample sizes (e.g. 100, 200, 500, 1000). We conduct multiple numerical experiments with different underlying distributions of the experts.

We test the empirical Tvd-variety by performing 4 groups of experiments and use the Beta distribution [4] to model the underlying conditional distribution over the predictions. In all cases, we choose Beta(2,2) as the distribution over non-experts' predictions, while we use different distributions over experts' choice-prediction pairs in different groups. We show the results in Figure 4: the empirical Tvd-variety approximately decreases with the ratio of non-experts and becomes closer to the true Tvd-variety as the sample size grows. (We also test Pearson-variety and Hellinger-variety; the results are similar and are shown in the Appendix. The error bars in the figures show the standard deviation.)

We perform two real-world case studies. In each study, we pick a topic and design an online survey about this topic. Each survey consists of multiple subjective questions of the format "Which one do you prefer? X or Y? What percentage of people will choose X?". The (X, Y) pair represents comparable athletes, stand-up comedians, or other concepts. The orders
[Line charts: Tvd-variety vs. ratio of uninformative participants, showing the theoretical value and several sample sizes n.]

(a) Uniform-1: For experts, Pr[C = +] = Pr[C = −] = 0.5; q_+ is Beta(8,3) and q_− is Beta(4,5).
(b) Non-uniform-1: For experts, the choice marginal is non-uniform; q_+ is Beta(8,3) and q_− is Beta(4,5).
(c) Uniform-2: For experts, Pr[C = +] = Pr[C = −] = 0.5; q_+ is Beta(6,6) and q_− is Beta(2,3).
(d) Non-uniform-2: For experts, the choice marginal is non-uniform; q_+ is Beta(6,6) and q_− is Beta(2,3).

Figure 4: Tvd-variety vs. ratio of non-experts: In all cases, we choose Beta(2,3) as the distribution over non-experts' predictions. In the first column, (a) and (c), the choices of experts are uniform, while in the second column, (b) and (d), they are non-uniform. In the first row, (a) and (b), q_+ is Beta(8,3) and q_− is Beta(4,5), while in the second row, (c) and (d), q_+ is Beta(6,6) and q_− is Beta(2,3). In each figure, the charts below show the joint distribution at different ratios of non-experts. The symmetric difference vanishes when the ratio is one, i.e., when there are only non-experts. The line charts show that Tvd-variety decreases with the ratio of non-experts, which verifies the monotonicity. As the number of samples increases, Tvd-variety almost vanishes when there are only non-experts, which verifies the separation property. As the sample size increases, the empirical value becomes closer to the theoretical value (black line), which is calculated from perfect information about the joint distribution.
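The synthetic experiment can be reproduced in miniature. The sketch below (Python/NumPy) follows our reading of panel (a): experts have uniform choices with Beta(8,3) predictions given "+" and Beta(4,5) given "−"; non-experts have uniform choices with predictions drawn from Beta(2,2) (per the text; the figure caption mentions Beta(2,3)) independently of the choice. The exact mixing mechanics, sample size, and function names here are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_group(n, ratio_noise):
    """Sample n choice-prediction pairs from a mixed group (binary case)."""
    choices = rng.integers(0, 2, size=n)      # uniform choices for everyone
    noisy = rng.random(n) < ratio_noise       # which respondents are non-experts
    expert_preds = np.where(choices == 0,
                            rng.beta(8, 3, size=n),   # prediction given "+"
                            rng.beta(4, 5, size=n))   # prediction given "-"
    preds = np.where(noisy, rng.beta(2, 2, size=n), expert_preds)
    return choices, preds

def tvd_variety_from_samples(choices, preds, n_bins=11):
    """Empirical Tvd-variety with predictions binned into n_bins options."""
    joint = np.zeros((2, n_bins))
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    for c, b in zip(choices, bins):
        joint[c, b] += 1
    joint /= joint.sum()
    uninf = np.tile(joint.sum(axis=0) / 2, (2, 1))
    return 0.5 * np.abs(joint - uninf).sum()

# Tvd-variety should shrink as the non-expert ratio grows (monotonicity)
# and be near zero when everyone is a non-expert (separation).
for ratio in (0.0, 0.5, 1.0):
    c, p = sample_group(5000, ratio)
    print(ratio, round(tvd_variety_from_samples(c, p), 3))
```

Note that all three mixtures have uniform choice statistics, so the baseline of Section 4 would score them identically; only the choice-prediction dependence separates them.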
of the options are randomly shuffled. For the prediction question, we provide 11 prediction options {0%, 10%, ⋯, 100%}. We conduct the survey on an online survey platform and our respondents are recruited by the platform. Each survey contains an attention test and rewards respondents who pass the test with a flat participation fee. In total, we ask 15 questions. Each question is answered by more than 600 respondents on average and we pay $0.5 for each answer sheet.

Evaluation
We choose Tvd-variety for the analysis of the studies. We evaluate Tvd-variety from two aspects:
•
Cross respondents:
We divide the respondents into two groups through side questions (e.g. do you watch sports frequently?). It is commonly assumed that one group is more familiar with the questions than the other. We can evaluate an informativeness metric by checking whether the high-expertise group has a higher metric value.
•
Cross questions:
We divide the questions into easy and hard categories in advance. We can evaluate an informativeness metric by checking whether the easy questions have higher metric values than the hard ones.

We also compare Tvd-variety with a baseline metric which measures the degree of unbalance of the choice statistics. Our case studies focus on binary cases, in which the baseline metric is defined as
Baseline := |q_+ − 1/2|, where q_+ is the fraction of respondents choosing "+". More uniform statistics have a lower baseline score.
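For concreteness, the baseline reduces to a one-liner. The sketch below (Python; the function name and sample inputs are ours) also shows why it cannot separate "equal affection" from "random selection": any 50/50 split scores zero.

```python
def baseline(choices):
    """Unbalance of the choice statistics: |q_plus - 1/2|, where q_plus
    is the fraction of respondents choosing "+"."""
    q_plus = sum(1 for c in choices if c == "+") / len(choices)
    return abs(q_plus - 0.5)

print(baseline(["+", "+", "+", "-"]))  # 0.25 (unbalanced statistics)
print(baseline(["+", "-", "+", "-"]))  # 0.0: uniform statistics score zero,
                                       # whether informative or not
```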
We conducted a study about preferences for athletes. We asked 7 questions and there were 656 respondents, of whom 306 were men and 350 were women. Additionally, we asked respondents whether they often watch sports: 215 respondents reported that they often watch sports and the others reported that they do not.

Here we give a sample question. All 7 questions have the same format; we attach the contents of these questions in the appendix.

• Which soccer player do you prefer? Andrés Iniesta or Luka Modric?
• What percentage of people do you think prefer Andrés Iniesta?

Figure 5 shows the comparison between Tvd-variety and the baseline. Comparing the group that often watches sports with the group that does not, the baseline correctly suggests that people who often watch sports are more informative for 4 of the questions, but gives the opposite result for the other 3 questions. In contrast, among all 7 questions, Tvd-variety correctly suggests that people who often watch sports are more informative for 6 questions. (Each attention test has the following form: "There are n red balls and m blue balls with the same shape in a box. One is randomly selected. What percentage do you think is the probability of a red/blue ball?")
Often watching sports v.s. not:
The questions ask about different sports, so we use the sports category (e.g. Basketball(M) means men's basketball and Basketball(F) means women's basketball) to represent the questions. The red bars represent the metrics of the group of people who often watch sports, while the green bars represent the metrics of the group of people who do not. The y-axis is 100 times the metric.
We conducted a study about preferences for stand-up comedians. We asked 8 questions, 4 of which compare native stand-up comedians (type native) and 4 of which compare foreign stand-up comedians (type foreign). There were 632 respondents, of whom 262 were men and 370 were women. Additionally, we asked respondents how frequently they watch native/foreign stand-up comedy. The results are shown in Table 1.

                           often  sometimes  occasionally  almost never
  native stand-up comedy     240        252           123            17
  foreign stand-up comedy     21        169           248           194

Table 1:
Frequency of watching stand-up comedy
Here we give a sample question. All 8 questions have the same format; we attach the contents of these questions in the appendix.

• Which stand-up comedian do you prefer? Ronny Chieng or Jimmy O. Yang?
• What percentage of people do you think prefer Ronny Chieng?
Cross respondents
For each type, we divide the respondents into two groups. The familiar group for type native is defined as the group of respondents who report that they often or sometimes watch native stand-up comedy. The other respondents are defined as the unfamiliar group. We define the familiar and unfamiliar groups for the foreign type analogously. Figure 6 shows that for the native type, Tvd-variety correctly suggests that the familiar group has a higher score for all 4 questions, while the baseline fails on three questions. Figure 7 shows that for the foreign type, both Tvd-variety and the baseline correctly suggest that the familiar group has a higher score for three questions and fail on one question.
Familiar with native stand-up comedy v.s. unfamiliar:
There are four questions that compare two native stand-up comedians. The red bars represent the metrics of the group of people who are familiar with native stand-up comedy, while the green bars represent the metrics of the group who are unfamiliar with it. The y-axis is 100 times the metric.
Cross questions
We pick the group of respondents who often or sometimes watch native stand-up comedy but occasionally or almost never watch foreign stand-up comedy. The size of this group is 321. For this group, the comparisons of foreign comedians are much more difficult. We compute the Tvd-variety and baseline scores of this group of respondents. Figure 8 shows that Tvd-variety successfully separates easy questions (comparisons between native comedians) from hard questions (comparisons between foreign comedians), while the baseline assigns an easy question (native3) a lower score than a hard question (foreign3).

Our results validate the advantage of our metric compared to the baseline metric. To check the robustness of the results, we also reduce the effect of group size by sampling (without replacement) the same number of respondents from each group for each comparison. The results are consistent; we show them in the appendix. We also perform additional comparisons between male and female respondents for all case studies and defer the results to the appendix.
Figure 7:
Familiar with foreign stand-up comedy v.s. unfamiliar:
There are four questions that compare two foreign stand-up comedians. The red bar represents the metrics of the group of people who are familiar with foreign stand-up comedy, while the green bar represents the metrics of the group who are unfamiliar with foreign stand-up comedy. The y-axis shows the metrics multiplied by 100.
Our work focuses on measuring the informativeness of a group of people on subjective questions. By additionally asking for respondents' predictions about other people's choices, we provide a refined definition of uninformative feedback. For the new definition, we propose a new family of informativeness metrics, f-variety, for a group of people's feedback. f-variety separates informative and uninformative feedback and decreases as the ratio of uninformative feedback increases. We validate our metric both theoretically and empirically.

Our method provides only group-level measurements. A future direction is to separate experts and non-experts in a mixed group with additional assumptions. Another future direction is to theoretically explore the effect of different convex functions used for f-variety and further define an optimization goal and optimize over the convex functions.

Our experimental setting focuses on binary choices, while our theory is applicable in the non-binary case. However, in practice, respondents require additional effort to provide a prediction over non-binary choices. In the future, we can design a more practical approach in the non-binary setting. For example, one potential solution is to ask each respondent for a prediction for a single randomly chosen choice and combine the predictions afterward. Moreover, we use flat payment in our experiments. In the future, we can consider incentives and compare the subjective feedback collected under different payment schemes.
Figure 8:
Easy and hard questions:
We divide the questions into two types. Questions of type native (the left) only compare native comedians, and questions of type foreign (the right) only compare foreign comedians. We pick a group of respondents who are familiar with native stand-up comedy but not familiar with foreign stand-up comedy, so that questions of type native are easier for them compared to questions of type foreign. We compute both the Tvd-variety and baseline score of their feedback for all questions. The green lines show that Tvd-variety successfully separates easy and hard questions (easy questions are above the line and hard questions are below the line) while the baseline does not.
References

[1] Imre Csiszár, Paul C. Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.

[2] Mirta Galesic, W. Bruine de Bruin, Marion Dumas, A. Kapteyn, J. E. Darling, and E. Meijer. Asking about social circles improves election predictions. Nature Human Behaviour, 2(3):187–193, 2018.

[3] Erik G. Helzer and David Dunning. Why and when peer prediction is superior to self-prediction: The weight given to future aspiration versus past achievement. Journal of Personality and Social Psychology, 103(1):38, 2012.

[4] Norman L. Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous Univariate Distributions, Volume 2, volume 289. John Wiley & Sons, 1995.

[5] Yuqing Kong and Grant Schoenebeck. Equilibrium selection in information elicitation without verification via information monotonicity. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), pages 13:1–13:20, 2018. doi: 10.4230/LIPIcs.ITCS.2018.13. URL https://doi.org/10.4230/LIPIcs.ITCS.2018.13.

[6] Yuqing Kong and Grant Schoenebeck. Water from two rocks: Maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 177–194, 2018.

[7] Yuqing Kong and Grant Schoenebeck. An information theoretic framework for designing information elicitation mechanisms that reward truth-telling. ACM Transactions on Economics and Computation (TEAC), 7(1):1–33, 2019.

[8] Dražen Prelec. A Bayesian truth serum for subjective data. Science, 306(5695):462–466, 2004.

[9] Dražen Prelec, H. Sebastian Seung, and John McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532–535, 2017.

[10] Goran Radanovic and Boi Faltings. Incentives for truthful information elicitation of continuous signals. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

[11] Sonja Radas and Drazen Prelec. Whose data can we trust: How meta-predictions can be used to uncover credible respondents in survey data. PLoS ONE, 14(12):e0225432, 2019.

[12] David Rothschild and Justin Wolfers. Forecasting elections: Voter intentions versus expectations. Available at SSRN 1884644, 2011.

[13] Ray Weaver and Drazen Prelec. Creating truth-telling incentives with the Bayesian truth serum. Journal of Marketing Research, 50(3):289–302, 2013.

[14] Jens Witkowski and David Parkes. A robust Bayesian truth serum for small populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, 2012.
More comparisons
In this section, we show the comparison between the male group and the female group in our case studies. For the athlete comparisons, among all respondents, 306 are male and 350 are female. Figure 9 shows the comparison results. For the stand-up comedian comparisons, among all respondents, 262 are male and 370 are female. Figure 10 shows the comparison results.
Figure 9:
Athletes study: male v.s. female:
We also compare the male group and the female group of respondents in the study of comparisons between athletes.
Figure 10:
Comedians study: male v.s. female:
We also compare the male group and the female group of respondents in the study of comparisons between stand-up comedians.

Pearson-variety & Hellinger-variety for numerical experiments
In this section, we additionally evaluate two special members of the f-variety family, Pearson-variety and Hellinger-variety, by numerical experiments. Pearson-variety uses the Pearson divergence D_pearson(p, q) = ∑_{σ∈Σ} (p(σ) − q(σ))² / q(σ). Hellinger-variety uses the squared Hellinger distance D_hellinger(p, q) = ∑_{σ∈Σ} (√p(σ) − √q(σ))².
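As a minimal, self-contained sketch, the underlying divergences can be computed for discrete distributions as follows. The function names are ours, not the paper's, and we assume Tvd-variety uses the standard total variation distance; the f-variety metric itself combines such a divergence with respondents' predictions as defined in the main text.

```python
from math import sqrt

def tvd(p, q):
    """Total variation distance: 0.5 * sum |p - q| (assumed form for Tvd-variety)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def pearson_divergence(p, q):
    """Pearson divergence: sum (p - q)^2 / q (used by Pearson-variety)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def squared_hellinger(p, q):
    """Squared Hellinger distance: sum (sqrt(p) - sqrt(q))^2 (used by Hellinger-variety)."""
    return sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))

# Example: two prediction distributions over a binary choice.
p, q = [0.7, 0.3], [0.5, 0.5]
print(round(tvd(p, q), 4))                 # 0.2
print(round(pearson_divergence(p, q), 4))  # 0.16
print(round(squared_hellinger(p, q), 4))   # 0.0422
```

All three are f-divergences, so each corresponds to a choice of convex function f in the f-variety family.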
Figure 11:
Pearson-variety v.s. ratio of non-experts:
We adopt the same setting as Figure 4 and observe similar results for Pearson-variety. Each panel plots Pearson-variety against the ratio of uninformative participants for several group sizes n, together with the theoretical curve; panels (a)–(d) correspond to the Uniform-1, Non-Uniform-1, Uniform-2, and Non-Uniform-2 settings.
Figure 12:
Hellinger-variety v.s. ratio of non-experts:
We adopt the same setting as Figure 4 and observe similar results for Hellinger-variety. Each panel plots Hellinger-variety against the ratio of uninformative participants for several group sizes n, together with the theoretical curve; panels (a)–(d) correspond to the Uniform-1, Non-Uniform-1, Uniform-2, and Non-Uniform-2 settings.

Robustness-check for group size
In this section, for each comparison between two groups of respondents, to reduce the effect of group size, we sample the same number of respondents without replacement from the larger group and then compare the two equal-size groups by Tvd-variety and the baseline. For example, if group A has 300 respondents and group B has 200 respondents, then we sample 200 respondents from group A without replacement and compare group B with the subset of group A. We compute the error bar, i.e., the standard deviation, by repeating the sampling process. The following figures show the results. Due to our sampling process, only one side has an error bar. The results still show that, compared to the baseline, Tvd-variety is more consistent with the reference.
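The equal-size sampling with repeated draws described above can be sketched as follows. The groups and the metric here are toy placeholders, not the paper's actual respondent data or Tvd-variety.

```python
import random
from statistics import mean, stdev

def equalized_metric(larger, smaller, metric, repeats=100, seed=0):
    """Repeatedly subsample the larger group (without replacement) down to the
    size of the smaller group, evaluate the metric on each subsample, and
    return the mean and standard deviation across repeats (the error bar)."""
    rng = random.Random(seed)
    scores = [metric(rng.sample(larger, len(smaller))) for _ in range(repeats)]
    return mean(scores), stdev(scores)

# Toy example: the "metric" is just the average response, for illustration only.
group_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # 10 respondents
group_b = [1, 0, 0, 1, 0]                  # 5 respondents
avg, err = equalized_metric(group_a, group_b, metric=lambda g: sum(g) / len(g))
score_b = sum(group_b) / len(group_b)      # the smaller group needs no sampling
print(round(avg, 3), round(err, 3), score_b)
```

Because only the larger group is subsampled, only its score carries an error bar, matching the one-sided error bars in the figures.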
Figure 13:
Often watching sports v.s. not
Figure 14:
Familiar with native stand-up comedy v.s. unfamiliar
Figure 15:
Familiar with foreign stand-up comedy v.s. unfamiliar

Contents of surveys
In this section, we list the questions used in our case studies. For each respondent, the order of the options is randomly shuffled.
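Per-respondent option shuffling of the kind described above can be sketched as follows; the function name is only illustrative, and the question is taken from the survey below.

```python
import random

def present_question(question, options, rng):
    """Return the option order shown to one respondent (shuffled per respondent)."""
    shuffled = options[:]   # copy so the canonical option order is preserved
    rng.shuffle(shuffled)
    return question, shuffled

rng = random.Random(42)     # in a real survey, each respondent gets a fresh draw
q, opts = present_question(
    "Which of the following two soccer players do you prefer?",
    ["Andrés Iniesta", "Luka Modrić"],
    rng,
)
print(opts)  # a per-respondent permutation of the two options
```

Shuffling per respondent removes any systematic order effect on the collected choices.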
D.1 Survey for athletes
1. What is your gender?
   (a) Female (b) Male
2. Do you often watch sports?
   (a) I often watch sports (b) I do not often watch sports
3. Which of the following two basketball players do you prefer?
   (a) Zhenlin Zhang (b) Songwei Zhu
4. What percentage of people do you think prefer Zhenlin Zhang?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
5. Which of the following two soccer players do you prefer?
   (a) Andrés Iniesta (b) Luka Modrić
6. What percentage of people do you think prefer Andrés Iniesta?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
7. There are 8 red balls and 12 blue balls with the same shape in a box. One is randomly selected. What do you think is the probability (as a percentage) that it is a blue ball?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
8. Which of the following two basketball players do you prefer?
   (a) Nan Chen (b) Lijie Miao
9. What percentage of people do you think prefer Nan Chen?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
10. Which of the following two snooker players do you prefer?
   (a) Judd Trump (b) John Higgins
11. What percentage of people do you think prefer Judd Trump?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
12. Which of the following two Formula One drivers do you prefer?
   (a) Sebastian Vettel (b) Lewis Hamilton
13. What percentage of people do you think prefer Sebastian Vettel?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
14. Which of the following two volleyball players do you prefer?
   (a) Ruirui Zhao (b) Yimei Wang
15. What percentage of people do you think prefer Ruirui Zhao?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100
16. Which of the following two ping-pong players do you prefer?
   (a) Jingkun Liang (b) Chuqin Wang
17. What percentage of people do you think prefer Jingkun Liang?
   (a) 0 (b) 10 (c) 20 (d) 30 (e) 40 (f) 50 (g) 60 (h) 70 (i) 80 (j) 90 (k) 100