Optimal Number of Choices in Rating Contexts
Sam Ganzfried ([email protected]) and Farzana Yusuf (fyusu003@fiu.edu), Florida International University
Abstract
In many settings people must give numerical scores to entities from a small discrete set. For instance, rating physical attractiveness from 1–5 on dating sites, or papers from 1–10 for conference reviewing. We study the problem of understanding when using a different number of options is optimal. We consider the case when scores are uniform random and Gaussian. We study computationally when using 2, 3, 4, 5, and 10 options out of a total of 100 is optimal in these models (though our theoretical analysis is for a more general setting with k choices from n total options as well as a continuous underlying space). One may expect that using more options would always improve performance in this model, but we show that this is not necessarily the case, and that using fewer choices—even just two—can surprisingly be optimal in certain situations. While in theory for this setting it would be optimal to use all 100 options, in practice this is prohibitive, and it is preferable to utilize a smaller number of options due to humans’ limited computational resources. Our results could have many potential applications, as settings requiring entities to be ranked by humans are ubiquitous. There could also be applications to other fields such as signal or image processing where input values from a large set must be mapped to output values in a smaller set.

Introduction

Humans rate items or entities in many important settings. For example, users of dating websites and mobile applications rate other users’ physical attractiveness, teachers rate scholarly work of students, and reviewers rate the quality of academic conference submissions. In these settings, the users assign a numerical (integral) score to each item from a small discrete set. However, the number of options in this set can vary significantly between applications, and even within different instantiations of the same application. For instance, for rating attractiveness, three popular sites all use a different number of options.
On “Hot or Not,” users rate the attractiveness of photographs submitted voluntarily by other users on a scale of 1–10 (Figure 1). These scores are aggregated and the average is assigned as the overall “score” for a photograph. On the dating website OkCupid, users rate other users on a scale of 1–5 (if a user rates another user 4 or 5 then the rated user receives a notification) (Figure 2). And on the mobile application Tinder, users “swipe right” (green heart) or “swipe left” (red X) to express interest in other users (two users are allowed to message each other if they mutually swipe right), which is essentially equivalent to using a binary {0, 1} scale (Figure 3). Education is another important application area requiring human ratings. For the 2016 International Joint Conference on Artificial Intelligence, reviewers assigned a “Summary Rating” score from -5–5 (equivalent to 1–10) for each submitted paper (Figure 4). The papers are then discussed and scores are aggregated to produce an acceptance or rejection decision based on the average of the scores.

[Footnote: The likelihood of receiving an initial message is actually much more highly correlated with the variance—and particularly the number of “5” ratings—than with the average rating [10].]
[Sources: http://blog.mrmeyer.com/2007/are-you-hot-or-not/, http://blog.okcupid.com/index.php/the-mathematics-of-beauty/, https://tctechcrunch2011.files.wordpress.com/2015/11/tinder-two.jpg, https://easychair.org/conferences/?conf=ijcai16]

Figure 1: Hot or Not users rate attractiveness 1–10.
Figure 2: OkCupid users rate attractiveness 1–5.
Figure 3: Tinder users rate attractiveness 1–2.
Figure 4: IJCAI reviewers rate papers -5–5.

Despite the importance and ubiquity of the problem, there has been little fundamental research done on the problem of determining the optimal number of options to allow in such settings. We study a model in which users have an underlying integral ground truth score for each item in {1, ..., n} and are required to submit an integral rating in {1, ..., k}, for k << n. (For ease of presentation we use the equivalent formulation {0, ..., n−1}, {0, ..., k−1}.) We use two generative models for the ground truth scores: a uniform random model in which the fraction of scores for each value from 0 to n−1 is chosen uniformly at random (by choosing a random value for each and then normalizing), and a model where scores are chosen according to a Gaussian distribution with a given mean and variance. We then compute a “compressed” score distribution by mapping each full score s from {0, ..., n−1} to {0, ..., k−1} by applying

    s \leftarrow \left\lfloor \frac{s}{n/k} \right\rfloor    (1)

We then compute the average “compressed” score a_k, and compute its error e_k according to

    e_k = \left| a_f - \frac{n-1}{k-1} \cdot a_k \right|    (2)

where a_f is the ground truth average. The goal is to pick argmin_k e_k (in our simulations we also consider a metric of the frequency at which each value of k produces the lowest error over all the items that are rated). While there are many possible generative models and cost functions, these seem to be the most natural, and we plan to study alternative choices in future work.

We derive a closed-form expression for e_k that depends on only a small number (k) of parameters of the underlying distribution, for an arbitrary distribution. This allows us to exactly characterize the performance of using each number of choices. In simulations we repeatedly compute e_k and compare the average values. We focus on n = 100 and k = 2, 3, 4, 5, 10, which we believe are the most natural and interesting choices for initial study.

One could argue that this model is somewhat “trivial” in the sense that it would be optimal to set k = n to permit all the possible scores, as this would result in the “compressed” scores agreeing exactly with the full scores.
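To make the model concrete, the floor compression of Equation 1 and the error of Equation 2 can be computed as follows (a minimal Python sketch of our own; the function names are illustrative, not from the paper):

```python
import math

def compress_floor(s, n, k):
    """Map a full score s in {0, ..., n-1} to {0, ..., k-1} via Equation 1."""
    return math.floor(s / (n / k))

def error_floor(full_scores, n, k):
    """Error e_k of Equation 2: |a_f - (n-1)/(k-1) * a_k|."""
    a_f = sum(full_scores) / len(full_scores)
    compressed = [compress_floor(s, n, k) for s in full_scores]
    a_k = sum(compressed) / len(compressed)
    return abs(a_f - (n - 1) / (k - 1) * a_k)

# Scores clustered at 30 and 60 on the 0-99 scale (cf. the example analyzed later).
scores = [30] * 50 + [60] * 50
print(error_floor(scores, 100, 2))  # -> 4.5
print(error_floor(scores, 100, 3))  # -> 20.25
```

Even in this tiny example, the two-option compression incurs less error than the three-option one.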
However, there are several reasons that would lead us to prefer to select k << n in practice (as all of the examples previously described have done), thus making this analysis worthwhile. It is much easier for a human to assign a score from a small set than from a large set, particularly when rating many items under time constraints. We could have included an additional term in the cost function e_k that explicitly penalizes larger values of k, which would have a significant effect on the optimal value of k (providing a favoritism for smaller values). However, the selection of this function would be somewhat arbitrary and would make the model more complex, and we leave this for future study. Given that we do not include such a penalty term, one may expect that increasing k will always decrease e_k in our setting. While the simulations show a clear negative relationship, we show that smaller values of k actually lead to smaller e_k surprisingly often. These smaller values would receive further preference with a penalty term.

One line of related theoretical research that also has applications to the education domain studies the impact of using finely grained numerical grades (100, 99, 98) vs. coarse letter grades (A, B, C) [7]. They conclude that if students care primarily about their rank relative to the other students, they are often best motivated to work by assigning them coarse categories rather than exact numerical scores. In a setting of “disparate” student abilities they show that the optimal absolute grading scheme is always coarse. Their model is game-theoretic; each player (student) selects an effort level, seeking to optimize a utility function that depends on both the relative score and effort level. Their setting is quite different from ours in many ways. For one, they study a setting where it is assumed that the underlying “ground truth” score is known, yet may be disguised for strategic reasons.
In our setting the goal is to approximate the ground truth score as closely as possible.

While we are not aware of prior theoretical study of our exact problem, there have been experimental studies on the optimal number of options on a “Likert scale” [17, 19, 26, 6, 9]. The general conclusion is that “the optimal number of scale categories is content specific and a function of the conditions of measurement” [11]. There has been study of whether including a “mid-point” option (i.e., the middle choice from an odd number) is beneficial. One experiment demonstrated that the use of the mid-point category decreases as the number of choices increases: 20% of respondents chose the mid-point for 3 and 5 options, while only 7% did for larger numbers of options [20]. They conclude that it is preferable to either not include a mid-point at all or use a large number of options. Subsequent experiments demonstrated that eliminating a mid-point can reduce social desirability bias, which results from respondents’ desires to please the interviewer or not give a perceived socially unacceptable answer [11]. There has also been significant research on questionnaire design and the concept of “feeling thermometers,” particularly from the fields of psychology and sociology [25, 21, 16, 4, 18, 23]. One study concludes from experimental data: “in the measurement of [...]”

[Footnote: For theoretical simplicity we study a continuous version where scores are chosen according to a distribution over (0, n) (though the simulations are for the discrete version) and the compressed scores are over {0, ..., k−1}. In this setting we use a normalization factor of n/(k−1) instead of (n−1)/(k−1) for the e_k term. Continuous approximations for large discrete spaces have been studied in other settings; for instance, they have led to simplified analysis and insight in poker games with continuous distributions of private information [2].]
In one closely related study, ratings over {1, ..., 5} are mapped into a binary “thumbs up”/“thumbs down” (analogously to the swipe right/left example for Tinder above) [5]. Generally users mapped original ratings of 1 and 2 to “thumbs down” and original ratings of 3, 4, and 5 to “thumbs up,” which can be viewed as being similar to the floor compression procedure described above. We consider a more generalized setting where ratings over {1, ..., n} are mapped down to a smaller space (which could be binary but may have more options). We also consider a rounding compression technique in addition to the flooring compression.

Some prior work has presented an approach for mapping continuous prediction scores to ordinal preferences with heterogeneous thresholds that is also applicable to mapping continuous-valued ‘true preference’ scores [15]. We note that our setting can apply straightforwardly to provide continuous-to-ordinal mapping in the same way as it performs ordinal-to-ordinal mapping. (In fact, for our theoretical analysis and for the Jester dataset we study, our mapping is continuous-to-ordinal.) An alternative model assumes that users compare items with pairwise comparisons which form a weak ordering, meaning that some items are given the same “mental rating,” while in our setting the ratings would be much more likely to be unique in the fine-grained space of ground-truth scores [3, 13]. In comparison to prior work, the main takeaway from our work is the closed-form expression for simple natural models, and the new simulation results which show precisely for the first time how often each number of choices is optimal using several metrics (number of times it produces lowest error and the lowest average error).
We include experiments on datasets from several domains for completeness, though as prior work has shown, results can vary significantly between datasets, and further research from psychology and social science is needed to make more accurate predictions of how humans actually behave in practice. We note that our results could also have impact outside of human user systems, for example on the problems of “quantization” and data compression in signal processing.

Theoretical Characterization

Suppose scores are given by a continuous pdf f (with cdf F) on (0, 100), and we wish to compress them to two options, {0, 1}. Scores below 50 are mapped to 0, and scores above 50 are mapped to 1. The average of the full distribution is

    a_f = E[X] = \int_0^{100} x f(x) \, dx.

The average of the compressed version is

    a_2 = \int_0^{50} 0 \cdot f(x) \, dx + \int_{50}^{100} 1 \cdot f(x) \, dx = 1 - F(50)

    e_2 = |a_f - 100(1 - F(50))| = |E[X] - 100 + 100 F(50)|.

For three options,

    a_3 = \int_0^{100/3} 0 \cdot f(x) \, dx + \int_{100/3}^{200/3} 1 \cdot f(x) \, dx + \int_{200/3}^{100} 2 \cdot f(x) \, dx = 2 - F(100/3) - F(200/3)

    e_3 = |a_f - 50(2 - F(100/3) - F(200/3))| = |E[X] - 100 + 50 F(100/3) + 50 F(200/3)|.

In general for n total and k compressed options,

    a_k = \sum_{i=0}^{k-1} \int_{ni/k}^{n(i+1)/k} i f(x) \, dx = \sum_{i=0}^{k-1} i \left[ F\left(\frac{n(i+1)}{k}\right) - F\left(\frac{ni}{k}\right) \right] = (k-1) F(n) - \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right) = (k-1) - \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right)

    e_k = \left| a_f - \frac{n}{k-1} \left( (k-1) - \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right) \right) \right| = \left| E[X] - n + \frac{n}{k-1} \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right) \right|    (3)

Equation 3 allows us to characterize the relative performance of choices of k for a given distribution f. For each k it requires knowing only k statistics of f (the k−1 values of F(ni/k) plus E[X]). In practice these could likely be closely approximated from historical data for small k values (though prior work has pointed out that there may be some challenges in closely approximating the cdf values of the ratings from historical data, due to the historical data not being sampled at random from the true rating distribution [22]). As an example, we see that e_2 < e_3 iff

    |E[X] - 100 + 100 F(50)| < |E[X] - 100 + 50 F(100/3) + 50 F(200/3)|.

Consider a full distribution that has half its mass right around 30 and half its mass right around 60 (Figure 5). Then a_f = E[X] = 0.5 · 30 + 0.5 · 60 = 45. If we use k = 2, then the mass at 30 will be mapped down to 0 (since 30 < 50) and the mass at 60 will be mapped up to 1 (since 60 > 50) (Figure 6). So a_2 = 0.5 · 0 + 0.5 · 1 = 0.5. Using the normalization n/(k−1) = 100, e_2 = |45 − 100 · 0.5| = |45 − 50| = 5. If we use k = 3, then the mass at 30 will also be mapped down to 0 (since 30 < 100/3); but the mass at 60 will be mapped to 1 (not the maximum possible value of 2 in this case), since 100/3 < 60 < 200/3 (Figure 6). So again a_3 = 0.5 · 0 + 0.5 · 1 = 0.5, but now using the normalization n/(k−1) = 50 we have e_3 = |45 − 50 · 0.5| = |45 − 25| = 20. So, surprisingly, in this example allowing more ranking choices actually significantly increases error.

Figure 5: Example distribution for which compressing with k = 2 produces lower error than k = 3.

Figure 6: Compressed distributions using k = 2 and k = 3 for the example from Figure 5.

If we happened to be in the case where both normalized compressed averages satisfy 100 a_2 ≤ a_f and 50 a_3 ≤ a_f, then we could remove the absolute values and reduce the expression to see that e_2 < e_3 iff

    \int_{100/3}^{50} f(x) \, dx < \int_{50}^{200/3} f(x) \, dx.

One could perform more comprehensive analysis considering all cases to obtain better characterization and intuition for the optimal value of k for distributions with different properties.

An alternative model we could have considered is to use rounding to produce the compressed scores, as opposed to using the floor function from Equation 1. For instance, for the case n = 100, k = 2, instead of dividing s by 50 and taking the floor, we could instead partition the points according to whether they are closest to t_1 = 25 or t_2 = 75. In the example above, the mass at 30 would be mapped to t_1 and the mass at 60 would be mapped to t_2. This would produce a compressed average score of a_2 = 0.5 · 25 + 0.5 · 75 = 50. No normalization would be necessary, and this would produce an error of e_2 = |a_f − a_2| = |45 − 50| = 5, as the floor approach did as well. Similarly, for k = 3 the region midpoints will be q_1 = 100/6, q_2 = 50, q_3 = 500/6. The mass at 30 will be mapped to q_1 = 100/6 and the mass at 60 will be mapped to q_2 = 50. This produces a compressed average score of a_3 = 0.5 · (100/6) + 0.5 · 50 = 100/3, and an error of e_3 = |a_f − a_3| = |45 − 100/3| = 35/3 ≈ 11.67. Although the error for k = 3 is smaller than in the floor case, it is still significantly larger than for k = 2, and using two options still outperforms using three for the example in this new model.

In general, this approach would create k “midpoints” {m_{k,i}}: m_{k,i} = n(2i+1)/(2k) for i = 0, ..., k−1. For k = 2 we have

    a_2 = \int_0^{50} 25 f(x) \, dx + \int_{50}^{100} 75 f(x) \, dx = 75 - 50 F(50)

    e_2 = |a_f - (75 - 50 F(50))| = |E[X] - 75 + 50 F(50)|.

One might wonder whether the floor approach would ever outperform the rounding approach (in the example above the rounding approach produced lower error for k = 3 and the same error for k = 2). As a simple example to see that it can, consider the distribution with all its mass on 0. The floor approach would produce a_2 = 0, giving an error of 0, while the rounding approach would produce a_2 = 25, giving an error of 25. Thus, the superiority of the approach is dependent on the distribution. We explore this further in the experiments.

For three options,

    a_3 = \int_0^{100/3} \frac{100}{6} f(x) \, dx + \int_{100/3}^{200/3} 50 f(x) \, dx + \int_{200/3}^{100} \frac{500}{6} f(x) \, dx = \frac{500}{6} - \frac{100}{3} F(100/3) - \frac{100}{3} F(200/3)

    e_3 = \left| E[X] - \frac{500}{6} + \frac{100}{3} F(100/3) + \frac{100}{3} F(200/3) \right|.

For general n and k, analysis as above yields

    a_k = \sum_{i=0}^{k-1} \int_{ni/k}^{n(i+1)/k} m_{k,i} f(x) \, dx = \frac{n(2k-1)}{2k} - \frac{n}{k} \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right)

    e_k = \left| a_f - \left[ \frac{n(2k-1)}{2k} - \frac{n}{k} \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right) \right] \right|    (4)

        = \left| E[X] - \frac{n(2k-1)}{2k} + \frac{n}{k} \sum_{i=1}^{k-1} F\left(\frac{ni}{k}\right) \right|    (5)

As for the floor model, e_k requires knowing only k statistics of f. The rounding model has an advantage over the floor model in that there is no need to convert scores between different scales and perform normalization. One drawback is that it requires knowing n (the expression for m_{k,i} depends on n), while the floor model does not. In our experiments we assume n = 100, but in practice it may not be clear what the agents’ ground truth granularity is, and it may be easier to just deal with scores from 1 to k.
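Equations 3 and 5 are easy to evaluate numerically given E[X] and the k−1 cdf values. The following sketch (our own illustration, using the paper's n = 100 setting) reproduces the two-point example above for both compressions:

```python
def e_floor(EX, F, n, k):
    """Equation 3: floor-compression error from E[X] and the cdf F."""
    return abs(EX - n + n / (k - 1) * sum(F(n * i / k) for i in range(1, k)))

def e_round(EX, F, n, k):
    """Equation 5: rounding-compression error, midpoints m_ki = n(2i+1)/(2k)."""
    return abs(EX - n * (2 * k - 1) / (2 * k) + n / k * sum(F(n * i / k) for i in range(1, k)))

# Step cdf for the example: half the mass at 30, half at 60.
F = lambda x: 0.5 * (x >= 30) + 0.5 * (x >= 60)
EX = 45.0
print(e_floor(EX, F, 100, 2), e_floor(EX, F, 100, 3))  # -> 5.0 and 20.0
print(e_round(EX, F, 100, 2), e_round(EX, F, 100, 3))  # -> 5.0 and 35/3 (about 11.67)
```

As the closed forms predict, rounding shrinks the k = 3 error from 20 to about 11.67 in this example, but k = 2 still wins.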
Furthermore, it may seem unnatural to essentially ask people to rate items as “100/6, 50, 500/6” rather than “1, 2, 3” (though the conversion between the score and m_{k,i} could be done behind the scenes, essentially circumventing this potential practical complication). One can generalize both the floor and rounding models by using a score of s(n,k)_i for the i’th region. For the floor setting we set s(n,k)_i = i, and for the rounding setting s(n,k)_i = m_{k,i} = n(2i+1)/(2k).

Simulations

The above analysis leads to the immediate question of whether the example for which e_2 < e_3 was a fluke, or whether using fewer choices can actually reduce error under reasonable assumptions on the generative model. We study this question using simulations with what we believe are the two most natural models. While we have studied the continuous setting where the full set of options is continuous over (0, n) and the compressed set is discrete {0, ..., k−1}, we now consider the perhaps more realistic setting where the full set is the discrete set {0, ..., n−1} and the compressed set is the same (though it should be noted that the two settings are likely quite similar qualitatively).

The first generative model we consider is a uniform model in which the values of the pmf for each of the n possible values are chosen independently and uniformly at random. The second is a Gaussian model in which the values are generated according to a normal distribution with specified mean µ and standard deviation σ (values below 0 are set to 0, and values above n−1 are set to n−1).

Algorithm 1
Procedure for generating full scores in uniform model

Inputs: number of scores n
    scoreSum ← 0
    for i = 0 : n do
        r ← random(0,1)
        scores[i] ← r
        scoreSum ← scoreSum + r
    for i = 0 : n do
        scores[i] ← scores[i] / scoreSum

Algorithm 2
Procedure for generating scores in Gaussian model
Inputs: number of scores n, number of samples s, mean µ, standard deviation σ
    for i = 0 : s do
        r ← randomGaussian(µ, σ)
        if r < 0 then r ← 0
        else if r > n − 1 then r ← n − 1
        ++scores[round(r)]
    for i = 0 : n do
        scores[i] ← scores[i] / s

For our simulations we used n = 100, and considered k = 2, 3, 4, 5, 10, which are popular and natural values. For the Gaussian model we used s = 1000, µ = 50, and a fixed σ. For each set of simulations we computed the errors for all considered values of k for m = 100,000 “items” (each corresponding to a different distribution generated according to the specified model). The main quantities we are interested in computing are the number of times that each value of k produces the lowest error over the m items, and the average value of the errors over all items for each k value.

The simulation procedure is specified in Algorithm 3. Note that this procedure could take as input any generative model M (not just the two we considered), as well as the parameters for the model, which we designate as ρ. It takes a set {k_1, ..., k_C} of C different compressed scores, and returns the number of times that each one produces the lowest error over the m items. Note that we can easily also compute other quantities of interest with this procedure, such as the average value of the errors, which we also report in some of the experiments (though we note that certain quantities could be overly dependent on the specific parameter values chosen).

Algorithm 3
Simulation procedure

Inputs: generative model M, parameters ρ, number of items m, number of total scores n, set of compressed scores {k_1, ..., k_C}
    scores[][] ← array of dimension m × n
    averages[] ← array of dimension m
    for i = 0 : m do
        scores[i] ← M(n, ρ)
        averages[i] ← average(scores[i])
    scoresCompressed[][] ← array of dimension m × C
    averagesCompressed[][] ← array of dimension m × C
    for i = 0 : C do
        for j = 0 : m do
            scoresCompressed[j][i] ← Compress(scores[j], k_i)
            averagesCompressed[j][i] ← average(scoresCompressed[j][i])
    numVictories[] ← array of dimension C, initialized to 0
    for j = 0 : m do
        min ← MAX-VALUE
        minIndex ← −1
        for i = 0 : C do
            e_i ← | averages[j] − ((n−1)/(k_i−1)) · averagesCompressed[j][i] |
            if e_i < min then
                min ← e_i
                minIndex ← i
        ++numVictories[minIndex]
    return numVictories

In the first set of experiments, we compared performance between using k = 2, 3, 4, 5, 10 to see for how many of the trials each value of k produced the minimal error (Table 1). Not surprisingly, we see that the number of victories increases monotonically with the value of k, while the average error decreased monotonically (recall that we would have zero error if we set k = 100). However, what is perhaps surprising is that using a smaller number of compressed scores produced the optimal error in a far from negligible number of the trials. For the uniform model, using 10 scores minimized error only around 53% of the time, while using 5 scores minimized error 17% of the time, and even using 2 scores minimized it 5.6% of the time. The results were similar for the Gaussian model, though a bit more in favor of larger values of k, which is what we would expect because the Gaussian model is less likely to generate “fluke” distributions that could favor the smaller values.

Table 1: Number of times each value of k in {2, 3, 4, 5, 10} produces minimal error and average error values, over 100,000 items generated according to both models.

We next explored the number of victories between just k = 2 and k = 3, with results in Table 2. Again we observed that using a larger value of k generally reduces error, as expected. However, we find it extremely surprising that using k = 2 produces a lower error 37% of the time. As before, the larger k value performs relatively better in the Gaussian model. We also looked at results for the most extreme comparison, k = 2 vs. k = 10 (Table 3). Using 2 scores outperformed 10 8.3% of the time in the uniform setting, which was larger than we expected. In Figures 7–8, we present a distribution for which k = 2 particularly outperformed k = 10. The full distribution has mean 54.188, while the k = 2 compression has mean 0.548 (54.253 after normalization) and the k = 10 compression has mean 5.009 (55.009 after normalization). The normalized errors between the means were 0.906 for k = 10 and 0.048 for k = 2, yielding a difference of 0.859 in favor of k = 2.

Table 2: Results for k = 2 vs. 3.
                                2       3
Uniform number of victories     36805   63195
Uniform average error           1.31    0.86
Gaussian number of victories    30454   69546
Gaussian average error          1.13    0.58

Table 3: Results for k = 2 vs. 10.
                                2       10
Uniform number of victories     8253    91747
Uniform average error           1.32    0.19
Gaussian number of victories    4369    95631
Gaussian average error          1.13    0.10

Figure 7: Example distribution where compressing with k = 2 produces significantly lower error than k = 10. The full distribution has mean 54.188, while the k = 2 compression has mean 0.548 (54.253 after normalization) and the k = 10 compression has mean 5.009 (55.009 after normalization). The normalized errors between the means were 0.906 for k = 10 and 0.048 for k = 2, yielding a difference of 0.859 in favor of k = 2.

We next repeated the extreme k = 2 vs. 10 comparison, but we imposed a restriction that the k = 10 option could not give a score below 3 or above 6 (Table 4). (If it selected a score below 3 then we set it to 3, and if above 6 we set it to 6.) For some settings, for instance paper reviewing, extreme scores are very uncommon, and we strongly suspect that the vast majority of scores are in this middle range. Some possible explanations are that reviewers who give extreme scores may be required to put in additional work to justify their scores and are more likely to be involved in arguments with other reviewers (or with the authors in the rebuttal). Reviewers could also experience higher regret or embarrassment for being “wrong” and possibly off-base in the review by missing an important nuance.
In this setting, using k = 2 outperforms k = 10 nearly a third of the time in the uniform model.

Figure 8: Compressed distribution for k = 2 vs. 10 for the example from Figure 7.

We also considered the situation where we restricted the k = 10 scores to fall between 3 and 7 (as opposed to 3 and 6). Note that the possible scores range from 0–9, so this restriction is asymmetric in that the lowest three possible scores are eliminated while only the highest two are. This is motivated by the intuition that raters may be less inclined to give extremely low scores which may hurt the feelings of an author (for the case of paper reviewing). In this setting, which is seemingly quite similar to the 3–6 setting, k = 2 produced lower error 93% of the time in the uniform model!

Table 4: Number of times each value of k in {2, 10} produces minimal error and average error values, over 100,000 items generated according to both models. For k = 10, we only permitted scores between 3 and 6 (inclusive). If a score was below 3 we set it to 3, and if above 6 to 6.
                                2       10
Uniform number of victories     32250   67750
Uniform average error           1.31    0.74
Gaussian number of victories    10859   89141
Gaussian average error          1.13    0.20

Table 5: Number of times each value of k in {2, 10} produces minimal error and average error values, over 100,000 items generated according to both generative models. For k = 10, we only permitted scores between 3 and 7 (inclusive). If a score was below 3 we set it to 3, and if above 7 to 7.
                                2       10
Uniform number of victories     93226   6774
Uniform average error           1.31    0.74
Gaussian number of victories    54459   45541
Gaussian average error          1.13    1.09

We next repeated these experiments for the rounding compression function. There are several interesting observations from Table 6. In this setting, k = 3 is the clear choice, performing best in both models (by a large margin for the Gaussian model). The smaller values of k perform significantly better with rounding than with flooring (as indicated by lower errors), while the larger values perform significantly worse, and their errors seem to approach 0.5 for both models. Taking both compressions into account, the optimal overall approach would still be to use flooring with k = 10, which produced the smallest average errors of 0.19 and 0.10 in the two models, while using k = 3 with rounding produced errors of 0.47 and 0.24. The 2 vs. 3 experiments produced very similar results for the two compressions (Table 7). The 2 vs. 10 results were quite different, with 2 performing better almost 40% of the time with rounding, vs. less than 10% with flooring (Table 8). In the 2 vs. 10 truncated 3–6 experiments, 2 performed relatively better with rounding for both models (Table 9), and for the 2 vs. 10 truncated 3–7 experiments, k = 2 performed better nearly all the time (Table 10).

Table 6: Number of times each value of k in {2, 3, 4, 5, 10} produces minimal error and average error values, over 100,000 items generated according to both models with rounding compression.

Table 7: k = 2 vs. 3 with rounding compression.
                                2       3
Uniform number of victories     33585   66415
Uniform average error           0.78    0.47
Gaussian number of victories    18307   81693
Gaussian average error          0.67    0.24

Table 8: k = 2 vs. 10 with rounding compression.
                                2       10
Uniform number of victories     37225   62775
Uniform average error           0.78    0.50
Gaussian number of victories    37897   62103
Gaussian average error          0.67    0.50

Table 9: k = 2 vs. 10 with rounding compression. For k = 10, only scores between 3 and 6 were permitted.
                                2       10
Uniform number of victories     55676   44324
Uniform average error           0.79    0.89
Gaussian number of victories    24128   75872
Gaussian average error          0.67    0.34

Table 10: k = 2 vs. 10 with rounding compression. For k = 10, only scores between 3 and 7 were permitted.
                                2       10
Uniform number of victories     99586   414
Uniform average error           0.78    3.50
Gaussian number of victories    95692   4308
Gaussian average error          0.67    1.45

Experiments
The empirical analysis of ranking-based datasets depends on the availability of large amounts of data depicting different types of real scenarios. For our experimental setup we used two different datasets from the “Rating and Combinatorial Preference Data” collection. One of these datasets contains 675,069 ratings on a scale of 1–5 of 1,842 hotels from the TripAdvisor website. The other consists of 398 approval ballots and subjective ratings on a 20-point scale collected over 15 potential candidates for the 2002 French Presidential election. The ratings were provided by students at Institut d’Etudes Politiques de Paris. The main quantities we are interested in computing are the number of times that each value of k produces the lowest error over the items, and the average value of the errors over all items for each k value. We also provide experimental results from the Jester Online Recommender System on joke ratings.

In the first set of experiments, the dataset contains different types of ratings based on the price, quality of rooms, proximity of location, etc., as well as an overall rating provided by the users scraped from TripAdvisor. We compared performance between using k = 2, 3, 4 to see for how many of the trials each value of k produced the minimal error using the floor approach (Tables 11 and 12). Surprisingly, we see that the number of victories sometimes decreases with an increase in the value of k, while the average error decreased monotonically (recall that we would have zero error if we set k to the actual maximum rating point). The number of victories increases in some cases for k = 2 vs. 3 compared to 2 vs. 4 (Table 13).

We next explored rounding to generate the ratings (Tables 14–17). For each value of k, all ratings provided by users were compressed with the computed k midpoints and the average score was calculated. Table 14 shows the average error induced by the compression, which performs better than the floor approach for this dataset.
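As an illustration of the midpoint (rounding) compression used in these experiments, here is a sketch with made-up ratings on a 1–5 scale (the data and the helper name are ours, not the TripAdvisor dataset):

```python
def round_compress_error(ratings, n, k):
    """Snap each rating to the nearest of the k region midpoints
    m_ki = n(2i+1)/(2k), then compare the compressed and true averages."""
    midpoints = [n * (2 * i + 1) / (2 * k) for i in range(k)]
    snapped = [min(midpoints, key=lambda m: abs(m - r)) for r in ratings]
    a_f = sum(ratings) / len(ratings)
    a_k = sum(snapped) / len(snapped)
    return abs(a_f - a_k)

# Hypothetical hotel ratings on a 1-5 scale (n = 5).
ratings = [5, 4, 4, 3, 5, 2, 4, 1, 5, 3]
for k in (2, 3, 4):
    print(k, round_compress_error(ratings, 5, k))
```

No rescaling step is needed here, since the snapped scores already live on the original scale.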
An interesting observation for rounding is that using k = n = 5 was outperformed by using k = 4 for several ratings, using both the average error and number of victories metrics, as shown in Table 17.

Average error       k = 2    3      4
Overall              1.04   0.31   0.15
Price                1.07   0.27   0.14
Rooms                1.06   0.32   0.16
Location             1.47   0.42   0.16
Cleanliness          1.43   0.40   0.16
Front Desk           1.34   0.33   0.14
Service              1.24   0.32   0.14
Business Service     0.96   0.28   0.18

Table 11: Average flooring error for hotel ratings.

We next experimented on data from the 2002 French Presidential election. This dataset has both approval ballots and subjective ratings of the candidates by each voter. Voters rated the candidates on a 20-point scale where 0.0 is the lowest possible rating and -1.0 indicates a missing value (our experiments ignored candidates with a -1 rating). The number of victories and minimal flooring error were consistent for all comparisons, with higher error achieved for lower values of k for each candidate. On the other hand, with rounding compression the minimal error was achieved at k = 2 for one candidate, while it was achieved at the two highest values, k = 8 or 10, for the others.

Minimal error       k = 2    3      4
Overall               235    450   1157
Price                 181    518   1143
Rooms                 254    406   1182
Location              111    231   1500
Cleanliness           122    302   1418
Front Desk            120    387   1335
Service               140    403   1299
Business Service      316    499   1027

Table 12: Number of times each k minimizes flooring error.

Average error       k = 2    3      4
Overall              0.50   0.28   0.15
Price                0.48   0.31   0.15
Rooms                0.48   0.30   0.16
Location             0.63   0.41   0.22
Cleanliness          0.6    0.4    0.21
Front Desk           0.55   0.39   0.21
Service              0.52   0.36   0.18
Business Service     0.39   0.36   0.18

Table 14: Average error using the rounding approach.

Minimal error       k = 2    3      4
Overall                82    132   1628
Price                  92     74   1676
Rooms                 152     81   1609
Location               93     52   1697
Cleanliness            79     44   1719
Front Desk             89     50   1703
Service               102     29   1711
Business Service      246    123   1473

Table 15: Number of times each k minimizes error with rounding.

Number of victories   k = 2 vs. 3    2 vs. 4      3 vs. 4
Overall                161, 1681    113, 1729    486, 1356
Price                  270, 1572    101, 1741    385, 1457
Rooms                  344, 1498    173, 1669    575, 1267
Location               275, 1567    109, 1733    344, 1498
Cleanliness            210, 1632     90, 1752    289, 1553
Front Desk             380, 1462     95, 1747    332, 1510
Service                358, 1484    109, 1733    399, 1443
Business Service       870,  972    278, 1564    853,  989

Table 16: Number of times each k minimizes error with rounding, for pairwise comparisons.

Average error       k = 4    5
Overall              0.15   0.21

Table 17: Average error with rounding for k in {4, 5}.

We also experimented on anonymous ratings data from the Jester Online Joke Recommender System [12]. Data was collected from 73,421 anonymous users between April 1999 and May 2003 who rated 36 or more jokes, with real-valued ratings ranging from -10.0 to +10.0. We included data from 24,983 users in our experiment. Each row of the dataset represents the ratings from a single user; the first column contains the number of jokes rated by that user, and the next 100 columns give the ratings for jokes 1–100. Due to space limitations we only experimented on a subset of columns (the ten most densely populated). The results are shown in Tables 20 and 21. For the TripAdvisor and French election data, the errors decrease intuitively as the number of choices increases. But surprisingly, for the Jester dataset we observe that the average errors are very close for all of the options (k = 2, 3, 4, 5, 10) with rounding compression (though with flooring they decrease monotonically with increasing k). These results suggest that while using more options generally seems to be better on real data using our models and metrics, this is not always the case.
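The Jester-style procedure (real-valued ratings on [-10, 10] compressed to the nearest of k bin midpoints) can be sketched as follows. Everything here is our own reconstruction with synthetic stand-in ratings, not the Jester data or the study's code; note that the per-score quantization error is at most half a bin width, 10/k, so the error of the mean is bounded by 10/k as well.

```python
import random

def rounding_error_of_mean(ratings, lo, hi, k):
    """Compress each real-valued rating to the nearest of the k bin
    midpoints on [lo, hi], then return |compressed mean - true mean|
    (our reconstruction of the procedure)."""
    width = (hi - lo) / k
    compressed = [lo + (min(int((r - lo) // width), k - 1) + 0.5) * width
                  for r in ratings]
    truth = sum(ratings) / len(ratings)
    return abs(sum(compressed) / len(compressed) - truth)

random.seed(0)
# Synthetic stand-in for one joke's ratings: 5,000 draws on [-10, 10].
ratings = [random.uniform(-10.0, 10.0) for _ in range(5000)]
for k in (2, 3, 4, 5, 10):
    print(k, round(rounding_error_of_mean(ratings, -10.0, 10.0, k), 4))
```

With uniform synthetic scores the midpoint of each bin is close to that bin's conditional mean, so the error of the compressed mean stays small for every k; this is one way a nearly flat error curve across k can arise, though the real Jester rating distributions are of course not uniform.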
In the future we would like to explore this further and understand what properties of the distribution and dataset determine when a smaller value of k can outperform the larger ones.

Average error              2      3      4      5      8      10
François Bayrou           3.18   1.5    0.94   0.66   0.3    0.2
Olivier Besancenot        1.7    0.8    0.5    0.35   0.16   0.1
Christine Boutin          1.15   0.54   0.34   0.24   0.11   0.07
Jacques Cheminade         0.64   0.3    0.19   0.13   0.06   0.04
Jean-Pierre Chevènement   3.69   1.74   1.09   0.77   0.35   0.23
Jacques Chirac            3.48   1.64   1.03   0.72   0.33   0.21
Robert Hue                2.39   1.12   0.7    0.49   0.22   0.14
Lionel Jospin             5.45   2.57   1.61   1.13   0.52   0.33
Arlette Laguiller         2.2    1.04   0.65   0.46   0.21   0.13
Brice Lalonde             1.53   0.72   0.45   0.32   0.14   0.09
Corinne Lepage            2.24   1.06   0.67   0.47   0.22   0.14
Jean-Marie Le Pen         0.4    0.19   0.12   0.08   0.04   0.02
Alain Madelin             1.93   0.91   0.57   0.4    0.18   0.12
Noël Mamère               3.68   1.74   1.09   0.77   0.35   0.23
Bruno Mégret              0.31   0.15   0.09   0.06   0.03   0.02

Table 18: Average flooring error for the French election.

Average error              2      3      4      5      8      10
François Bayrou           1.65   0.73   0.91   0.75   0.48   0.62
Olivier Besancenot        3.88   2.39   2.14   1.7    1.31   1.25
Christine Boutin          3.87   2.39   1.84   1.5    0.9    0.86
Jacques Cheminade         4.34   2.72   2.07   1.65   1.02   0.88
Jean-Pierre Chevènement   1.47   0.65   1.2    0.82   0.55   0.61
Jacques Chirac            1.64   1.0    1.13   0.88   0.55   0.64
Robert Hue                2.51   1.27   1.14   1.09   0.67   0.77
Lionel Jospin             0.33   0.49   0.87   0.67   0.51   0.63
Arlette Laguiller         2.62   1.34   1.34   1.02   0.6    0.63
Brice Lalonde             3.45   1.9    1.55   1.21   0.66   0.78
Corinne Lepage            2.89   1.59   1.56   1.16   0.79   0.87
Jean-Marie Le Pen         4.92   3.26   2.55   2.06   1.39   1.2
Alain Madelin             3.18   1.8    1.52   1.17   0.72   0.7
Noël Mamère               2.02   1.55   1.77   1.44   1.29   1.41
Bruno Mégret              4.88   3.23   2.46   1.99   1.28   1.1

Table 19: Average rounding error for the French election.

Conclusion

Settings in which humans must rate items or entities from a small discrete set of options are ubiquitous. We have singled out several important applications: rating attractiveness for dating websites, assigning grades to students, and reviewing academic papers.
The number of available options can vary considerably, even within different instantiations of the same application. For instance, we saw that three popular sites for attractiveness rating use completely different systems: Hot or Not uses a 1–10 system, OkCupid uses a 1–5 "star" system, and Tinder uses a binary 1–2 "swipe" system.

Average error    2      3      4      5      10
Joke 5          0.57   0.53   0.52   0.51   0.5
Joke 7          1.32   0.88   0.74   0.66   0.54
Joke 8          1.51   0.97   0.8    0.71   0.56
Joke 13         2.52   1.45   1.09   0.91   0.61
Joke 15         2.48   1.43   1.08   0.91   0.62
Joke 16         3.72   2.01   1.44   1.16   0.69
Joke 17         1.94   1.18   0.92   0.8    0.58
Joke 18         1.51   0.97   0.79   0.71   0.56
Joke 19         0.8    0.64   0.58   0.56   0.51
Joke 20         1.77   1.1    0.87   0.76   0.57

Table 20: Average flooring error for the Jester dataset.

Average error    2      3      4      5      10
Joke 5          0.48   0.47   0.48   0.47   0.48
Joke 7          1.2    1.2    1.2    1.2    1.2
Joke 8          1.44   1.43   1.42   1.43   1.42
Joke 13         2.43   2.43   2.43   2.42   2.42
Joke 15         2.34   2.34   2.33   2.33   2.33
Joke 16         3.59   3.58   3.57   3.57   3.57
Joke 17         1.84   1.82   1.82   1.81   1.81
Joke 18         1.45   1.44   1.44   1.44   1.44
Joke 19         0.72   0.72   0.71   0.71   0.71
Joke 20         1.65   1.63   1.63   1.63   1.63

Table 21: Average rounding error for the Jester dataset.

Despite the problem's importance, we have not seen it studied theoretically before. Our goal is to select k to minimize the average (normalized) error between the compressed average score and the ground-truth average. We studied two natural models for generating the scores. The first is a uniform model in which the scores are selected independently and uniformly at random; the second is a Gaussian model in which they are selected according to a more structured procedure that gives preference to options near the center. We provided a closed-form solution for continuous distributions with an arbitrary cdf. This allows us to characterize the relative performance of choices of k for a given distribution.
We saw that, counterintuitively, using a smaller value of k can actually produce lower error for some distributions (even though we know that as k approaches n the error approaches 0): we presented specific distributions for which using k = 2 outperforms 3 and 10.

We performed numerous simulations comparing the performance of different values of k for different generative models and metrics. The main metric was the absolute number of times each value of k produced the minimal error; we also considered the average error over all simulated items. Unsurprisingly, we observed that performance generally improves monotonically with k, more so for the Gaussian model than the uniform one. However, small values of k can be optimal a non-negligible fraction of the time, which is perhaps counterintuitive. In fact, using k = 2 outperformed k = 3, 4, 5, and 10 on 5.6% of the trials in the uniform setting. Just comparing 2 vs. 3, k = 2 performed better around 37% of the time. Using k = 2 outperformed k = 10 8.3% of the time, and when we restricted k = 10 to only assign values between 3 and 7 inclusive, k = 2 actually produced lower error 93% of the time! This could correspond to a setting where raters are ashamed to assign extreme scores (particularly extreme low scores). We compared two different natural compression rules, one based on the floor function and one based on rounding, and weighed the pros and cons of each. For smaller k, rounding leads to significantly lower error than flooring, with k = 3 the clear optimal choice, while for larger k rounding leads to much larger error.

A future avenue is to extend our analysis to better understand the specific distributions for which different values of k are optimal; our simulations aggregate over many distributions. Application domains will have distributions with different properties, and improved understanding will allow us to determine which k is optimal for the types of distributions we expect to encounter.
This improved understanding can be coupled with further data exploration.

References

[1] Duane F. Alwin. Feeling thermometers versus 7-point scales: Which are better? Sociological Methods and Research, 25:318–340, February 1997.
[2] Jerrod Ankenman and Bill Chen. The Mathematics of Poker. ConJelCo LLC, Pittsburgh, PA, USA, 2006.
[3] Laura Blédaité and Francesco Ricci. Pairwise preferences elicitation and exploitation for conversational collaborative filtering. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT '15, pages 231–236, New York, NY, USA, 2015. ACM.
[4] Carolyn C. Preston and Andrew Colman. Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104:1–15, April 2000.
[5] Dan Cosley, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl. Is seeing believing? How recommender system interfaces affect users' opinions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '03, pages 585–592, New York, NY, USA, 2003. ACM.
[6] Eli P. Cox III. The optimal number of response alternatives for a scale: A review. Journal of Marketing Research, 17:407–442, 1980.
[7] Pradeep Dubey and John Geanakoplos. Grading exams: 100, 99, 98, . . . or A, B, C? Games and Economic Behavior, 69:72–94, May 2010.
[8] Baruch Fischhoff. Value elicitation: Is there anything in there? American Psychologist, 46:835–847, 1991.
[9] Hershey H. Friedman, Yonah Wilamowsky, and Linda W. Friedman. A comparison of balanced and unbalanced rating scales. The Mid-Atlantic Journal of Business, 19(2):1–7, 1981.
[10] Hannah Fry. The Mathematics of Love. TED Books, New York, NY, 2015.
[11] Ron Garland. The mid-point on a rating scale: Is it desirable? Marketing Bulletin, 2:66–70, 1991. Research Note 3.
[12] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001. http://eigentaste.berkeley.edu.
[13] Saikishore Kalloori. Pairwise preferences and recommender systems. In Proceedings of the 22nd International Conference on Intelligent User Interfaces Companion, IUI '17 Companion, pages 169–172, New York, NY, USA, 2017. ACM.
[14] Daniel Kluver, Tien T. Nguyen, Michael Ekstrand, Shilad Sen, and John Riedl. How many bits per rating? In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 99–106, New York, NY, USA, 2012. ACM.
[15] Yehuda Koren and Joe Sill. OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 117–124, New York, NY, USA, 2011. ACM.
[16] James R. Lewis and Oğuzhan Erdinç. User experience rating scales with 7, 11, or 101 points: Does it matter? Journal of Usability Studies, 12(2):73–91, February 2017.
[17] Rensis A. Likert. A technique for the measurement of attitudes. Archives of Psychology, 22:1–, 1932.
[18] Luis Lozano, Eduardo García-Cueto, and José Muñiz. Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4:73–79, May 2008.
[19] Michael S. Matell and Jacob Jacoby. Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and Psychological Measurement, 31:657–674, 1971.
[20] Michael S. Matell and Jacob Jacoby. Is there an optimal number of alternatives for Likert scale items? Effects of testing time and scale properties. Journal of Applied Psychology, 56(6):506–509, 1972.
[21] Aubrey C. McKennell. Surveying attitude structures: A discussion of principles and procedures. Quality and Quantity, 7(2):203–294, September 1974.
[22] Bruno Pradel, Nicolas Usunier, and Patrick Gallinari. Ranking with non-random missing ratings: Influence of popularity and positivity on evaluation metrics. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 147–154, New York, NY, USA, 2012. ACM.
[23] Donald R. Lehmann and James Hulbert. Are three-point scales always good enough? Journal of Marketing Research, 9:444, November 1972.
[24] Sheena S. Iyengar and Mark Lepper. When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79:995–1006, 2001.
[25] Seymour Sudman, Norman M. Bradburn, and Norbert Schwarz. Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. Jossey-Bass Publishers, San Francisco, California, 1996.
[26] Albert R. Wildt and Michael B. Mazis. Determinants of scale response: Label versus position.