Efficient Online Scalar Annotation with Bounded Support
Keisuke Sakaguchi and
Benjamin Van Durme
Johns Hopkins University
{keisuke,vandurme}@cs.jhu.edu

Abstract
We describe a novel method for efficiently eliciting scalar annotations for dataset construction and system quality estimation by human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores derive from annotator comparison of items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels) proposed here. Our proposal leads to increased correlation with ground truth, at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.

1 Introduction

We are concerned here with the construction of datasets and the evaluation of systems within natural language processing (NLP): specifically, with humans providing responses that are used to derive graded values on natural language contexts, or in the ordering of systems corresponding to their perceived performance on some task.

Many NLP datasets involve eliciting some graded response from annotators. The most popular annotation scheme is the n-ary ordinal approach illustrated in Figure 1(a). For example, text may be labeled for sentiment as positive, neutral, or negative (Wiebe et al., 1999; Pang et al., 2002; Turney, 2002, inter alia); or under political spectrum analysis as liberal, neutral, or conservative (O'Connor et al., 2010; Bamman and Smith, 2015). A response may correspond to a likelihood judgment, e.g., how likely a predicate is to be factive (Lee et al., 2015), or how likely some natural language inference is to hold (Zhang et al., 2017). Responses may correspond to a notion of semantic similarity, e.g., whether one word can be substituted for another in context (Pavlick et al., 2015), or whether an entire sentence is more or less similar than another (Marelli et al., 2014), and so on.

Figure 1: Elicitation strategies for graded response include direct assessment via ordinal or scalar judgments, and pairwise comparisons aggregated via an assumption of latent distributions such as Gaussians or, novel here, beta distributions, providing bounded support. The example concerns subjective assessments of the lexical frequency of "dog". In pairwise comparison, we assess frequency by comparisons such as: "burrito" is less frequent (≺) than "dog".

Less common in NLP are system comparisons based on direct human ratings, but an exception is the annual shared task evaluation of the Conference on Machine Translation (WMT). There, MT practitioners submit system outputs based on a shared set of source sentences, which are then judged relative to other system outputs. Various aggregation strategies have been employed over the years to take these relative comparisons and derive competitive rankings between shared task entrants (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015, 2016, 2017).

Inspired by prior work in MT system evaluation, we propose a procedure for eliciting graded responses that we demonstrate to be more efficient than prior work. While remaining applicable to system evaluation, our experimental results suggest our approach as a more general framework for a variety of future data creation tasks, allowing for higher quality data in less time and at lower cost.

We consider three different approaches for scalar annotation: direct assessment (DA), online pairwise ranking aggregation (RA), and a hybrid method which we call EASL (Efficient Annotation of Scalar Labels). DA, shown in Figure 1(b), directly annotates absolute judgments on some scale (e.g., 0 to 100), independently per item (§2); RA infers scalar values from pairwise comparisons between items (§3); and EASL is a hybrid (§4) that combines benefits of DA and RA. We illustrate the improvements enabled by our proposal on three example tasks (§5). For example, we find that in the commonly employed condition of 3-way redundant annotation, our approach on multiple tasks gives similar quality with just 2-way redundancy: this translates to a potential 50% increase in dataset size for the same cost.
2 Direct Assessment

Direct assessment or direct annotation (DA) is a straightforward method for collecting graded responses from annotators. The most popular scheme is n-ary ordinal labeling, illustrated in Figure 1(a), where annotators are shown one instance (i.e., sample point) and asked to assign one of n ordered classes.

According to the levels of measurement in psychometrics (Stevens, 1946, inter alia), which classify numerals based on certain properties (e.g., identity, order, quantity), ordinal data do not allow for degree of difference. Namely, there is no guarantee that the distance between each pair of labels is equal, and instances in the same class are not discriminated. For example, in a typical five-level Likert scale (Likert, 1932) of likelihood (very unlikely, unlikely, unsure, likely, very likely), we cannot conclude that very likely instances are exactly twice as likely as those marked likely, nor can we assume two instances with the same label have exactly the same likelihood.

(EASL is pronounced as "easel". We release the code at http://decomp.net/.)

The issue of distance between ordinals is perhaps obviated by using scalar annotations (i.e., the ratio scale in Stevens's terminology), which directly correspond to continuous quantities (Figure 1(b)). In scalar DA, each instance in the collection (S_i ∈ S_N) is annotated with values (e.g., on the range 0 to 100), often by several annotators. The notion of quantitative difference is enabled by the property of absolute zero: the scale is bounded. For example, distance, length, mass, size, etc. are represented by this scale. In the annual shared task evaluation of WMT, DA has been used for scoring adequacy and fluency of machine translation system outputs with human evaluation (Graham et al., 2013, 2014; Bojar et al., 2016, 2017), and has separately been used in creating datasets such as for factuality (Lee et al., 2015).

Why perhaps obviated?
Because of two concerns: (1) annotators may not have a pre-existing, well-calibrated scale for performing DA on a particular collection according to a particular task; and (2) it is known that people may be biased in their scalar estimates (Tversky and Kahneman, 1974). Regarding (1), this motivates us to consider RA, on the intuition that annotators may give more calibrated responses when judging in the context of other elements. (E.g., try to imagine your level of calibration to a hypothetical task described as "On a scale of 1 to 100, label this tweet according to a conservative / liberal political spectrum.") Regarding (2), our goal is not to correct for human bias, but simply to more efficiently converge to the same consensus judgments already being pursued by the community in their annotation protocols, biased or otherwise. (In the rest of the paper, we take DA to mean scalar annotation rather than ordinals. There has also been a line of work on relative weighting of annotators, based on their agreement with others (Whitehill et al., 2009; Welinder et al., 2010; Hovy et al., 2013); in this paper, however, we do not perform such annotator weighting.)

3 Online Pairwise Ranking Aggregation

Pairwise ranking aggregation (Thurstone, 1927) is a method to obtain a total ranking over instances, assuming that the scalar value for each sample point follows a Gaussian distribution, N(μ_i, σ²). The parameters {μ_i} are interpreted as the mean scalar annotation. Given the parameters, the probability that S_i is preferred (≻) over S_j is defined as

    p(S_i ≻ S_j) = Φ((μ_i − μ_j) / (√2 σ)),    (1)

where Φ(·) is the cumulative distribution function of the standard normal distribution. The objective of pairwise ranking aggregation (including all the following models) is formulated as maximum log-likelihood estimation:

    max_{S_N} Σ_{S_i, S_j ∈ {S_N}} log p(S_i ≻ S_j).    (2)
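As a concrete illustration of Eqns. (1) and (2), the Thurstone preference probability and the log-likelihood objective can be sketched in a few lines of Python. This is a minimal sketch: the function names and the list-of-pairs encoding of comparisons are our own, not from the released code.

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_prefer(mu_i: float, mu_j: float, sigma: float = 1.0) -> float:
    """Eqn. (1): p(S_i > S_j) = Phi((mu_i - mu_j) / (sqrt(2) * sigma))."""
    return normal_cdf((mu_i - mu_j) / (math.sqrt(2.0) * sigma))

def log_likelihood(mu, comparisons, sigma: float = 1.0) -> float:
    """Eqn. (2): sum of log p(S_i > S_j) over observed comparisons,
    where each comparison is a (winner_index, loser_index) pair."""
    return sum(math.log(p_prefer(mu[i], mu[j], sigma)) for i, j in comparisons)
```

Equal means give a preference probability of 0.5, and a larger gap μ_i − μ_j pushes it toward 1.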
3.1 TrueSkill

TrueSkill (Herbrich et al., 2006) extends the Thurstone model by applying a Bayesian online and active learning framework, allowing for ties. TrueSkill has been used in the Xbox Live online gaming community, and has been applied to various NLP tasks, such as question difficulty estimation (Liu et al., 2013), ranking speech quality (Baumann, 2017), and ranking machine translation and grammatical error correction systems with human evaluation (Bojar et al., 2014, 2015; Sakaguchi et al., 2014, 2016).

As in the Thurstone model, TrueSkill assumes that the scalar value for each instance S_i (i.e., the skill level of each player, in the context of TrueSkill) follows a Gaussian distribution N(μ_i, σ_i²), where σ_i² is also parameterized, as the uncertainty of the scalar value for each instance. (Thurstone, and another popular ranking method by Elo (1978), use a fixed σ for all instances.) Importantly, TrueSkill uses a Bayesian online learning scheme, and the parameters are iteratively updated after each observation of a pairwise comparison (i.e., a game result: win (≻), tie (≡), or loss (≺)), in proportion to how surprising the outcome is. Let t = μ_i − μ_j be the difference in scalar responses (skill levels) when we observe that i wins over j, and let ε ≥ 0 be a parameter specifying the tie rate. The update functions for the means are formulated as follows:

    μ_i = μ_i + (σ_i² / c) · v(t/c, ε/c)    (3)
    μ_j = μ_j − (σ_j² / c) · v(t/c, ε/c),    (4)

where c² = 2γ² + σ_i² + σ_j², and v is a multiplicative factor that reflects the amount of change (the surprisal of the outcome) in μ. In the accumulation of the variances (c²), a free parameter called the "skill chain", γ, indicates the width (or difference) of skill levels at which the stronger of two given players has an 80% probability of winning.

Figure 2: Surprisal of the outcome for μ and σ: (a) v_{i≻j}, (b) v_{i≡j}, (c) w_{i≻j}, (d) w_{i≡j}.
The multiplicative factor depends on the observation (win or tie):

    v_{i≻j}(t, ε) = φ(−ε + t) / Φ(−ε + t)    (5)
    v_{i≡j}(t, ε) = (φ(−ε − t) − φ(ε − t)) / (Φ(ε − t) − Φ(−ε − t)),    (6)

where φ(·) is the probability density function of the standard normal distribution. As shown in Figure 2 (a) and (b), v_{i≻j} increases exponentially as t becomes smaller (i.e., as the observation becomes more unexpected), whereas v_{i≡j} is close to zero when |t| is close to zero. In short, v becomes larger as the outcome is more surprising.

To update the variances (σ²), another set of update functions is used:

    σ_i² = σ_i² · [1 − (σ_i² / c²) · w(t/c, ε/c)]    (7)
    σ_j² = σ_j² · [1 − (σ_j² / c²) · w(t/c, ε/c)],    (8)

where w serves as a multiplicative factor that reflects the amount of change in σ²:

    w_{i≻j}(t, ε) = v_{i≻j} · (v_{i≻j} + t − ε)    (9)
    w_{i≡j}(t, ε) = v_{i≡j}² + ((ε − t) · φ(ε − t) + (ε + t) · φ(ε + t)) / (Φ(ε − t) − Φ(−ε − t)).    (10)

As shown in Figure 2 (c) and (d), the value of w is between 0 and 1. The underlying idea of the variance updates is that they always decrease the variances σ², which means the uncertainty over the instances (S_i, S_j) always decreases as we observe more pairwise comparisons. In other words, TrueSkill becomes more confident in the current estimates of μ_i and μ_j. Further details are provided by Herbrich et al. (2006).

Another important property of TrueSkill is "match quality" (chance to draw). Match quality helps in selecting competitive players, to make games more interesting. More broadly, match quality enables us to choose similar instances to be compared, maximizing the information gained from pairwise comparisons, as in the active learning literature (Settles et al., 2008).
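The win-observation case of the updates above (Eqns. 3, 4, 7, 8, with the factors from Eqns. 5 and 9) can be sketched as follows. This is our own reading of the reconstructed equations, not the released implementation; variances are passed and returned as σ² rather than σ.

```python
import math

def phi(x: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t: float, eps: float) -> float:
    """Eqn. (5): mean-update factor after observing a win."""
    return phi(-eps + t) / Phi(-eps + t)

def w_win(t: float, eps: float) -> float:
    """Eqn. (9): variance-update factor after observing a win."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def trueskill_win_update(mu_i, var_i, mu_j, var_j, gamma=0.1, eps=0.1):
    """One online update after observing S_i beats S_j (Eqns. 3, 4, 7, 8).
    var_* are variances (sigma squared); gamma is the skill-chain parameter."""
    c2 = 2.0 * gamma ** 2 + var_i + var_j
    c = math.sqrt(c2)
    t, e = (mu_i - mu_j) / c, eps / c
    new_mu_i = mu_i + (var_i / c) * v_win(t, e)
    new_mu_j = mu_j - (var_j / c) * v_win(t, e)
    new_var_i = var_i * (1.0 - (var_i / c2) * w_win(t, e))
    new_var_j = var_j * (1.0 - (var_j / c2) * w_win(t, e))
    return new_mu_i, new_var_i, new_mu_j, new_var_j
```

An upset (the lower-rated player winning) produces a large v and hence a large move in both means, while both variances shrink regardless of the outcome.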
The match quality between two instances (players) is computed as follows:

    q(γ, S_i, S_j) := √(2γ² / c²) · exp(−(μ_i − μ_j)² / (2c²)).    (11)

Intuitively, the match quality is based on the difference μ_i − μ_j: as the difference becomes smaller, the match quality goes higher, and vice versa.

As mentioned, TrueSkill has been used in NLP tasks to infer continuous values for instances. However, it is important to note that the support of a Gaussian distribution is unbounded, namely R = (−∞, ∞). This does not satisfy the property of absolute zero of scalar annotation in the levels of measurement (§2).

3.2 Ranking Aggregation with Bounded Support

TrueSkill can induce a continuous spectrum of instances (such as the skill levels of game players) by assuming that each instance is represented as a Gaussian distribution. However, the Gaussian distribution has unbounded support, namely R = (−∞, ∞), which does not satisfy the property of absolute bounds required for appropriate scalar annotation (i.e., the ratio scale in the levels of measurement).

Thus, we propose a variant of TrueSkill, changing the latent distribution from a Gaussian to a beta distribution and using a heuristic inference algorithm based on TrueSkill. The beta distribution has natural [0, 1] upper and lower bounds and a simple parameterization: S_i ∼ B(α_i, β_i). We take the scalar response to be the mode M[S_i] of the distribution, and the variance to be the uncertainty:

    M_i = (α_i − 1) / (α_i + β_i − 2)    (12)
    Var_i = σ_i² = α_i β_i / ((α_i + β_i)² (α_i + β_i + 1)).    (13)

(We might instead have used the mean of the distribution, E[S_i] = α_i / (α_i + β_i). In a beta distribution with α, β > 1 the mean is always closer to 0.5 than the mode, whereas mean and mode always coincide in a Gaussian distribution. The mode was selected owing to better performance in development.)

As in TrueSkill, we iteratively update the parameters B(α, β) of the instances according to each observation and how surprising it is. Similarly to Eqns. (3) and (4), we choose the update functions as follows. First, in the case that an annotator judged S_i to be preferred over S_j (S_i ≻ S_j):

    α_i = α_i + (σ_i² / c) · (1 − p_{i≻j})    (14)
    β_j = β_j + (σ_j² / c) · (1 − p_{j≺i});    (15)

in the case of a tie with |D| > ε (where D = M_i − M_j) and M_i > M_j:

    α_j = α_j + (σ_j² / c) · (1 − p_{i≡j})    (16)
    β_i = β_i + (σ_i² / c) · (1 − p_{i≡j});    (17)

and in the case of a tie with |D| ≤ ε, for both S_i and S_j:

    α_{i,j} = α_{i,j} + (σ_{i,j}² / c) · (1 − p_{i≡j})    (18)
    β_{i,j} = β_{i,j} + (σ_{i,j}² / c) · (1 − p_{i≡j}).    (19)

(There may be other potential update (and surprisal) functions, such as −log p instead of 1 − p. As with our use of the mode rather than the mean as the scalar response, we empirically developed our update functions with respect to the annotation efficiency observed through experimentation (§5).)

Figure 3: Surprisal of the outcome for the bounded variant: (a) 1 − p_{i≻j}, (b) 1 − p_{i≡j}.
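The mode and variance used as the scalar estimate and uncertainty (Eqns. 12 and 13) are standard beta-distribution quantities, and can be sketched directly (helper names are ours):

```python
def beta_mode(alpha: float, beta: float) -> float:
    """Eqn. (12): mode of Beta(alpha, beta); requires alpha, beta > 1."""
    return (alpha - 1.0) / (alpha + beta - 2.0)

def beta_variance(alpha: float, beta: float) -> float:
    """Eqn. (13): variance of Beta(alpha, beta)."""
    s = alpha + beta
    return (alpha * beta) / (s * s * (s + 1.0))
```

Note that the variance shrinks as the pseudo-counts α + β grow, which is exactly the "uncertainty decreases with more observations" behavior the update rules rely on.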
Regarding the probability of a pairwise comparison between instances, we follow Bradley and Terry (1952) and Rao and Kupper (1967) to describe the chance of a win, tie, or loss, as follows:

    p(S_i ≻ S_j) = p(D > ε) = π_i / (π_i + θ π_j)    (20)
    p(S_i ≺ S_j) = p(D < −ε) = π_j / (θ π_i + π_j)    (21)
    p(S_i ≡ S_j) = p(|D| ≤ ε) = ((θ² − 1) π_i π_j) / ((π_i + θ π_j)(θ π_i + π_j)),    (22)

where D = M_i − M_j, ε ≥ 0 is a parameter specifying the tie rate, θ = exp(ε), and π is an exponential score function of S: π_i = exp(M_i).

It is important to note that α and β never decrease (because 1 − p ≥ 0, as shown in Figure 3), which preserves the property that the variance (uncertainty) always decreases as we observe more judgments, as in TrueSkill (§3.1). Unlike TrueSkill, we do not need separate update functions for μ and σ², since the mode and variance of a beta distribution depend on the two shared parameters α, β (Eqns. 12 and 13).

Regarding match quality, we use the same formulation as TrueSkill (Eqn. 11), except that the bounded model uses M instead of μ:

    q(γ, S_i, S_j) = √(2γ² / c²) · exp(−(M_i − M_j)² / (2c²)).    (23)

4 EASL: Efficient Annotation of Scalar Labels

In the previous section, we proposed a bounded online ranking aggregation model for scalar annotation. However, the amount of update from a pairwise judgment depends only on the distance between instances, not on the distance from the bounds (i.e., 0 and 1). To integrate this property into the online ranking aggregation model,
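The Bradley-Terry probabilities with Rao-Kupper ties (Eqns. 20-22) can be sketched as below. This is a minimal sketch under the definitions in the text (π = exp(M), θ = exp(ε)); the function name is ours.

```python
import math

def rao_kupper_probs(m_i: float, m_j: float, eps: float = 0.1):
    """Eqns. (20)-(22): win, tie, and loss probabilities for S_i vs. S_j,
    with pi = exp(M) and theta = exp(eps)."""
    theta = math.exp(eps)
    pi_i, pi_j = math.exp(m_i), math.exp(m_j)
    p_win = pi_i / (pi_i + theta * pi_j)
    p_loss = pi_j / (theta * pi_i + pi_j)
    p_tie = ((theta ** 2 - 1.0) * pi_i * pi_j) / (
        (pi_i + theta * pi_j) * (theta * pi_i + pi_j))
    return p_win, p_tie, p_loss
```

A useful sanity check on the reconstructed tie term (θ² − 1)π_iπ_j is that the three probabilities sum to one for any M_i, M_j, ε.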
we propose EASL (Efficient Annotation of Scalar Labels), which combines benefits from both direct assessment (DA) and the bounded online ranking aggregation model (RA).

Figure 4: Illustrative example of the EASL protocol. Each instance is represented as a beta distribution. Instances are chosen for annotation according to variance and match quality, and the parameters are updated iteratively.

Similarly to RA, EASL parameterizes each instance by a beta distribution (Eqns. 12 and 13), and the parameters are inferred using a computationally efficient and easy-to-implement heuristic. The difference from RA is the type of annotation. Whereas RA asks for a discrete pairwise judgment (≻, ≺, ≡) between S_i and S_j, here we directly ask for scalar values for them (denoted s_i and s_j), as in DA. Thus, given an annotated score s_i normalized to [0, 1], we change the update functions as follows:

    α_i = α_i + s_i    (24)
    β_i = β_i + (1 − s_i).    (25)

This procedure may look similar to DA, where s_i is simply accumulated and averaged at the end. However, there are two differences. First, as illustrated in Figure 4, EASL parameterizes each instance as a probability distribution, while DA does not. Second, DA elicits annotations independently per element, whereas EASL elicits annotations on elements in the context of other elements selected jointly according to match quality.

Further, DA generally uses a batch-style annotation scheme, where the number of annotations per instance is independent of the latent scalar values. EASL, on the other hand, uses online learning, which impacts the calculation of match quality. This allows us to choose instances to annotate by order of uncertainty; and, as in RA, the match quality (Eqn. 23) enables us to consider similar instances in the same context.

(Novikova et al. (2018) recently proposed a similar approach named RankME, a variant of DA that compares multiple instances at a time. It can also be regarded as a batch-learning variant of EASL without the probabilistic parameterization.)

Figure 5: Example of partial ranking with scalars (HITs).
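The EASL update (Eqns. 24-25) amounts to treating each normalized score as a fractional pseudo-count. A minimal sketch (helper names ours):

```python
def easl_update(alpha: float, beta: float, score: float):
    """Eqns. (24)-(25): fold one scalar judgment (normalized to [0, 1])
    into the beta parameters of an instance."""
    assert 0.0 <= score <= 1.0
    return alpha + score, beta + (1.0 - score)

def beta_mode(alpha: float, beta: float) -> float:
    """Mode of Beta(alpha, beta) (Eqn. 12), used as the scalar estimate."""
    return (alpha - 1.0) / (alpha + beta - 2.0)

# Starting from an uninformative Beta(1, 1), fold in three annotators' scores.
a, b = 1.0, 1.0
for s in (0.8, 0.7, 0.9):
    a, b = easl_update(a, b, s)
```

After these three judgments the mode equals their average (0.8), but unlike plain DA averaging, the beta parameterization also carries a shrinking variance that drives instance selection.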
5 Experiments

To compare the different annotation methods, we conduct three experiments: (1) lexical frequency inference, (2) political spectrum inference, and (3) human evaluation of machine translation systems. In all experiments, data collection is conducted through Amazon Mechanical Turk (AMT). We recruit annotators who meet minimum requirements, including living in the US and a minimum overall approval rate. (In all experiments, we set the reward per instance at $0.01 (i.e., $0.05 per HIT in RA and EASL). This is $8/hour, assuming that annotating one instance takes five seconds. Prior to annotation, we ran a pilot to make sure that participants understood the task correctly and the instructions were clear.)

We use a partial ranking framework with scalars, where annotators are asked to rank and score n instances at a time, as illustrated in Figure 5. In all three experiments, we fix n = 5. The partial ranking yields C(n, 2) pairwise comparisons for RA and n scalar values for EASL. (The partial ranking can be regarded as mini-batching.) It is important to note that we can simultaneously retrieve pairwise judgments (≻, ≺, ≡) as well as scalar values from this format.

Algorithm 1: Online pairwise ranking aggregation with bounded support.
  Input: Instances {S_N}
  Output: Updated instances {S_N}
  /* Initialize params */
  (α_i, β_i) ∈ S = (α_i^init, β_i^init)
  /* Update S over iterations */
  foreach iteration do
      HITS = SampleByMatchQuality(S, N, n)
      A = Annotate(HITS)
      for obs ∈ A do
          // Update S
          i, j, d = parseObservation(obs)
          α_{i,j}, β_{i,j} = update(i, j, d)
  return S

  Function SampleByMatchQuality(S, N, n):
      k = N / n
      descendingSort(S, key=Var[S])
      S′ = top-k instances of S
      HITS = []
      foreach S_i ∈ S′ do
          m = []
          foreach S_j ∈ S \ S′ do
              m.append([matchQuality(S_i, S_j), j])
          p = normalize(m)
          S̃ = sample n−1 items by p
          HITS.append([S_i, S̃])
      return HITS

In each iteration, n instances are selected by variance and match quality. We first select the top k (= N/n) instances according to variance, and for each selected instance we choose the other n − 1 instances to be compared based on match quality. This approach has been used in the NLP community in tasks such as assessing machine translation quality (Bojar et al., 2014; Sakaguchi et al., 2014; Bojar et al., 2015, 2016) to collect pairwise judgments efficiently. The detailed procedure of iterative parameter updates in RA and EASL is described in Algorithm 1. As mentioned in Section 4, the main difference between RA and EASL is the update functions (line 7).

Model hyper-parameters in RA and EASL are set as follows: each instance is initialized with α_i^init = 1.0 and β_i^init = 1.0, and the skill chain parameter γ and tie-rate parameter ε are both set to 0.1. (We explored the hyper-parameters γ, ε in a pilot task.)

5.1 Lexical Frequency Inference

In the first experiment, we compare the three scalar annotation approaches on lexical frequency inference, in which we ask annotators to judge the frequency (from very rare to very frequent) of verbs randomly selected from the Corpus of Contemporary American English (COCA). (Lexical frequency inference is an established experiment in (computational) psycholinguistics; e.g., human behavioral measures have been compared with predictability and bias in various corpora (Balota et al., 1999; Fine et al., 2014).)

Figure 6: Spearman's (top) and Pearson's (bottom) correlations for the three methods on lexical frequency inference annotation: direct assessment (DA), online ranking aggregation (RA), and EASL. The shade for each line indicates 95% confidence intervals by bootstrap resampling (run 100 times).
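The SampleByMatchQuality routine in Algorithm 1 can be sketched as follows. This is a rough sketch under our own data layout (a dict mapping an instance id to its (mode, variance) pair); for simplicity we draw partners with replacement, whereas the HITs in the paper contain distinct items.

```python
import math
import random

def match_quality(m_i, m_j, var_i, var_j, gamma=0.1):
    """Eqn. (23): match quality between two instances, with the beta
    modes M and variances standing in for TrueSkill's mu and sigma^2."""
    c2 = 2.0 * gamma ** 2 + var_i + var_j
    return math.sqrt(2.0 * gamma ** 2 / c2) * math.exp(-((m_i - m_j) ** 2) / (2.0 * c2))

def sample_by_match_quality(instances, n=5, gamma=0.1, rng=random):
    """Sketch of SampleByMatchQuality: the top-k most uncertain instances
    each anchor one HIT, filled with n-1 partners drawn from the remaining
    pool in proportion to match quality."""
    k = max(1, len(instances) // n)
    by_var = sorted(instances, key=lambda i: instances[i][1], reverse=True)
    anchors, pool = by_var[:k], by_var[k:]
    hits = []
    for a in anchors:
        m_a, v_a = instances[a]
        weights = [match_quality(m_a, instances[j][0], v_a, instances[j][1], gamma)
                   for j in pool]
        partners = rng.choices(pool, weights=weights, k=n - 1)
        hits.append([a] + partners)
    return hits
```

Because weights follow Eqn. (23), partners whose current modes are close to the anchor's are drawn more often, mirroring the "competitive match" heuristic.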
We include this task for evaluation owing to its non-subjective ground truth (relative corpus frequency), which can be used as an oracle response we would like to maximally correlate with.

We randomly select 150 verbs from COCA; the log frequency is regarded as the oracle. In DA, each instance is annotated by 10 different annotators. In RA and EASL, annotators are asked to rank/score five verbs per HIT (n = 5). Each iteration contains 20 HITs, and we run 10 iterations, so that the total number of annotations is the same across DA, RA, and EASL. (Technically, the number of annotations per instance varies in RA and EASL, because they choose instances by match quality at each iteration.)

Figure 6 presents Spearman's and Pearson's correlations, indicating how accurately each annotation method obtains scalar values for each instance. Overall, for all three methods, the correlations increase as more annotations are made. The results also show that the RA and EASL approaches achieve high correlation more efficiently than DA. (The agreement rate in DA (10 annotators) is 0.37 in Spearman's ρ. Considering the difficulty of ranking 150 verbs, this rate is fair.)

Figure 7: Histograms of scalar values on lexical frequency obtained by each annotation scheme (direct assessment (DA), online ranking aggregation (RA), and EASL), and the oracle. The scalar annotations are put into five bins to show the overall distribution. The scalar in the oracle is normalized as log(frequency(S_i)) / max log(frequency(S)).

Figure 8: Heatmaps of the match quality distribution across the cross-product of instances ordered by the oracle (i.e., log frequency), at iterations 0, 3, 6, and 9.
The gain in efficiency from DA to EASL is about 50%: two iterations of EASL achieve a Spearman's ρ close to that of three annotators in DA.

Figure 7 presents the final scalar values produced by each method. The distribution of the histograms shows that, overall, all three methods successfully capture the latent distribution of scalar values in the data.

Figure 8 shows the dynamic change of match quality. In the beginning (iteration 0), all the instances are equally competitive, because we have no information about them and initialize them with the same parameters. As iterations go on, the instances along the diagonal have higher match quality, indicating that competitive matches are more likely to be selected for the next iteration. In other words, match quality helps to choose informative pairs to compare at each iteration, which reduces the number of less informative annotations (e.g., a pairwise comparison between the highest and lowest instances).

Figure 9: Spearman's (top) and Pearson's (bottom) correlations for the three methods on political spectrum annotation: direct assessment (DA), online ranking aggregation (RA), and EASL.

5.2 Political Spectrum Inference

In the second experiment, we compare the three scalar annotation methods on political spectrum inference. We use the Fine-Grained Political Statements dataset (Bamman and Smith, 2015), which consists of 766 propositions collected from political blog comments, paired with judgments about the political belief of the statement (or of a person who would say it) on five ordinals: very conservative (-2), slightly conservative (-1), neutral (0), slightly liberal (1), and very liberal (2). We normalize the ordinal scores between 0 and 1. The dataset contains the mean scores aggregated over 7 annotations for each proposition.

We randomly choose 150 political propositions from the dataset (see the histogram in Figure 10, oracle).
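The ordinal-to-scalar normalization described above (mapping the five labels in [-2, 2] onto [0, 1]) can be written as a one-liner; the helper name is ours.

```python
def normalize_ordinal(label: int, lo: int = -2, hi: int = 2) -> float:
    """Map a 5-ary ordinal political label in [lo, hi] onto [0, 1], so
    very conservative (-2) maps to 0.0 and very liberal (2) maps to 1.0."""
    return (label - lo) / (hi - lo)
```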
The experimental setting (i.e., the number of annotations per instance, the number of iterations, and the number of HITs in each iteration) is the same as in the lexical frequency inference experiment (§5.1). (We stress that the oracle here derives from subjective annotations: it does not necessarily reflect the true latent scalar values for each instance. However, in this experiment, we use these scores as a tentative oracle to compare the three scalar annotation methods objectively.)

Figure 9 presents the correlations. As in the first experiment, EASL reaches with fewer annotations a Spearman's ρ comparable to 6-way redundancy in DA. (The agreement rate in DA (among 10 annotators) is 0.67 in Spearman's ρ. This is quite high, considering the difficulty of ranking 150 instances in order.)

Figure 10: Histograms of scalar values on the political spectrum obtained by each annotation scheme (DA, RA, EASL) and the oracle. Scalars are put into five bins to show the overall distribution.

Table 1: Example propositions and the scalar political spectrum, ranging between 0 (very conservative) and 100 (very liberal), by each approach: direct assessment, online ranking aggregation, and EASL. The dashed lines indicate a split by the 5-ary ordinal scale.

    Proposition                     Gold   DA    RA    EASL
    the republicans are useless     100    91.7  75.8  91.9
    obama is right                  92.9   90.1  74.6  90.0
    hillary will win                78.6   86.3  72.9  86.4
    aca is a success                75.0   78.2  68.3  77.3
    harry reid is a democrat        53.6   55.5  55.8  55.9
    ebola is a virus                50.0   53.0  53.8  53.5
    cruz is eligible                32.2   31.0  44.0  31.4
    global warming is a religion    28.6   22.4  37.3  23.0
    bush kept us safe               10.7   9.6   31.5  9.6
    democrats are corrupt           0.0    7.1   29.9  7.4

Figure 10 presents the annotated scalar values for each method. The distribution of the histograms shows that DA and EASL successfully fit the distribution of the oracle, whereas RA converges to a rather narrow range. This is because of the "lack of distance from bounds" in RA explained in §4. We note that renormalizing the distribution in RA would not address the issue: for instance, when a dataset contains only liberal propositions, RA still fails to capture the latent distribution, because it looks only at relative distances between instances and not at the distance from the bounds. Table 1 shows examples of scalar annotations by each method. Again, we see that the RA approach has a narrower range than the oracle, DA, and EASL.
5.3 Ranking Machine Translation Systems

In the third experiment, we apply the scalar annotation methods to evaluating machine translation systems. This differs from the two previous experiments, because the main purpose is to rank the MT systems (S_N) rather than to score the adequacy (q) of each MT output for a given source sentence (m). Namely, we want to rank S_i by observing q_{i,m}.

We use the WMT16 German-English translation dataset (Bojar et al., 2016), which consists of 2,999 test set sentences and the translations from 10 different systems with DA annotation. Each sentence has an adequacy score annotation between 0 and 100, and the average adequacy scores are computed for each system for ranking. In this setting, annotators are asked to judge the adequacy of system output(s) with the reference given. The official scores (made by DA) and ranking in WMT16 are used as the oracle in this experiment.

In this experiment, we replicate DA and run EASL to compare their efficiency. We omit RA, because it does not necessarily capture the distance from the bounds, as shown in the previous experiment (§5.2). We assume the adequacy (q) of an MT output by system S_i for a given source sentence m is drawn from a beta distribution: q_{i,m} ∼ B(α_i, β_i). Annotators are asked to judge the adequacy of system outputs with scores between 0 and 100. Similarly to the previous experiments (§5.1, §5.2), we present n = 5 system outputs (for the same source sentence m) to annotate at a time. (This is the same setting as in WMT14, WMT15, and WMT16 (Bojar et al., 2014, 2015), although they used TrueSkill (Gaussian) instead of EASL to rank systems.) The procedure of parameter updates is the same as in the previous experiments (Algorithm 1).

We compare the correlation (Spearman's ρ) of the system ranking with respect to the number of annotations per system; the result is shown in Figure 11. As in the previous two experiments, EASL achieves higher Spearman's correlation on ranking MT systems with a smaller number of annotations than the baseline method (DA), which means EASL is able to collect annotations more efficiently. This result shows that EASL can be applied to efficient system evaluation in addition to data curation.

Figure 11: Spearman's correlation on ranking machine translation systems on WMT16 German-English data: direct assessment (DA) and EASL. The shade for each line indicates 95% confidence intervals by bootstrap resampling (run 100 times).
6 Conclusion

We have presented an efficient, online model to elicit scalar annotations for computational linguistics datasets and system evaluations. The model combines two approaches to scalar annotation: direct assessment and online pairwise ranking aggregation. We conducted three illustrative experiments on lexical frequency inference, political spectrum inference, and ranking machine translation systems. We have shown that our approach, EASL (Efficient Annotation of Scalar Labels), outperforms direct assessment in terms of annotation efficiency and outperforms online ranking aggregation in terms of accurately capturing the latent distributions of scalar values. The significant gains demonstrated suggest EASL as a promising approach for future dataset curation and system evaluation in the community.

Acknowledgments
We are grateful to Rachel Rudinger, Adam Teichert, Chandler May, Tongfei Chen, Pushpendre Rastogi, and the anonymous reviewers for their useful feedback. This work was supported in part by IARPA MATERIAL and DARPA LORELEI. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies of the U.S. Government.

References
David A. Balota, Michael J. Cortese, and Maura Pilotti. 1999. Item-level analyses of lexical decision performance: Results from a mega-study. In Abstracts of the 40th Annual Meeting of the Psychonomics Society, page 44, Los Angeles, California. Psychonomic Society.

David Bamman and Noah A. Smith. 2015. Open extraction of fine-grained political statements. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 76-85, Lisbon, Portugal. Association for Computational Linguistics.

Timo Baumann. 2017. Large-scale speaker ranking from crowdsourced pairwise listener ratings. In Proceedings of Interspeech.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1-44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169-214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131-198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons.
Biometrika , 39(3/4):324–345.Chris Callison-Burch, Philipp Koehn, Christof Monz,Matt Post, Radu Soricut, and Lucia Specia. 2012.Findings of the 2012 workshop on statistical ma-chine translation. In
Proceedings of the SeventhWorkshop on Statistical Machine Translation , pages10–51, Montr´eal, Canada. Association for Compu-tational Linguistics.Arpad E. Elo. 1978.
The rating of chessplayers, pastand present . Arco Pub.Alex B. Fine, Austin F. Frank, T. Florian Jaeger, andBenjamin Van Durme. 2014. Biases in predictingthe human language model. In
Proceedings of the52nd Annual Meeting of the Association for Compu-tational Linguistics (Volume 2: Short Papers) , pages7–12, Baltimore, Maryland. Association for Compu-tational Linguistics.Yvette Graham, Timothy Baldwin, Alistair Moffat, andJustin Zobel. 2013. Continuous measurement scalesin human evaluation of machine translation. In
Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse , pages 33–41,Sofia, Bulgaria. Association for Computational Lin-guistics.Yvette Graham, Timothy Baldwin, Alistair Moffat, andJustin Zobel. 2014. Is machine translation gettingbetter over time? In
Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics , pages 443–451, Gothen-burg, Sweden. Association for Computational Lin-guistics.Ralf Herbrich, Tom Minka, and Thore Graepel. 2006.TrueSkill TM : A Bayesian Skill Rating System. In Proceedings of the Twentieth Annual Conference onNeural Information Processing Systems , pages 569–576, Vancouver, British Columbia, Canada.Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani,and Eduard Hovy. 2013. Learning whom to trustwith mace. In
Proceedings of the 2013 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies , pages 1120–1130, Atlanta, Georgia.Association for Computational Linguistics.Kenton Lee, Yoav Artzi, Yejin Choi, and Luke Zettle-moyer. 2015. Event detection and factuality as-sessment with non-expert supervision. In
Proceed-ings of the 2015 Conference on Empirical Meth-ods in Natural Language Processing , pages 1643–1648, Lisbon, Portugal. Association for Computa-tional Linguistics.ensis Likert. 1932. A technique for the measurementof attitudes.
Archives of psychology .Jing Liu, Quan Wang, Chin-Yew Lin, and Hsiao-WuenHon. 2013. Question difficulty estimation in com-munity question answering services. In
Proceedingsof the 2013 Conference on Empirical Methods inNatural Language Processing , pages 85–90, Seattle,Washington, USA. Association for ComputationalLinguistics.Marco Marelli, Stefano Menini, Marco Baroni, LuisaBentivogli, Raffaella bernardi, and Roberto Zampar-elli. 2014. A sick cure for the evaluation of compo-sitional distributional semantic models. In
Proceed-ings of the Ninth International Conference on Lan-guage Resources and Evaluation (LREC-2014) . Eu-ropean Language Resources Association (ELRA).J. Novikova, O. Duˇsek, and V. Rieser. 2018. RankME:Reliable Human Ratings for Natural Language Gen-eration.
ArXiv e-prints .Brendan O’Connor, Ramnath Balasubramanyan,Bryan R Routledge, and Noah A Smith. 2010. Fromtweets to polls: Linking text sentiment to publicopinion time series.
ICWSM , 11(122-129):1–2.Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.2002. Thumbs up?: sentiment classification usingmachine learning techniques. In
Proceedings of theACL-02 conference on Empirical methods in naturallanguage processing-Volume 10 , pages 79–86. As-sociation for Computational Linguistics.Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi,Chris Callison-Burch, Mark Dredze, and BenjaminVan Durme. 2015. Framenet+: Fast paraphrastictripling of framenet. In
Proceedings of the 53rd An-nual Meeting of the Association for ComputationalLinguistics and the 7th International Joint Confer-ence on Natural Language Processing (Volume 2:Short Papers) , pages 408–413, Beijing, China. As-sociation for Computational Linguistics.P. V. Rao and L. L. Kupper. 1967. Ties in paired-comparison experiments: A generalization of thebradley-terry model.
Journal of the American Sta-tistical Association , 62(317):194–204.Keisuke Sakaguchi, Courtney Napoles, Matt Post, andJoel Tetreault. 2016. Reassessing the goals of gram-matical error correction: Fluency instead of gram-maticality.
Transactions of the Association for Com-putational Linguistics , 4:169–182.Keisuke Sakaguchi, Matt Post, and BenjaminVan Durme. 2014. Efficient elicitation of annota-tions for human evaluation of machine translation.In
Proceedings of the Ninth Workshop on StatisticalMachine Translation , pages 1–11, Baltimore,Maryland, USA. Association for ComputationalLinguistics. Burr Settles, Mark Craven, and Lewis Friedland. 2008.Active learning with real annotation costs. In
Pro-ceedings of the NIPS workshop on cost-sensitivelearning , pages 1–10.S. S. Stevens. 1946. On the theory of scales of mea-surement.
Science , 103(2684):677–680.Louis L Thurstone. 1927. The method of paired com-parisons for social values.
The Journal of Abnormaland Social Psychology , 21(4):384.Peter D Turney. 2002. Thumbs up or thumbs down?:semantic orientation applied to unsupervised classi-fication of reviews. In
Proceedings of the 40th an-nual meeting on association for computational lin-guistics , pages 417–424. Association for Computa-tional Linguistics.Amos Tversky and Daniel Kahneman. 1974. Judgmentunder uncertainty: Heuristics and biases.
Science ,185(4157):1124–1131.Peter Welinder, Steve Branson, Pietro Perona, andSerge J Belongie. 2010. The multidimensional wis-dom of crowds. In
Advances in neural informationprocessing systems , pages 2424–2432.Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier RMovellan, and Paul L Ruvolo. 2009. Whose voteshould count more: Optimal integration of labelsfrom labelers of unknown expertise. In
Advances inneural information processing systems , pages 2035–2043.Janyce M Wiebe, Rebecca F Bruce, and Thomas PO’Hara. 1999. Development and use of a gold-standard data set for subjectivity classifications. In
Proceedings of the 37th annual meeting of the As-sociation for Computational Linguistics on Compu-tational Linguistics , pages 246–253. Association forComputational Linguistics.Sheng Zhang, Rachel Rudinger, Kevin Duh, and Ben-jamin Van Durme. 2017. Ordinal common-sense in-ference.