Studying the Effects of Cognitive Biases in Evaluation of Conversational Agents
Sashank Santhanam, Alireza Karduni and Samira Shaikh
University of North Carolina at Charlotte, Charlotte, USA
{ssantha1,akarduni,samirashaikh}@uncc.edu
ABSTRACT
Humans quite frequently interact with conversational agents. The rapid advancement in generative language modeling through neural networks has helped advance the creation of intelligent conversational agents. Researchers typically evaluate the output of their models through crowdsourced judgments, but there are no established best practices for conducting such studies. Moreover, it is unclear if cognitive biases in decision-making are affecting crowdsourced workers' judgments when they undertake these tasks. To investigate, we conducted a between-subjects study with 77 crowdsourced workers to understand the role of cognitive biases, specifically anchoring bias, when humans are asked to evaluate the output of conversational agents. Our results provide insight into how best to evaluate conversational agents. We find that increased consistency in ratings across two experimental conditions may be a result of anchoring bias. We also determine that external factors, such as time taken and prior experience with similar tasks, have effects on inter-rater consistency.
Author Keywords
Conversational agents; Human evaluation; Anchoring bias; Experiment design
CCS Concepts
• Human-centered computing → HCI design and evaluation methods; User studies;
• Computing methodologies → Discourse, dialogue and pragmatics;
INTRODUCTION
Conversational agents, also commonly known as chatbots, are typically designed with the intention of generating meaningful, informative and coherent responses that keep humans engaged in conversation. Conversational agents have become extremely popular and have been heralded as one of the recent
breakthrough technologies. The development of conversational agents has evolved from simple rule-based approaches such as Eliza [61] and PARRY [16] to more sophisticated template-based [43, 55] and data-driven approaches [35, 6]. Extant approaches towards building conversational agents are end-to-end systems that employ seq2seq architectures [58, 52], language modeling [11, 44] or transformer architectures [56]. Even with the rapid advancement in the development of conversational agents through these neural approaches, there is no established set of best practices for evaluating their performance. Evaluation procedures vary from one research article to the next, leading to a fragmented view of how the field is advancing. Overall, the output generated from these models is evaluated using automated metrics and/or (crowdsourced) human judgments. With respect to automated metrics, measures including BLEU [47], METEOR [5], ROUGE [39] and word-embedding based metrics [41], which can be calculated based on word overlap, have been used. However, prior research has shown that these metrics show little to no correlation with human ratings [45, 42, 41]. Due to these limitations of automated metrics, evaluation of chatbots is increasingly conducted by obtaining qualitative judgments from crowd-sourced workers [44, 63]. This puts a major imperative on how the experiments to collect crowd-sourced judgments are designed. However, research advancing best practices for experiment design for evaluating chatbot performance and obtaining more reliable and consistent ratings from crowd-sourced workers is very limited. Our work seeks to fill this research gap.

Consider the simple choice of the type of question used to elicit human judgments. Most current experiments for evaluating conversational agent output use Likert scales; a typical question would ask humans to rate the Readability of chatbot output on a scale of 1–5. However, research by Belz and Kow [10] has shown that using Likert scales may affect rating consistency; for example, some individuals may tend to avoid the extremes of the scale while others may not. Novikova et al. [46] have shown that continuous scales, as opposed to Likert scales, help improve the consistency and reliability of human ratings across several language evaluation tasks. In their experiments, Novikova et al. [46] found that consistency of crowd-sourced workers improved when workers were asked to rate the conversational agent output by comparing it against a given (gold) standard. A sample question in their study would ask human raters to input a number to rate the Readability of an algorithm output by comparing it against the provided gold standard response (with a standard response value of 100). But what if this increased consistency is a result of the very presence of the predetermined gold standard, possibly because human evaluators are anchored on that standard value of 100? Anchoring bias is the tendency of people to focus on the first piece of information presented; it is also defined as the "inability of people to make sufficient adjustments starting from the initial value (anchor) to yield the final answer" [28].
Decades of research have resulted in a robust finding that humans are prone to cognitive biases when engaged in decision-making [54, 28, 22, 15, 62, 17]; these biases are heuristics that help humans reach decisions quickly [28]. To the best of our knowledge, the impact of anchoring bias when humans evaluate conversational agent output has not heretofore been studied, even as human evaluation has become an integral piece of most current research evaluating chatbots.

To investigate the effects of cognitive biases, specifically anchoring bias, on decision-making around evaluating chatbot output, we designed a 2 × 2 between-subjects study. Our main findings are as follows:

• We find systematic effects of anchoring in the magnitude of participants' ratings: participants who are presented with an anchor will provide a rating that is closer to the anchor value than those who are not presented with an anchor.

• We find systematic effects of anchoring in the consistency of participants' ratings: participants who are presented with an anchor will be (generally) more consistent in their ratings than those who are not presented with an anchor.

• We find that interpretation of metrics affects consistency: participants were more consistent in their ratings on Readability than in their ratings on Coherence, potentially because the interpretation of Coherence is more subjective than that of Readability.

Our findings demonstrate the impact anchoring bias might have on the design of evaluation experiments. Along with exploring the impact of anchoring, we also provide insights into how prior experience of being involved in similar research studies, as well as time taken to complete the task, can affect rating consistency. Our findings have the potential to advance the field of human-agent interaction by extending the reproducibility of conversational agent evaluation experiments. The findings of this paper are applicable to other areas of natural language processing, including text summarization and story generation, that also rely on human evaluation to study the quality of algorithms. More broadly, the design of experiments in this paper can be adapted to investigate the effects of cognitive biases in a range of human-computer interaction tasks, building upon prior work in Explainable AI [60] and bias mitigation [51].
RELATED WORK
Our work relates to three primary areas of research; we present related work in each area below.
Cognitive Biases in Decision Making
Evaluating algorithm output is an inherently subjective task. Cognitive biases, simple heuristics that are effective but may lead to suboptimal decision-making, especially when uncertainty is involved [54], are a critical concern but surprisingly understudied in the evaluation of conversational agent output. Cognitive biases were first introduced by Tversky and Kahneman, and have been studied extensively in the field of psychology [30, 54, 21]. One form of cognitive bias is anchoring bias, which occurs when humans rely on a single piece of information (the "anchor") to make a decision [29]. Tversky and Kahneman [54] found evidence that when individuals are asked to provide an estimate, their estimates stay close to the reference value, or anchor. Anchoring can thus affect decision-making in visual analytics [15, 62], valuations [1], and even general knowledge questions [25, 22]. For Natural Language Processing tasks, however, there has been little research studying the impact of anchoring bias. One prior study by Berzak et al. [12] evaluated the impact of anchoring bias in the creation of syntactic parsers. When it comes to evaluating the output of conversational agents, there has been no prior work on understanding the impact of cognitive biases. Our work is the first step in that direction.
Evaluation of Dialogue Systems
There are two main domains in which conversational agents are deployed: open-domain [20, 50, 2] and goal-oriented [40] conversational settings. Goal-oriented systems are designed to achieve a specific goal, such as restaurant booking [13] or movie ticket booking [38]. Open-domain systems, also known commonly as chit-chat systems, engage with a conversation partner towards no predetermined goal [63]. Typically, natural language generation in conversational agents is achieved by training seq2seq architectures [58, 52]. Prior research has shown that agents built using seq2seq frameworks suffer from generating dull and generic responses [58, 36]. Evaluating the quality of responses generated by these models in open-domain situations is thus an important area of research because it affects user satisfaction and engagement [59, 57].

To evaluate output automatically, researchers have adopted metrics such as BLEU [47], METEOR [5] and ROUGE [39] from machine translation and text summarization [41] tasks. BLEU, METEOR and ROUGE can be computed based on word overlap between the proposed and ground truth responses; however, they do not adequately account for the diversity of responses that are possible for a given input utterance. Experiments show that these automated metrics, along with word embedding based metrics [41], show little to no correlation with human ratings [41, 42]. With the lack of proper automated metrics, obtaining human ratings is a primary evaluation method for chatbots. Even with human evaluation, a variety of metrics have been proposed, including ease of answering [37], coherence [37], information flow [37], naturalness [2], fluency [63] and engagement [57]. Our current study builds upon this prior research and seeks to investigate the use of appropriate metrics in evaluating chatbots. As an experiment design choice, we also asked crowdsourced workers which metrics they would themselves consider most important while undertaking these tasks.
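The word-overlap limitation is easy to demonstrate. The sketch below is a minimal illustration (assuming NLTK is installed; the example sentences are invented) of how BLEU rewards surface overlap with a single reference and can score a perfectly sensible alternative reply near zero:

```python
# Minimal sketch of BLEU's word-overlap behavior (assumes NLTK).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i love hiking on weekends".split()
overlapping_reply = "i love hiking every weekend".split()
valid_but_different_reply = "trail walks are my favorite pastime".split()

smooth = SmoothingFunction().method1  # avoids hard zeros on short sentences

# High n-gram overlap with the reference -> relatively high BLEU.
print(sentence_bleu([reference], overlapping_reply, smoothing_function=smooth))
# A sensible reply with different wording -> near-zero BLEU.
print(sentence_bleu([reference], valid_but_different_reply, smoothing_function=smooth))
```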
Experiment Design in Language Evaluation
Our focus in this paper is on experiment design. Our motivation to do so is based on prior research that demonstrated the effectiveness of different question types (e.g., continuous scales, magnitude estimation) for obtaining human ratings instead of using discrete scales (e.g., Likert scales) [46, 10, 9, 32]. Likert scales are widely used to obtain human ratings for conversational agent output [18, 63, 14]. However, Likert scales suffer from a number of limitations, such as inconsistencies in ratings by different annotators, scale region bias and fixed granularity [32, 48, 8]. Recent work by Novikova et al. [46] addresses the issue of inconsistency in ratings, although in goal-oriented systems. Their work demonstrates the effectiveness of using continuous scales towards increased consistency for language evaluation tasks. However, the extent to which anchoring bias may affect consistency has not been previously studied. Prior research from Novikova et al. [46] also demonstrates an increase in consistency when the rating tasks are split so that each metric is rated individually (rating Readability followed by rating Coherence). Taking inspiration from this, our experiment design has explicit conditions to investigate the effects of splitting the rating tasks.

To summarize this Related Work section: evaluation of dialogue system output relies increasingly on human evaluation, yet little research focuses on experiment design for this task. We also find very little work towards understanding the impact of cognitive biases that might affect ratings obtained from crowd-sourced workers. Our present study seeks to fill this research gap and propose better experiment design procedures for use by fellow researchers in this area.
CORPUS AND MODELS
To obtain ratings on conversational agent output, we trained three models from scratch to generate responses. Code for these models was made available by Dziri et al. [20] (https://github.com/nouhadziri/THRED). We first describe the corpus we used to train the models.
Corpus
We used the Reddit Conversational Corpus made available by Dziri et al. [20]. This corpus consists of conversations obtained from 95 different subreddits, curated out of 1.1M subreddits. The date range is a 20-month period from November 2016 until August 2018. Table 1 shows overall descriptive statistics of the corpus, where the average length of utterances is consistent across the Training, Validation and Test sets.
Table 1. Descriptive statistics of the corpus used in our experiments: the number of dialogues and the average length of utterances in the Training, Validation and Test sets.
Models
All three models used in our experiments are based on seq2seq approaches that contain an encoder and decoder component.
Seq2seq approaches are commonly used in language generation tasks, such as machine translation and dialogue generation. For dialogue generation, the encoder receives the input sequence $X = (x_1, x_2, \ldots, x_n)$. Each input token is passed through an LSTM [26] on the encoder side, which produces a hidden state representation (Eq. 1):

$$h_t^{enc} = f(h_{t-1}^{enc}, x_t) \qquad (1)$$

where $h_{t-1}^{enc}$ represents the previous hidden state and $f$ represents a non-linear activation function. The decoder uses the last hidden state of the encoder as its initial state, and output tokens are conditioned on the input (Eq. 2), where $y_{t-1}$ represents the ground truth input to the decoder:

$$s_t^{dec} = f(s_{t-1}^{dec}, y_{t-1}) \qquad (2)$$
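To make these two recurrences concrete, the sketch below unrolls an LSTM encoder and a teacher-forced decoder. This is our own minimal illustration, not the released training code of Dziri et al. [20]; the use of PyTorch and all names and dimensions are assumptions.

```python
# Minimal PyTorch sketch of the recurrences in Eqs. 1-2 (illustrative only).
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 128, 256
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTMCell(emb_dim, hidden_dim)  # plays the role of f in Eq. 1
decoder = nn.LSTMCell(emb_dim, hidden_dim)  # plays the role of f in Eq. 2

x = torch.randint(vocab_size, (7,))  # input token ids x_1 .. x_n
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)
for t in range(x.size(0)):
    # h_t^enc = f(h_{t-1}^enc, x_t)   (Eq. 1)
    h, c = encoder(embed(x[t]).unsqueeze(0), (h, c))

y = torch.randint(vocab_size, (5,))  # ground-truth response tokens
s, sc = h, c                         # decoder starts from the encoder's last state
for t in range(1, y.size(0)):
    # s_t^dec = f(s_{t-1}^dec, y_{t-1})   (Eq. 2, teacher forcing)
    s, sc = decoder(embed(y[t - 1]).unsqueeze(0), (s, sc))
```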
1. Seq2Seq: Our first model is a traditional seq2seq model with an attention mechanism. We use the attention mechanism proposed by Bahdanau et al. [4]. Attention assists the decoder to attend to different parts of the input while generating the response. The decoder produces a context vector $c_t$ at each time step by attending to the encoder hidden states $h_t^{enc}$ along with the last hidden state of the decoder $s_{t-1}$ (Eq. 3), where $\alpha$ represents the relative importance on the input side; a sketch of this computation appears after the model descriptions below. The output from the model $y_t$ is produced through a softmax function (Eq. 4):

$$c_t = \sum_{i=1}^{n} \alpha_i h_i^{enc}, \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad e_i = f(s_{t-1}, h_i) \qquad (3)$$

$$y_t = \mathrm{softmax}(y_{t-1}, s_t, c_t) \qquad (4)$$
2. HRED: Our second model uses the Hierarchical Encoder-Decoder (HRED) [50] architecture. This model is an advancement over traditional seq2seq models: HRED overcomes the bottlenecks of traditional seq2seq models by capturing longer context from dialogue histories. The HRED model introduces a two-level hierarchy to capture long-term context. The first layer, called the utterance layer, captures the meaning of each sentence, similar to traditional seq2seq models. The hidden states of the utterance layer are further encoded by the inter-utterance layer, which captures the context and input information [53].
3. THRED: Our last model is the Topic Augmented Hierarchical Encoder-Decoder [20]. This model uses topic words along with a hierarchical encoder-decoder to produce a response. The topic words were obtained using a pre-trained LDA model [27]. This model also makes use of an attention mechanism over the context along with the topic words from the input sequence.
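The following is a minimal sketch of the additive attention in Eq. 3 (the computation referenced under model 1 above). It is written in PyTorch as an assumption; tensor names and sizes are illustrative, not the authors' implementation.

```python
# Minimal PyTorch sketch of additive (Bahdanau-style) attention, Eq. 3.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 256
W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects decoder state
W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects encoder states
v = nn.Linear(hidden_dim, 1, bias=False)             # scoring vector

enc_states = torch.randn(7, hidden_dim)  # h_1 .. h_n from the encoder
s_prev = torch.randn(1, hidden_dim)      # previous decoder state s_{t-1}

# e_i = f(s_{t-1}, h_i): one additive score per encoder position.
e = v(torch.tanh(W_s(s_prev) + W_h(enc_states))).squeeze(-1)
alpha = F.softmax(e, dim=0)              # relative importance of input positions
context = (alpha.unsqueeze(-1) * enc_states).sum(dim=0)  # c_t = sum_i alpha_i h_i^enc
# Eq. 4 would then produce y_t with a softmax over the vocabulary,
# conditioned on (y_{t-1}, s_t, c_t).
```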
Sample Output from Models:
In Figure 1 (top-left), we show a sample conversation from the Reddit Corpus. It consists of two sentences, spoken by Person A and Person B. The corpus also provides the target (or gold-standard) response against which the model can be trained, and against which performance can be evaluated. This is shown as the Standard Response in the Figure 1 screenshot. In the bottom of the screenshot, the output from the three generative models described in this section is shown (seq2seq, HRED and THRED output in Response 1, 2 and 3 respectively).
EXPERIMENT DESIGN
Having obtained the outputs from our three models, we built an interface to allow participants to evaluate the generated responses. We initially focus on two metrics: Readability and Coherence, which are frequently used in obtaining evaluation ratings from crowd-sourced workers [45, 24, 63, 2]. Readability measures the linguistic quality of text and helps quantify the difficulty of understanding the text by the reader [23, 45]. Coherence measures the ability to produce responses consistent with the topic or context of the conversation [57].

Based on prior findings of the limitations of Likert scales [10], we instead use magnitude estimation (ME) questions to obtain ratings from crowdsourced workers. Magnitude estimation allows participants to rate the responses on a free scale, without being constrained. Recently, Novikova et al. [46] demonstrated that the use of magnitude estimation helps improve consistency amongst crowd-sourced workers when evaluating responses from goal-oriented systems. We build upon this prior work but specifically focus on investigating the impact of cognitive biases in the design of our experiments.

Accordingly, we design four experiment conditions along two factors: Anchor (With or Without Anchor) and Presentation Order (Both Questions or Single Question on a single screen). Table 2 shows the four experiment conditions in our design, while Figure 1 shows two sample screenshots from the study interface.
                           No Anchor   Anchor
Both Questions (Setup 1)      18         22
Single Question (Setup 2)     18         19

Table 2. 2 × 2 experiment design with the four experiment conditions and the number of participants in each condition.

As shown in Figure 1, participants across all experiment conditions are shown the Conversation Context (A). Participants in the Anchor conditions are shown the Standard Response and the Readability and Coherence value of the Standard Response (set to 100 in this study, following prior work by [46]); together these form the Numerical and Textual Anchor (B) (Figure 1, left). Participants in the No Anchor conditions are shown neither the Standard Response nor the Readability and Coherence value of the standard response (B') (Figure 1, right). Participants in the Both Questions (Setup 1) condition are asked to input their ratings of Readability and Coherence on a single screen (C) (as shown in Figure 1, left). Participants in the Single Question (Setup 2) condition are asked to input their rating on a single metric on a single screen (as shown in Figure 1, right (C') for Readability), and then input their rating on the Coherence metric on the next screen when they click the next button (not shown).

Figure 2 provides the flow of steps taken by workers in the experiment, beginning with the informed consent procedure and pre-questionnaire, followed by the task of evaluating 50 sets of outputs on the two metrics of Readability and Coherence, and ending with the post-questionnaire. In the pre-questionnaire, we asked two questions about the prior experience of workers: (Q1) Have you taken part in previous studies that involve evaluating conversational responses? and (Q2) Have you taken part in previous studies that involve talking to a chatbot? Our motivation behind asking these questions is to understand if prior experience participating in similar studies affects inter-rater consistency. In the post-questionnaire, we obtain participant demographics including age, gender, race, and education. We also ask whether they find it preferable to provide ratings as magnitude estimation questions or on Likert scales. In addition, we obtain their free-form responses on which metrics they would consider important for evaluating conversational agent output. These post-questionnaire questions are designed to obtain qualitative data to better inform our future studies.
Research Questions
Following the review of prior work in this area and our decisions on the experiment design, we developed three main research questions for our study.

• RQ1: Which factors affect the magnitude of ratings provided by the participants? Rationale: The presence of an anchor may orient participants towards that number (100) and also the reference text; thus we expect that participants in the anchoring conditions will have higher ratings (closer to 100) than participants in the no anchor conditions. In addition, we investigate if the presentation order of questions (Setup 1 vs. Setup 2) has an effect on how high participants' ratings are on the task. We also investigate whether the time to complete the task has any effect on the magnitude of ratings. We use the responses on the pre-questionnaire about prior experience to analyze whether having taken part in similar studies or conversing with a chatbot has any effect on the magnitude of ratings.

• RQ2: Which factors affect the consistency in ratings provided by participants? Rationale: Similar to RQ1, we expect that the presence of an anchor may orient participants towards that number (100) and also the reference text; thus we expect that participants in the anchoring conditions will have higher consistency in their ratings than participants in the no anchor conditions. In addition, we investigate if the presentation order of questions (Setup 1 vs. Setup 2) has an effect on the consistency of ratings on the task. We also investigate whether the time to complete the task and prior experience affect inter-rater consistency.

• RQ3: Are participants more consistent in their ratings of readability than coherence? Rationale: Across both setups, we expect higher consistency in readability ratings than in coherence ratings. We also expect the impact of anchoring to be more pronounced for readability than for coherence. We contend coherence is more subjective to evaluate than readability, since humans have to judge whether the response is related to the context of the conversation [19, 45]. Readability, on the other hand, has been evaluated across other fields through automated metrics and is more well-defined [31].

Figure 1. Sample screens showing variations in the experiment conditions. (A) represents the conversational context that is shown across all conditions. (B) is the numerical and textual anchor presented to participants in anchoring conditions. (B') shows the screenshot of conditions where no anchor is presented. (C) is used in Setup 1, where both questions of readability and coherence ratings are shown together. (C') is used in Setup 2, where readability and coherence are treated as individual tasks and only one is shown at a time to the participant.

Figure 2. The experiment flow for each crowd-sourced worker taking part in this study.
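To make the contrast in RQ3 concrete: readability has classic closed-form measures from the line of work cited above [31]. One such formula, shown here purely for illustration (it was not used in our study), is the Flesch Reading Ease score, which depends only on surface counts:

$$\text{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

No comparably simple closed-form measure exists for coherence, which is consistent with our expectation that coherence is the more subjective metric.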
RESULTS
We present the results of our analysis in this section. We begin by describing the pool of participants we recruited and the quality checks we put in place to ensure high-quality crowd-sourced data.
Descriptive Statistics
Our study was approved under our institution's Institutional Review Board (IRB) policies. Participants were randomly assigned to an experiment condition. We allowed each participant a maximum of 4 hours to complete the study. In order to ensure high-quality data, we had stringent qualifying criteria: (1) workers should have a Masters qualification; (2) a HIT Approval Rate greater than 80; and (3) a number of approved HITs above a minimum threshold. Our study materials are available at https://github.com/sashank06/ConvEvaluation_CHI2020.

A total of 77 crowdsourced workers participated in our study. The gender distribution was 67.5% male (52), 31.17% female (24) and 1.33% other (1). The age of workers was between 20 and 60 years (mean = 34.85 years). A majority of the participants had an undergraduate degree. In terms of race, 37 participants were Indian, with White and Native American (n=1) participants making up the rest of the demographics.

In the pre-questionnaire, we also asked participants to indicate: (Q1) whether they had taken part in prior research studies evaluating conversational responses; and (Q2) whether they had taken part in prior studies interacting with a chatbot. Table 3 provides the number of participants' responses to the pre-questionnaire questions across both setups.
Analysis and Results for RQ1
Effects of anchor and type of setup on magnitude of ratings
We find significant differences between the magnitude of responses provided by participants across both setups. Ratings provided by participants in the no anchor conditions are significantly lower than ratings provided by participants in the anchor conditions: across both setups, responses in the no anchor conditions have a mean rating of 61.25, while responses in the anchor conditions have a mean of 69.02. We analyze the ratings on Readability and Coherence separately (Figure 4): the presence of numerical and textual anchors results in higher (on average) ratings than the absence of the anchor (statistically significant).

                      Question 1     Question 2
                      Yes    No      Yes    No
Setup 1  No Anchor     5     13       5     13
         Anchor        4     18       5     17
Setup 2  No Anchor     7     11       8     10
         Anchor        6     13       7     12

Table 3. Number of participants in each category: we refer to prior experience of evaluating conversational output as Question 1 and prior experience of engaging with chatbots as Question 2.

Figure 3. Mean of the responses bootstrapped with 95% confidence intervals across Setups 1 and 2.
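The bootstrapped confidence intervals reported in Figures 3 and 4 follow the standard percentile bootstrap. The sketch below is a minimal illustration (assuming NumPy; the ratings array is invented for the example):

```python
# Minimal sketch of a percentile bootstrap for a 95% CI around a mean.
import numpy as np

rng = np.random.default_rng(0)
ratings = np.array([72, 55, 90, 61, 100, 48, 83, 77, 65, 95], dtype=float)

boot_means = np.array([
    rng.choice(ratings, size=ratings.size, replace=True).mean()  # resample with replacement
    for _ in range(10000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% CI from the percentiles
print(f"mean = {ratings.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```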
Figure 4 presents ratings for the metrics of readability and coherence separately. We find that, across both setups, the difference between anchor and no anchor conditions is larger for the metric of readability than for coherence (statistically significant).

Effect of time taken to complete task on magnitude of ratings
We analyze the effects of time taken to complete the task on the magnitude of ratings. We find that participants who are presented with anchors spend more time on average taking the study than participants in no anchor conditions, across both setups. From the total of 77 participants, the mean time taken to complete the study was 57.17 minutes (see Figure 5). In Setup 1, we find that participants took an average of 66 minutes in the anchor condition and an average of 54.83 minutes in the no anchor condition. Similarly, in Setup 2 we find participants took an average of 54.94 minutes in the anchor condition and 50.94 minutes in the no anchor condition.
Figure 4. Mean of the responses bootstrapped with 95% confidence intervals across Setups 1 and 2 on the metrics of Readability and Coherence.
                      Below Average   Above Average
Setup 1  No Anchor      7 (71.19)      11 (39.65)
         Anchor        11 (73.53)      11 (72.35)
Setup 2  No Anchor      4 (61.96)      14 (58.75)
         Anchor         5 (64.02)      14 (83)

Table 4. Number of participants who spent below and above average time across conditions, with their average rating values in parentheses.
Next, we grouped the participants into two categories based on the amount of time spent: (1) Below Average, when participants spend less than the mean time; and (2) Above Average, when participants spend more than the mean time. Table 4 provides the number of participants based on the time spent across the experiment conditions. Across both setups, we find that people in the above average group show significant differences in their responses. In Setup 1, in the above average group, the mean of responses in the no anchor condition was 39.65 and the mean of responses in the anchor condition was 72.35. We find similar evidence in Setup 2, with people in the anchor condition providing higher values (83), close to the numerical anchor (100). We note, however, that the sample sizes in the Below Average groups in Setup 2 are smaller (4 and 5 participants respectively, cf. Table 4); more experimentation is needed to further substantiate this finding.
Effect of prior experience on magnitude of ratings
Figure 7 demonstrates the impact of prior experience of evaluating conversational responses (Question 1 on the pre-questionnaire) on the magnitude of ratings. We find contrasting responses across the two setups. In Setup 1, we find that people with prior experience in the anchor condition produce higher responses (M=74.41), close to the numerical anchor (100), while those in the no anchor condition produce lower values (M=38.36); people with no prior experience are similar in their responses across both conditions. In comparison to Setup 1, we find that in Setup 2 participants with no prior experience produce higher responses in both the anchor condition (M=71.45) and the no anchor condition (M=63.74).

Figure 8 shows the impact of prior experience of interacting with chatbots. Participants who have such prior experience demonstrated signs of anchoring. We find the mean of responses for participants with prior experience in the anchor condition (M=80.40) to be significantly higher; however, this effect is only seen in Setup 1, while Setup 2 demonstrates the opposite effect. We find this evidence to be particularly interesting and plan to further investigate the potential of eliciting ratings on different metrics as separate tasks (Setup 2) as a means of mitigating the anchoring bias effect.

Figure 5. Average time taken to complete the task across the four experiment conditions. The overall average (57.17 minutes) is shown as a dashed line in the graph.

Analysis and Results for RQ2
We measure consistency of ratings using the intra-class correlation measure (ICC) [34]. Following Bard et al. [7], we perform a log normalization of the scores obtained using the magnitude estimation method across both setups.
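A minimal sketch of this analysis pipeline is shown below: log-normalize the magnitude-estimation scores, then compute the intra-class correlation across raters. The use of the pingouin library and all column names are our assumptions, not the authors' analysis code.

```python
# Minimal sketch: log normalization (following Bard et al. [7]) plus ICC.
import numpy as np
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "item":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],   # response being rated
    "rater": ["a", "b"] * 5,
    "score": [100.0, 90.0, 40.0, 55.0, 70.0, 65.0, 20.0, 30.0, 85.0, 80.0],
})
df["log_score"] = np.log(df["score"])          # log normalization of ME scores

icc = pg.intraclass_corr(data=df, targets="item", raters="rater", ratings="log_score")
print(icc[["Type", "ICC", "pval"]])            # one row per ICC variant
```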
Effects of anchor and type of setup on consistency of ratings
Table 5 presents the ICC scores obtained across both setups on the metrics of readability and coherence. We find that there is a significant increase in the consistency of the ratings in the anchor condition in Setup 1. The consistency values obtained in Setup 2 for readability and coherence show mixed results. We find that the no anchor condition of Setup 2 produces more consistency in ratings for the readability metric, whilst on the metric of coherence there is extremely low consistency between the raters when they are presented with no anchors. However, we see a significant increase in consistency for Setup 2 when participants are in the anchoring condition.

                           Readability   Coherence
Setup 1  No Anchor (n=18)     0.74          0.76
         Anchor (n=22)        0.921         0.855
Setup 2  No Anchor (n=18)     0.874         0.151
         Anchor (n=19)        0.835         0.727

Table 5. ICC scores on the metrics of readability and coherence for each experiment condition. All values are statistically significant.

Figure 6. Mean of the responses bootstrapped with 95% confidence intervals across Setups 1 and 2 based on amount of time spent on the study.

Effect of time taken to complete task on consistency of ratings
We look at the role of the external factors of time and prior experience on the consistency of the ratings provided. Table 6 presents the ICC scores on the metrics of readability and coherence across both setups. We group participants into the Above Average and Below Average groups based on the amount of time spent in the study (cf. Table 4).

Surprisingly, we find that people who spend below average time achieve higher consistency in their ratings across both setups. However, we do notice some differences between the two setups. In Setup 1, we find that amongst participants in the below average group, the participants in the anchor condition have higher consistency than participants in the no anchor condition. Similarly, we find that people who spend above average time in Setup 1 in the anchor condition achieve higher consistency when compared to the above average group in Setup 1 in the no anchor condition. However, in Setup 2 we find people who spend above average time have a poor consistency score on the metric of coherence, a possible indication that coherence is highly subjective.

Figure 7. Mean of the responses bootstrapped with 95% confidence intervals across Setups 1 and 2 based on prior experience of being involved in studies about evaluating conversations.
Effect of prior experience on consistency of ratings
Table 7 provides an overview of the consistency on the readability and coherence metrics based on participants' prior experience of taking part in studies about evaluating conversations, across both setups. We find that participants with no prior experience of evaluating conversations tend, across both setups, to have higher consistency than participants with prior experience, irrespective of the experimental condition assigned. When comparing within the anchor conditions across both setups, we find that participants with no prior experience of evaluating conversations achieve higher consistency in Setup 2, and participants with prior experience of evaluating conversations achieve higher consistency on the readability metric in Setup 1.

Table 8 gives an overview of the consistency on the readability and coherence metrics based on participants' prior experience of taking part in studies involving engagement with a chatbot. Compared to Table 7, we find that participants with prior experience of engaging with chatbots achieve higher consistency across both setups irrespective of the experiment condition, except in the Setup 2 anchoring condition. Also, we find the anchoring condition enables participants to achieve higher consistency across both Setup 1 and Setup 2. We find that, irrespective of participants' prior experience, anchoring helps achieve higher consistency. This provides further evidence that the presence of an anchor helps towards achieving higher consistency in this experiment design. Tables with confidence intervals for Figures 3, 4, 6, 7 and 8 are included in our GitHub repository.

Figure 8. Mean of the responses bootstrapped with 95% confidence intervals across Setups 1 and 2 based on prior experience of being involved in studies about talking to a chatbot.
Analysis and Results for RQ3
As shown in Table 5, we see that readability has higher consistency than coherence in both setups. We also notice the significant impact anchoring has towards increasing consistency of ratings. It appears harder to agree upon the more subjective metric of coherence without any textual or numerical anchor. We also suspect an impact of the instructions on consistency. In the instructions screen of our study, Readability was defined as: Is the response easy to understand, fluent and grammatical and does not have any consecutive repeating words (following [45, 46]), which provides clear indicators for evaluating a response on the metric of readability. Coherence was defined as: Is the response relevant to the topic and context of the conversation (following [20, 57]), making it more subjective.
DISCUSSION AND LIMITATIONS
In this section, we discuss the implications of our results on the anchoring effect in dialogue evaluation, and point out possible limitations related to the study design and analysis.
Implication of experiment results
Our key findings indicate that the presence of numerical and textual anchors significantly influences ratings across the two different experiment setups. We find the effect of anchoring is more pronounced in instances when participants are asked to provide ratings on two metrics at the same time (Both Questions/Setup 1), and the effect of anchoring is slightly less pronounced when participants are asked to provide ratings for a single metric on a single screen (Single Question/Setup 2). Our findings have implications for potential future experiment designs that are geared towards evaluating the performance of dialogue systems, if ratings are to be elicited on multiple dimensions, such as Readability and Coherence.

Additionally, the external factors of time taken to complete the study and participants' prior experience of having taken part in research studies, either about evaluation of or engagement with a chatbot, were found to impact the magnitude of the responses and the consistency of the ratings. We find that participants who spend more than the average time on the study (above average) get anchored and also exhibit low consistency scores on the metrics of readability and coherence.

We notice that the choice of metrics to evaluate also has an impact on consistency. We see that ratings for the more subjective metric of coherence are less consistent than those for readability amongst the raters across all conditions and setups.

We also analyzed the data from the post-questionnaire questions asking participants which method of rating they preferred to work with. Of the 77 participants, 42 preferred the magnitude estimation method and 35 preferred the Likert scale method. Prior research has shown that continuous scale methods like magnitude estimation do offer advantages [10, 46], and they need to be explored further for the purposes of evaluation. Consistent with the prior work in this area, we also find similar advantages provided by magnitude estimation across both our setups, with an increase in the consistency of the ratings provided by the crowd-sourced workers. These findings and the participants' feedback on their own preferences lead us to recommend magnitude estimation for future evaluation design of conversational agents.

Condition           Time Taken              Readability   Coherence
Setup 1 No Anchor   Below Average (n=11)       0.75          0.63
                    Above Average (n=7)        0.23          0.59
Setup 1 Anchor      Below Average (n=11)       0.86          0.785
                    Above Average (n=11)       0.83          0.68
Setup 2 No Anchor   Below Average (n=14)       0.85         -0.03†
                    Above Average (n=4)        0†            0†
Setup 2 Anchor      Below Average (n=14)       0.726         0.76
                    Above Average (n=5)        0.556        -0.20†

Table 6. ICC scores on the metrics of readability and coherence based on the amount of time spent in the study across both conditions. All values are statistically significant except those marked with †.

Condition           Prior experience evaluating conversations?   Readability   Coherence
Setup 1 No Anchor   Yes (n=5)                                       0.44          0.71
                    No (n=13)                                       0.67          0.62
Setup 1 Anchor      Yes (n=4)                                       0.61          0.52
                    No (n=18)                                       0.91          0.84
Setup 2 No Anchor   Yes (n=7)                                       0.77         -0.88†
                    No (n=11)                                       0.71          0.46
Setup 2 Anchor      Yes (n=6)                                      -0.2†          0.65
                    No (n=13)                                       0.93          0.86

Table 7. ICC scores on the metrics of readability and coherence based on participants' prior experience of taking part in research studies about evaluating conversations. All values are statistically significant except those marked with †.

Condition           Prior experience interacting with chatbots?   Readability   Coherence
Setup 1 No Anchor   Yes (n=5)                                        0.73          0.75
                    No (n=13)                                        0.55          0.58
Setup 1 Anchor      Yes (n=5)                                        0.89          0.69
                    No (n=17)                                        0.87          0.79
Setup 2 No Anchor   Yes (n=8)                                        0.85         -0.163†
                    No (n=10)                                        0.58         -0.48
Setup 2 Anchor      Yes (n=7)                                       -0.2†          0.49
                    No (n=12)                                        0.91          0.82

Table 8. ICC scores on the metrics of readability and coherence based on participants' prior experience of taking part in research studies talking to a chatbot. All values are statistically significant except those marked with †.

Limitations
We acknowledge a few limitations of our work. First, we consider only two metrics for evaluation of conversational agents. In reality, there may be more metrics that are better designed to evaluate the performance of conversational agents. Second, we acknowledge that this study is exploratory; understanding the impact of anchoring bias in the evaluation of conversational agents is in its infancy. For future studies, we plan to pre-register our study to improve the validity of our findings [33]. Third, we study the effect of anchoring; however, we provide both numerical and textual anchors. Although we find an impact of anchoring, we are unable to determine whether the numerical or the textual anchor is causing this effect. To address this, we are planning an extension study with additional experiment conditions so that we can study the impact of textual and numerical anchors separately.

Figure 9. Participant ratings on which metrics they considered important for conversational output evaluation. The y-axis represents the % of importance.
Future Work
The results of our study offer insights into the challenging task of designing experiments for the evaluation of dialogue systems and understanding their impact. To provide additional information for possible future directions, we also asked the participants in our study to rank the metrics that they considered important for the output of conversational agents (Figure 9), including Readability and Coherence. We asked them to rate their preferences in order of importance for the following metrics: Readability, Coherence, Novelty, Diversity, Specificity and Engagement. These are some of the commonly used metrics in research articles that develop and evaluate conversational agent output. We notice that readability and coherence are considered very important, but other metrics such as engagement and specificity are also worth investigating. Possible extensions to our work would include specificity and engagement metrics, based on this evidence. Past research by See et al. [49] specifies metrics including specificity and engagement/interestingness and shows how these metrics could impact the training process of a model.
SUMMARY
Evaluation of dialogue systems is an extremely challenging task, since automated metrics do not adequately capture the nuances related to natural language and its production. However, prior research has not focused on the impact that experiment design has on qualitative dialogue evaluation.

Our findings are a step towards understanding the impact of experiment design and the possible role of cognitive biases, such as anchoring bias, in dialogue evaluation. Cognitive biases could be the result of System 1 thinking (Type 1 processing), which is considered to be relatively fast, relatively low on cognitive demand, and often based on intuition. By contrast, System 2 thinking (or Type 2 processing) is considered to be the result of systematic thinking and reasoning. Our results, however, indicate that participants who spent less time on the task had higher consistency of ratings than those who took longer. One possible experiment to identify the effects of Type 1 vs. Type 2 processing is to design an experiment condition which explicitly triggers intuitive responses (Type 1) by imposing a strict and challenging response deadline. Bago and De Neys [3] observed in their experiments that participants gave correct, logical responses as the first, immediate response when Type 1 vs. Type 2 processing was explicitly triggered for logic problems. Capturing time taken per question in the interaction logs would allow us to collect the data that supports this investigation.

We specifically investigate the impact of anchoring bias in our experiment, to determine its effects on the consistency measure across participants. By separately analyzing the effect of the presence or absence of anchors as well as the presentation order of questions, we are able to make design recommendations for future experiments on dialogue evaluation. We focus on the metrics of readability and coherence, but our proposed experiment design can be extended to multiple other metrics. In addition, our study suggests that the external factors of time and prior experience of taking part in research studies, about evaluation of responses or engagement with chatbots, have a significant impact on the responses provided and on consistency.
Acknowledgments
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8650-18-C-7881. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of AFRL, DARPA, or the U.S. Government. We thank the anonymous reviewers for the helpful feedback.
REFERENCES

[1] Dan Ariely, George Loewenstein, and Drazen Prelec. 2003. "Coherent arbitrariness": Stable demand curves without stable preferences. The Quarterly Journal of Economics.

[2] Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective Neural Response Generation. In European Conference on Information Retrieval. Springer, 154–166.

[3] Bence Bago and Wim De Neys. 2017. Fast logic?: Examining the time course assumption of dual process theory. Cognition 158 (2017), 90–109.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[5] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.

[6] Srinivas Bangalore and Owen Rambow. 2000. Corpus-based Lexical Choice in Natural Language Generation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL '00). Association for Computational Linguistics, Stroudsburg, PA, USA, 464–471. DOI: http://dx.doi.org/10.3115/1075218.1075277

[7] Ellen Gurman Bard, Dan Robertson, and Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language (1996), 32–68.

[8] Hans Baumgartner and Jan-Benedict EM Steenkamp. 2001. Response styles in marketing research: A cross-national investigation. Journal of Marketing Research 38, 2 (2001), 143–156.

[9] Anja Belz and Eric Kow. 2010. Comparing rating scales and preference judgements in language evaluation. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics, 7–15.

[10] Anja Belz and Eric Kow. 2011. Discrete vs. continuous rating scales for language evaluation in NLP. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, 230–235.

[11] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, Feb (2003), 1137–1155.

[12] Yevgeni Berzak, Yan Huang, Andrei Barbu, Anna Korhonen, and Boris Katz. 2016. Anchoring and Agreement in Syntactic Annotations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2215–2224. DOI: http://dx.doi.org/10.18653/v1/D16-1239

[13] Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 (2016).

[14] Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. 2018. Hierarchical variational memory network for dialogue generation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1653–1662.

[15] Isaac Cho, Ryan Wesslen, Ali Karduni, Sashank Santhanam, Samira Shaikh, and Wenwen Dou. 2017. The Anchoring Effect in Decision-Making with Visual Analytics. In IEEE Conference on Visual Analytics Science and Technology (VAST).

[16] Kenneth Mark Colby. 1975. Artificial Paranoia: A Computer Simulation of Paranoid Process. Pergamon Press.

[17] Evanthia Dimara, Steven Franconeri, Catherine Plaisant, Anastasia Bezerianos, and Pierre Dragicevic. 2018. A task-based taxonomy of cognitive biases for information visualization. IEEE Transactions on Visualization and Computer Graphics (2018).

[18] Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. arXiv preprint arXiv:1811.01241 (2018).

[19] Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, and Osmar Zaiane. 2019. Evaluating Coherence in Dialogue Systems using Entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3806–3812. DOI: http://dx.doi.org/10.18653/v1/N19-1381

[20] Nouha Dziri, Ehsan Kamalloo, Kory W Mathewson, and Osmar Zaiane. 2018. Augmenting Neural Response Generation with Context-Aware Topical Attention. arXiv preprint arXiv:1811.01063 (2018).

[21] Geoffrey Ellis. 2018. Cognitive Biases in Visualizations. Springer.

[22] Adrian Furnham and Hua Chu Boo. 2011. A literature review of the anchoring effect. The Journal of Socio-Economics 40, 1 (2011), 35–42.

[23] Albert Gatt and Emiel Krahmer. 2018. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation. J. Artif. Int. Res. http://dl.acm.org/citation.cfm?id=3241691.3241693

[24] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.

[25] T Gilovich, N Robert, and A Amos. 2001. Putting adjustment back into the anchoring and adjustment heuristic: Differential processing of self-generated and experimenter-provided anchors. Psychological Science (2001).

[26] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

[27] Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems. 856–864.

[28] Daniel Kahneman. 2003. A perspective on judgment and choice: mapping bounded rationality. American Psychologist 58, 9 (2003), 697.

[29] Daniel Kahneman. 2016. 36. Heuristics and Biases. Scientists Making a Difference: One Hundred Eminent Behavioral and Brain Scientists Talk about Their Most Important Contributions (2016), 171.

[30] Daniel Kahneman and Amos Tversky. 1972. Subjective probability: A judgment of representativeness. Cognitive Psychology 3, 3 (1972), 430–454.

[31] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. (1975).

[32] Svetlana Kiritchenko and Saif Mohammad. 2017. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 465–470. DOI: http://dx.doi.org/10.18653/v1/P17-2074

[33] Robert Kosara and Steve Haroz. 2018. Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them. (2018).

[34] J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.

[35] Irene Langkilde-Geary and Kevin Knight. 2002. Halogen statistical sentence generator. In Proceedings of the ACL-02 Demonstrations Session. 102–103.

[36] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 110–119. DOI: http://dx.doi.org/10.18653/v1/N16-1014

[37] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016b. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1192–1202. DOI: http://dx.doi.org/10.18653/v1/D16-1127

[38] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-End Task-Completion Neural Dialogue Systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 733–743.

[39] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).

[40] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2018. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence.

[41] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2122–2132. DOI: http://dx.doi.org/10.18653/v1/D16-1230

[42] Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1116–1126. DOI: http://dx.doi.org/10.18653/v1/P17-1103

[43] Susan W McRoy, Songsak Channarukul, and Syed S Ali. 2003. An augmented template-based approach to text realization. Natural Language Engineering 9, 4 (2003), 381–420.

[44] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2017. Coherent Dialogue with Attention-Based Language Models. In Proceedings of the National Conference on Artificial Intelligence (AAAI). San Francisco, CA.

[45] Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why We Need New Evaluation Metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2241–2252. DOI: http://dx.doi.org/10.18653/v1/D17-1238

[46] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable Human Ratings for Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 72–78. DOI: http://dx.doi.org/10.18653/v1/N18-2012

[47] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311–318.

[48] Howard Schuman and Stanley Presser. 1996. Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Sage.

[49] Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 1702–1723. DOI: http://dx.doi.org/10.18653/v1/N19-1170

[50] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, Vol. 16. 3776–3784.

[51] Fabian Sperrle, Udo Schlegel, Mennatallah El-Assady, and Daniel Keim. 2019. Human Trust Modeling for Bias Mitigation in Artificial Intelligence. In ACM CHI 2019 Workshop: Where is the Human? Bridging the Gap Between AI and HCI.

[52] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.

[53] Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 231–236. DOI: http://dx.doi.org/10.18653/v1/P17-2036

[54] Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science 185, 4157 (1974), 1124–1131.

[55] Kees van Deemter, Emiel Krahmer, and Mariët Theune. 2005. Real versus template-based natural language generation: a false opposition? Computational Linguistics 31, 1 (2005), 15–24.

[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.

[57] Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, and others. 2018. On Evaluating and Comparing Conversational Agents. arXiv preprint arXiv:1801.03625 (2018).

[58] Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869 (2015).

[59] Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 271–280.

[60] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y Lim. 2019. Designing Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 601.

[61] Joseph Weizenbaum. 1966. ELIZA – a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (1966), 36–45.

[62] R Wesslen, S Santhanam, A Karduni, I Cho, S Shaikh, and W Dou. 2019. Investigating Effects of Visual Anchors on Decision-Making about Misinformation. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 161–171.

[63] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 2204–2213.