Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau
Ryan Lowe♥∗  Michael Noseworthy♥∗  Iulian V. Serban♦  Nicolas A.-Gontier♥  Yoshua Bengio♦‡  Joelle Pineau♥‡
♥ Reasoning and Learning Lab, School of Computer Science, McGill University  ♦ Montreal Institute for Learning Algorithms, Université de Montréal  ‡ CIFAR Senior Fellow
Abstract
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.

Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950).

∗ Indicates equal contribution. Code and pre-trained model parameters are available: github.com/mike-n-7/ADEM.
Context of Conversation
Speaker A: Hey, what do you want to do tonight?
Speaker B: Why don't we go see a movie?
Model Response
Nah, let's do something active.
Reference Response
Yeah, the film about Turing looks great!

Figure 1: Example where word-overlap scores fail for dialogue evaluation; although the model response is reasonable, it has no words in common with the reference response, and thus would be given low scores by metrics such as BLEU.
Research on one type of such systems, sometimes referred to as non-task-oriented dialogue systems, goes back to the mid-60s with Weizenbaum's famous program ELIZA: a rule-based system mimicking a Rogerian psychotherapist by persistently either rephrasing statements or asking questions (Weizenbaum, 1966). Recently, there has been a surge of interest towards building large-scale non-task-oriented dialogue systems using neural networks (Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015). These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus. Such models have already had a substantial impact in industry, including Google's Smart Reply system (Kannan et al., 2016), and Microsoft's Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users.

One of the challenges when developing such systems is to have a good way of measuring progress, in this case the performance of the chatbot. The Turing test provides one solution to the evaluation of dialogue systems, but there are limitations with its original formulation. The test requires live human interactions, which is expensive and difficult to scale up. Furthermore, the test requires carefully designing the instructions to the human interlocutors, in order to balance their behaviour and expectations so that different systems may be ranked accurately by performance. Although unavoidable, these instructions introduce bias into the evaluation measure. The more common approach of having humans evaluate the quality of dialogue system responses, rather than distinguish them from human responses, induces similar drawbacks in terms of time, expense, and lack of scalability. In the case of chatbots designed for specific conversation domains, it may also be difficult to find sufficient human evaluators with appropriate background in the topic (Lowe et al., 2015).

Despite advances in neural network-based models, evaluating the quality of dialogue responses automatically remains a challenging and under-studied problem in the non-task-oriented setting. The most widely used metric for evaluating such dialogue systems is BLEU (Papineni et al., 2002), a metric measuring word overlaps originally developed for machine translation. However, it has been shown that BLEU and other word-overlap metrics are biased and correlate poorly with human judgements of response quality (Liu et al., 2016). There are many obvious cases where these metrics fail, as they are often incapable of considering the semantic similarity between responses (see Figure 1). Despite this, many researchers still use BLEU to evaluate their dialogue models (Ritter et al., 2011; Sordoni et al., 2015b; Li et al., 2015; Galley et al., 2015; Li et al., 2016a), as there are few alternatives available that correlate with human judgements. While human evaluation should always be used to evaluate dialogue models, it is often too expensive and time-consuming to do this for every model specification (for example, for every combination of model hyperparameters). Therefore, having an accurate model that can evaluate dialogue response quality automatically — what could be considered an automatic Turing test — is critical in the quest for building human-like dialogue agents.

To make progress towards this goal, we make the simplifying assumption that a 'good' chatbot is one whose responses are scored highly on appropriateness by human evaluators. We believe this is sufficient for making progress as current dialogue systems often generate inappropriate responses.
We also find empirically that asking evaluators for other metrics results in either low inter-annotator agreement, or scores that are highly correlated with appropriateness (see supp. material). Thus, we collect a dataset of appropriateness scores for various dialogue responses, and we use this dataset to train an automatic dialogue evaluation model (ADEM). The model is trained in a semi-supervised manner using a hierarchical recurrent neural network (RNN) to predict human scores. We show that ADEM scores correlate significantly with human judgement at both the utterance level and system level. We also show that ADEM can often generalize to evaluating new models, whose responses were unseen during training, making ADEM a strong first step towards effective automatic dialogue response evaluation.

# of responses                           4,104
κ score (inter-annotator correlation)    0.63

Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score).

To train a model to predict human scores to dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT). The aim is to have accurate human scores for a variety of conversational responses — conditioned on dialogue contexts — which span the full range of response qualities. For example, the responses should include both relevant and irrelevant responses, both coherent and non-coherent responses, and so on. To achieve this variety, we use candidate responses from several different models. Following (Liu et al., 2016), we use the following 4 sources of candidate responses: (1) a response selected by a TF-IDF retrieval-based model, (2) a response selected by the Dual Encoder (DE) (Lowe et al., 2015), (3) a response generated using the hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a), and (4) human-generated responses. It should be noted that the human-generated candidate responses are not the reference responses from a fixed corpus, but novel human responses that are different from the reference. In addition to increasing response variety, this is necessary because we want our evaluation model to learn to compare the reference responses to the candidate responses. We provide the details of our AMT experiments in the supplemental material, including additional experiments suggesting that several other metrics are currently unlikely to be useful for building evaluation models. Note that, in order to maximize the number of responses obtained with a fixed budget, we only obtain one evaluation score per dialogue response in the dataset. (Code and trained model parameters are available online: https://github.com/mike-n-7/ADEM. All data collection was conducted in accordance with the policies of the host institutions' ethics board.)

To train evaluation models on human judgements, it is crucial that we obtain scores of responses that lie near the distribution produced by advanced models. This is why we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available. Further, the set of topics discussed is quite broad — as opposed to the very specific Ubuntu Dialogue Corpus (Lowe et al., 2015) — and therefore the model may also be suited to other chit-chat domains. Finally, since it does not require domain-specific knowledge (e.g. technical knowledge), it should be easy for AMT workers to annotate.
Recurrent neural networks (RNNs) are a type of neural network with time-delayed connections between the internal units. This leads to the formation of a hidden state h_t, which is updated for every input: h_t = f(W_hh h_{t-1} + W_ih x_t), where W_hh and W_ih are parameter matrices, f is a non-linear activation function such as tanh, and x_t is the input at time t. The hidden state allows RNNs to better model sequential data, such as language.

In this paper, we consider RNNs augmented with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). LSTMs add a set of gates to the RNN that allow it to learn how much to update the hidden state. LSTMs are one of the most well-established methods for dealing with the vanishing gradient problem in recurrent networks (Hochreiter, 1991; Bengio et al., 1994).

One of the most popular approaches for automatically evaluating the quality of dialogue responses is by computing their word overlap with the reference response. In particular, the most popular metrics are the BLEU and METEOR scores used for machine translation, and the ROUGE score used for automatic summarization. While these metrics tend to correlate with human judgements in their target domains, they have recently been shown to be highly biased and correlate very poorly with human judgements for dialogue response evaluation (Liu et al., 2016). We briefly describe BLEU here, and provide a more detailed summary of word-overlap metrics in the supplemental material.
BLEU
BLEU (Papineni et al., 2002) analyzes the co-occurrences of n-grams in the reference and the proposed responses. It computes the n-gram precision for the whole dataset, which is then multiplied by a brevity penalty to penalize short translations. For BLEU-N, N denotes the largest value of n-grams considered (usually N = 4).

Drawbacks

One of the major drawbacks of word-overlap metrics is their failure to capture the semantic similarity (and other structure) between the model and reference responses when there are few or no common words. This problem is less critical for machine translation; since the set of reasonable translations of a given sentence or document is rather small, one can reasonably infer the quality of a translated sentence by only measuring the word overlap between it and one (or a few) reference translations. However, in dialogue, the set of appropriate responses given a context is much larger (Artstein et al., 2009); in other words, there is a very high response diversity that is unlikely to be captured by word-overlap comparison to a single response.

Further, word-overlap scores are computed directly between the model and reference responses. As such, they do not consider the context of the conversation. While this may be a reasonable assumption in machine translation, it is not the case for dialogue; whether a model response is an adequate substitute for the reference response is clearly context-dependent. For example, the two responses in Figure 1 are equally appropriate given the context. However, if we simply change the context to: "Have you heard of any good movies recently?", the model response is no longer relevant while the reference response remains valid.
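To make this failure mode concrete, here is a small illustration (not an experiment from the paper) that scores the Figure 1 example with NLTK's smoothed sentence-level BLEU; because the model response shares essentially no words with the reference, the metric assigns it a near-zero score even though the response is perfectly appropriate. The tokenization below is a simplification.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Figure 1 example: a reasonable model response with no content-word overlap
reference = "yeah , the film about turing looks great !".split()
model_response = "nah , let 's do something active .".split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], model_response,
                      weights=(0.5, 0.5),          # BLEU-2
                      smoothing_function=smooth)
print(score)  # close to 0: almost no n-gram overlap with the reference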
An Automatic Dialogue Evaluation Model (ADEM)

To overcome the problems of evaluation with word-overlap metrics, we aim to construct a dialogue evaluation model that: (1) captures semantic similarity beyond word-overlap statistics, and (2) exploits both the context and the reference response to calculate its score for the model response. We call this evaluation model ADEM.

Figure 2: The ADEM model, which uses a hierarchical encoder to produce the context embedding c.

ADEM learns distributed representations of the context, model response, and reference response using a hierarchical RNN encoder. Given the dialogue context c, reference response r, and model response r̂, ADEM first encodes each of them into vectors (c, r, and r̂, respectively) using the RNN encoder. Then, ADEM computes the score using a dot-product between the vector representations of c, r, and r̂ in a linearly transformed space:

score(c, r, r̂) = (c^T M r̂ + r^T N r̂ − α) / β    (1)

where M, N ∈ R^{n×n} are learned matrices initialized to the identity, and α, β are scalar constants used to initialize the model's predictions in the range [1, 5]. The model is shown in Figure 2.

The matrices M and N can be interpreted as linear projections that map the model response r̂ into the space of contexts and reference responses, respectively. The model gives high scores to responses that have similar vector representations to the context and reference response after this projection. The model is end-to-end differentiable; all the parameters can be learned by backpropagation. In our implementation, the parameters θ = {M, N} of the model are trained to minimize the squared error between the model predictions and the human scores, with L2 regularization:

L = Σ_{i=1:K} [score(c_i, r_i, r̂_i) − human_i]^2 + γ ||θ||_2    (2)

where γ is a scalar constant. The simplicity of our model leads to both accurate predictions and fast evaluation (see supp. material), which is important to allow rapid prototyping of dialogue systems.

The hierarchical RNN encoder in our model consists of two layers of RNNs (El Hihi and Bengio, 1995; Sordoni et al., 2015a). The lower-level RNN, the utterance-level encoder, takes as input words from the dialogue, and produces a vector output at the end of each utterance. The context-level encoder takes the representation of each utterance as input and outputs a vector representation of the context. This hierarchical structure is useful for incorporating information from early utterances in the context (Serban et al., 2016a). Following previous work, we take the last hidden state of the context-level encoder as the vector representation of the input utterance or context. The parameters of the RNN encoder are pretrained and are not learned from the human scores.

An important point is that the ADEM procedure above is not a dialogue retrieval model: the fundamental difference is that ADEM has access to the reference response. Thus, ADEM can compare a model's response to a known good response, which is significantly easier than inferring response quality from solely the context.
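The scoring function and training objective are simple enough to write down directly. The following NumPy sketch restates Eqs. (1) and (2); it assumes pre-computed, PCA-reduced embeddings of size n = 50 (see the pre-training section below), and the values of alpha, beta, and gamma here are illustrative placeholders rather than the constants used in the paper.

import numpy as np

def adem_score(c, r, r_hat, M, N, alpha=5.0, beta=5.0):
    # Eq. (1): (c^T M r_hat + r^T N r_hat - alpha) / beta
    return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

def adem_loss(batch, M, N, gamma=0.02, alpha=5.0, beta=5.0):
    # Eq. (2): squared error against human scores, with L2 regularization on
    # the only learned parameters theta = {M, N} (Frobenius norm used here).
    err = sum((adem_score(c, r, rh, M, N, alpha, beta) - y) ** 2
              for c, r, rh, y in batch)
    reg = gamma * (np.linalg.norm(M) ** 2 + np.linalg.norm(N) ** 2)
    return err + reg

# toy usage: M and N start at the identity, as in the paper
n = 50
rng = np.random.default_rng(0)
M, N = np.eye(n), np.eye(n)
c, r, r_hat = rng.normal(size=(3, n))
print(adem_score(c, r, r_hat, M, N))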
Pre-training with VHRED
We would like an evaluation model that can make accurate predictions from few labeled examples, since these examples are expensive to obtain. We therefore employ semi-supervised learning, and use a pre-training procedure to learn the parameters of the encoder. In particular, we train the encoder as part of a neural dialogue model; we attach a third decoder RNN that takes the output of the encoder as input, and train it to predict the next utterance of a dialogue conditioned on the context.

The dialogue model we employ for pre-training is the latent variable hierarchical recurrent encoder-decoder (VHRED) model (Serban et al., 2016b), shown in Figure 3. The VHRED model is an extension of the original hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a) with a turn-level stochastic latent variable. The dialogue context is encoded into a vector using our hierarchical encoder, and the VHRED then samples a Gaussian variable that is used to condition the decoder (see supplemental material for further details). After training VHRED, we use the last hidden state of the context-level encoder, when c, r, and r̂ are fed as input, as the vector representations for c, r, and r̂, respectively. We use representations from the VHRED model as it produces more diverse and coherent responses compared to HRED.

Figure 3: The VHRED model used for pre-training. The hierarchical structure of the RNN encoder is shown in the red box around the bottom half of the figure. After training using the VHRED procedure, the last hidden state of the context-level encoder is used as a vector representation of the input text.

In order to reduce the effective vocabulary size, we use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters. We also use layer normalization (Ba et al., 2016) for the hierarchical encoder, which we found worked better at the task of dialogue generation than the related recurrent batch normalization (Ioffe and Szegedy, 2015; Cooijmans et al., 2016). To train the VHRED model, we employed several of the same techniques found in (Serban et al., 2016b) and (Bowman et al., 2016): we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches. We use Adam as our optimizer (Kingma and Ba, 2014).

When training ADEM, we also employ a sub-sampling procedure based on the model response length. In particular, we divide the training examples into bins based on the number of words in a response and the score of that response. We then over-sample from bins across the same score to ensure that ADEM does not use response length to predict the score. This is because humans have a tendency to give a higher rating to shorter responses than to longer responses (Serban et al., 2016b), as shorter responses are often more generic and thus are more likely to be suitable to the context. Indeed, the test set Pearson correlation between response length and human score is 0.27.

For training VHRED, we use a context embedding size of 2000. However, we found the ADEM model learned more effectively when this embedding size was reduced. Thus, after training VHRED, we use principal component analysis (PCA) (Pearson, 1901) to reduce the dimensionality of the context, model response, and reference response embeddings to n. We found experimentally that n = 50 provided the best performance.

When training our models, we conduct early stopping on a separate validation set. For the evaluation dataset, we split the train/validation/test sets such that there is no context overlap (i.e. the contexts in the test set are unseen during training).
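As a rough sketch of the dimensionality-reduction step described above, the 2000-dimensional VHRED embeddings can be projected down to n = 50 with scikit-learn's PCA; fitting the projection on the training embeddings and reusing it for the other splits, and the function name itself, are assumptions of this sketch rather than details taken from the paper.

import numpy as np
from sklearn.decomposition import PCA

def reduce_embeddings(train_emb, other_embs, n_components=50):
    # fit PCA on training embeddings, then apply the same projection elsewhere
    pca = PCA(n_components=n_components)
    train_reduced = pca.fit_transform(train_emb)
    return train_reduced, [pca.transform(e) for e in other_embs]

# toy example with random stand-ins for 2000-d VHRED embeddings
train = np.random.randn(200, 2000)
valid = np.random.randn(50, 2000)
test = np.random.randn(50, 2000)
train_r, (valid_r, test_r) = reduce_embeddings(train, [valid, test])
print(train_r.shape, valid_r.shape, test_r.shape)  # (200, 50) (50, 50) (50, 50)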
Figure 4: Scatter plots showing model scores against human scores, for BLEU-2 and ROUGE on the full dataset, and ADEM on the test set. We add Gaussian noise to the integer human scores to better visualize the density of points, at the expense of appearing less correlated.
                 Full dataset                     Test set
Metric           Spearman         Pearson        Spearman       Pearson
BLEU-2           0.039 (0.013)    0.081 (…)      …              …

                 Validation set                   Test set
C-ADEM           …                …              …              …
R-ADEM           …                …              …              …
ADEM (T2V)       0.252 (…)        …              …              …
ADEM             …                …              …              …

Table 2: Correlation between metrics and human judgements, with p-values shown in brackets. 'ADEM (T2V)' indicates ADEM with tweet2vec embeddings (Dhingra et al., 2016), and 'VHRED' indicates the dot product of VHRED embeddings (i.e. ADEM at initialization). C- and R-ADEM represent the ADEM model trained to only compare the model response to the context or reference response, respectively. We compute the baseline metric scores (top) on the full dataset to provide a more accurate estimate of their scores (as they are not trained on a training set).
We first present new utterance-level correlation results for existing word-overlap metrics, in addition to results with embedding baselines and ADEM, in Table 2. The baseline metrics are evaluated on the entire dataset of 4,104 responses to provide the most accurate estimate of the score. We measure the correlation for ADEM on the validation and test sets, which constitute 616 responses each. (We present both the Spearman correlation, which is computed on ranks and depicts monotonic relationships, and the Pearson correlation, which is computed on true values and depicts linear relationships. Note that our word-overlap correlation results in Table 2 are also lower than those presented in (Galley et al., 2015). This is because Galley et al. measure corpus-level correlation, i.e. correlation averaged across different subsets (of size 100) of the data, and pre-filter for high-quality reference responses.)

We also conduct an analysis of the response data from (Liu et al., 2016), where the pre-processing is standardized by removing '<first speaker>' tokens at the beginning of each utterance. The results are detailed in the supplemental material. We can observe from both this data, and the new data in Table 2, that the correlations for the word-overlap metrics are even lower than estimated in previous studies (Liu et al., 2016; Galley et al., 2015). In particular, this is the case for BLEU-4, which has frequently been used for dialogue response evaluation (Ritter et al., 2011; Sordoni et al., 2015b; Li et al., 2015; Galley et al., 2015; Li et al., 2016a).

We can see from Table 2 that ADEM correlates far better with human judgement than the word-overlap baselines. This is further illustrated by the scatterplots in Figure 4. We also compare with ADEM using tweet2vec embeddings (Dhingra et al., 2016). In this case, instead of using the VHRED pre-training method presented in Section 4, we use off-the-shelf embeddings for c, r, and r̂, and fine-tune M and N on our dataset. These tweet2vec embeddings are computed at the character level with a bidirectional GRU on a Twitter dataset for hashtag prediction (Dhingra et al., 2016). We find that they obtain reasonable but inferior performance compared to using VHRED embeddings.

Figure 5: Scatterplots depicting the system-level correlation results for ADEM, BLEU-2, BLEU-4, and ROUGE on the test set. Each point represents the average scores for the responses from a dialogue model (TFIDF, DE, HRED, human). Human scores are shown on the horizontal axis, with normalized metric scores on the vertical axis. The ideal metric has a perfectly linear relationship.
System-level correlations
We show the system-level correlations for various metrics in Table 3, and present them visually in Figure 5. Each point in the scatterplots represents a dialogue model; humans give low scores to TFIDF and DE responses, higher scores to HRED, and the highest scores to other human responses. It is clear that existing word-overlap metrics are incapable of capturing this relationship for even 4 models. This renders them completely deficient for dialogue evaluation. However, ADEM produces almost the same model ranking as humans, achieving a significant Pearson correlation of 0.954. (For comparison, BLEU achieves a system-level correlation of 0.99 on 5 models in the translation domain (Papineni et al., 2002).) Thus, ADEM correlates well with humans both at the response and system level.
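For reference, the utterance-level and system-level correlations reported above can be computed with SciPy as in the following sketch; the scores and model labels here are made-up toy values, not data from our experiments.

import numpy as np
from scipy.stats import pearsonr, spearmanr

human = np.array([5, 1, 3, 4, 2, 5, 1, 2])                    # human scores
metric = np.array([4.1, 1.5, 2.8, 3.9, 2.2, 4.6, 1.9, 2.4])   # e.g. ADEM scores
source = np.array(["tfidf", "de", "hred", "human",
                   "tfidf", "de", "hred", "human"])           # model that produced each response

# utterance-level: correlate scores response by response
print(spearmanr(human, metric))
print(pearsonr(human, metric))

# system-level: correlate the average score each dialogue model receives
sys_human = [human[source == m].mean() for m in np.unique(source)]
sys_metric = [metric[source == m].mean() for m in np.unique(source)]
print(pearsonr(sys_human, sys_metric))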
Generalization to previously unseen models
When ADEM is used in practice, it will take as input responses from a new model that it has not seen during training. Thus, it is crucial that ADEM correlates with human judgements for new models. We test ADEM's generalization ability by performing a leave-one-out evaluation. For each dialogue model that was the source of response data for training ADEM (TF-IDF, Dual Encoder, HRED, humans), we conduct an experiment where we train on all model responses except those from the chosen model, and test only on the model that was unseen during training.

The results are given in Table 4. We observe that the ADEM model is able to generalize for all models except the Dual Encoder. This is particularly surprising for the HRED model; in this case,
ADEM was trained only on responses that were written by humans (from retrieval models or human-generated), but is able to generalize to responses produced by a generative neural network model. When testing on the entire test set, the model achieves comparable correlations to the ADEM model that was trained on 25% less data selected at random.

Metric    Pearson
BLEU-1    -0.079 (0.921)
BLEU-2    0.308 (0.692)
BLEU-3    -0.537 (0.463)
BLEU-4    -0.536 (0.464)
ROUGE     0.268 (0.732)
ADEM      0.954 (0.046)

Table 3: System-level correlation, with the p-value in brackets.
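A minimal sketch of the leave-one-out protocol described above is given below; examples, train_adem, and evaluate are placeholders standing in for the dataset and the training/evaluation routines, not names from our codebase.

SOURCES = ["tfidf", "dual_encoder", "hred", "human"]

def leave_one_out(examples, train_adem, evaluate):
    results = {}
    for held_out in SOURCES:
        train = [ex for ex in examples if ex["source"] != held_out]
        test = [ex for ex in examples if ex["source"] == held_out]
        model = train_adem(train)                  # train on the remaining sources
        results[held_out] = evaluate(model, test)  # correlation with human scores
    return results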
Qualitative Analysis
To illustrate some strengths and weaknesses of ADEM, we show human and ADEM scores for each of the responses to various contexts in Table 5. There are several instances where ADEM predicts accurately: in particular, ADEM is often very good at assigning low scores to poor responses. This is seen in the first two contexts, where most of the responses given a score of 1 by humans are given scores less than 2 by ADEM. The single exception in response (4) for the second context seems somewhat appropriate and should perhaps have been scored higher by the human evaluator. There are also several instances where the model assigns high scores to suitable responses, as in the first two contexts.

One drawback we observed is that ADEM tends to be too conservative when predicting response scores. This is the case in the third context, where the model assigns low scores to most of the responses that a human rated highly. This behaviour is likely due to the squared error loss used to train ADEM; since the model receives a large penalty for incorrectly predicting an extreme value, it learns to predict scores closer to the average human score.
                  Test on full dataset             Test on removed model responses
Data Removed      Spearman        Pearson         Spearman        Pearson
TF-IDF            0.406 (…)       …               …               …
Dual Encoder      …               …               …               …
HRED              …               …               …               …
Humans            …               …               …               …
Average           0.364           0.369           0.145           0.130
25% at random     0.378 (…)       …               …               …

Table 4: Correlation for ADEM when various model responses are removed from the training set. The left two columns show performance on the entire test set, and the right two columns show performance on responses only from the dialogue model not seen during training. The last row (25% at random) corresponds to the ADEM model trained on all model responses, but with the same amount of training data as the model above (i.e. 25% less data than the full training set).
Context: photo to see my television debut go to - some. some on <url> - hehe <url> → it really was you? i thought ppl were recognizing someone who looked like you! were the oysters worth the wait?
Reference response: yeah it was me. haha i'd kinda forgotten about it it was filmed a while ago
Model responses (human score, ADEM score):
1) i'm not sure. i just don't know what to do with it. (3, 3.413)
2) you heard the horsepower productions remix of lee scratch perry's 'exercising' off his 'mighty upsetter' album? (1, 1.644)
3) you wont chug a fuzzy peach navel (1, 1.944)
4) they were! (5, 3.085)

Context: just beat call of duty!! → want a cookie? → yes!! → come get it
Reference response: im in kenmore at the moment
Model responses (human score, ADEM score):
1) i'm gonna get a new phone some moro (1, 2.947)
2) no way man. (5, 3.405)
3) wow i just got a free pizza coupon! get yours before theres no more! <url> (1, 2.877)
4) i'm going to go to the mall. (1, 2.851)

Context: am i out of twitter jail yet? testing → yeah. i posted bail → …

Table 5: Examples of scores given by the ADEM model.

We provide many more experiments, including investigation of evaluation speed, learning curves, data efficiency, a failure analysis, and the primary source of improvement over word-overlap metrics, in the supplemental material.
Related to our approach is the literature on novel methods for the evaluation of machine translation systems, especially through the WMT evaluation task (Callison-Burch et al., 2011; Macháček and Bojar, 2014; Stanojević et al., 2015). In particular, (Albrecht and Hwa, 2007; Gupta et al., 2015) have proposed to evaluate machine translation systems using regression and Tree-LSTMs, respectively. Their approach differs from ours as, in the dialogue domain, we must additionally condition our score on the context of the conversation, which is not necessary in translation.

There has also been related work on estimating the quality of responses in chat-oriented dialogue systems. (DeVault et al., 2011) train an automatic dialogue policy evaluation metric from 19 structured role-playing sessions, enriched with paraphrases and external referee annotations. (Gandhe and Traum, 2016) propose a semi-automatic evaluation metric for dialogue coherence, similar to BLEU and ROUGE, based on 'wizard of Oz' type data (in 'wizard of Oz' scenarios, humans play the role of the dialogue system, usually unbeknownst to the interlocutors). (Xiang et al., 2014) propose a framework to predict utterance-level problematic situations in a dataset of Chinese dialogues using intent and sentiment factors. Finally, (Higashinaka et al., 2014) train a classifier to distinguish user utterances from system-generated utterances using various dialogue features, such as dialogue acts, question types, and predicate-argument structures.

Several recent approaches use hand-crafted reward features to train dialogue models using reinforcement learning (RL). For example, (Li et al., 2016b) use features related to ease of answering and information flow, and (Yu et al., 2016) use metrics related to turn-level appropriateness and conversational depth. These metrics are based on hand-crafted features, which only capture a small set of relevant aspects; this inevitably leads to sub-optimal performance, and it is unclear whether such objectives are preferable over retrieval-based cross-entropy or word-level maximum log-likelihood objectives. Furthermore, many of these metrics are computed at the conversation level, and are not available for evaluating single dialogue responses. The metrics that can be computed at the response level could be incorporated into our framework, for example by adding a term to equation 1 consisting of a dot product between these features and a vector of learned parameters.

There has been significant work on evaluation methods for task-oriented dialogue systems, which attempt to solve a user's task such as finding a restaurant. These methods include the PARADISE framework (Walker et al., 1997) and MeMo (Möller et al., 2006), which consider a task completion signal. PARADISE in particular is perhaps the first work on learning an automatic evaluation function for dialogue, accomplished through linear regression. However, PARADISE requires that one can measure task completion and task complexity, which are not available in our setting.
We use the Twitter Corpus to train our models as it contains a broad range of non-task-oriented conversations and it has been used to train many state-of-the-art models. However, our model could easily be extended to other general-purpose datasets, such as Reddit, once similar pre-trained models become publicly available. Such models are necessary even for creating a test set in a new domain, which will help us determine if ADEM generalizes to related dialogue domains. We leave investigating the domain transfer ability of ADEM for future work.

The evaluation model proposed in this paper favours dialogue models that generate responses that are rated as highly appropriate by humans. It is likely that this property does not fully capture the desired end-goal of chatbot systems. For example, one issue with building models to approximate human judgements of response quality is the problem of generic responses. Since humans often provide high scores to generic responses due to their appropriateness for many given contexts (Shang et al., 2016), a model trained to predict these scores will exhibit the same behaviour. An important direction for future work is modifying ADEM such that it is not subject to this bias. This could be done, for example, by censoring ADEM's representations (Edwards and Storkey, 2016) such that they do not contain any information about length. Alternatively, one can combine this with an adversarial evaluation model (Kannan and Vinyals, 2017; Li et al., 2017) that assigns a score based on how easy it is to distinguish the dialogue model responses from human responses. In this case, a model that generates generic responses will easily be distinguishable and obtain a low score.

An important direction of future research is building models that can evaluate the capability of a dialogue system to have an engaging and meaningful interaction with a human. Compared to evaluating a single response, this evaluation is arguably closer to the end-goal of chatbots. However, such an evaluation is extremely challenging to do in a completely automatic way. We view the evaluation procedure presented in this paper as an important step towards this goal; current dialogue systems are incapable of generating responses that are rated as highly appropriate by humans, and we believe our evaluation model will be useful for measuring and facilitating progress in this direction.
Acknowledgements
We’d like to thank Casper Liu for his help with thecorrelation code, Laurent Charlin for helpful dis-cussions on the data collection, Jason Weston forsuggesting improvements in the experiments, andJean Harb and Emmanuel Bengio for their debug-ging aid. We gratefully acknowledge support fromthe Samsung Institute of Advanced Technology, theNational Science and Engineering Research Coun-cil, and Calcul Quebec. We’d also like to thank thedevelopers of Theano ( ? ). References
Joshua Albrecht and Rebecca Hwa. 2007. Regression for sentence-level MT evaluation with pseudo references. In ACL.

Ron Artstein, Sudeep Gandhe, Jillian Gerten, Anton Leuski, and David Traum. 2009. Semi-formal evaluation of conversational characters. In Languages: From Formal to Natural, Springer, pages 22–35.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 22–64.

Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. ACL 2014, page 362.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin.

Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

David DeVault, Anton Leuski, and Kenji Sagae. 2011. Toward learning and evaluation of dialogue policies with text examples. In Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, pages 39–48.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. arXiv preprint arXiv:1605.03481.

Harrison Edwards and Amos Storkey. 2016. Censoring representations with an adversary. ICLR.

Salah El Hihi and Yoshua Bengio. 1995. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 400, page 409.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal.

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv preprint arXiv:1506.06863.

Sudeep Gandhe and David Traum. 2016. A semi-automated evaluation metric for dialogue model coherence. In Situated Dialog in Speech-Based Human-Computer Interaction, Springer, pages 217–225.

Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ryuichiro Higashinaka, Toyomi Meguro, Kenji Imamura, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. 2014. Evaluating coherence in open domain conversational systems. In INTERSPEECH, pages 130–134.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, page 91.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), volume 36, pages 495–503.

Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276–3284.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.

Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, volume 8.

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301.

J. Markoff and P. Mozur. 2015. For sympathetic ear, more Chinese turn to smartphone program. NY Times.

Sebastian Möller, Roman Englert, Klaus-Peter Engelbrecht, Verena Vanessa Hafner, Anthony Jameson, Antti Oulasvirta, Alexander Raake, and Norbert Reithinger. 2006. MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In INTERSPEECH.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Karl Pearson. 1901. Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 583–593.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016b. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the NTCIR-12 short text conversation task. Proceedings of NTCIR-12, pages 473–484.

Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 553–562.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Miloš Stanojević, Amir Kamran, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 metrics shared task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273.

Alan M. Turing. 1950. Computing machinery and intelligence. Mind.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, pages 271–280.

J. Weizenbaum. 1966. ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM.

Xiang et al. 2014. Proc. CLP, pages 43–51.

Zhou Yu, Ziyu Xu, Alan W. Black, and Alex I. Rudnicky. 2016. Strategy and policy learning for non-task-oriented conversational systems. Page 404.

Appendix A: Further Notes on Crowdsourcing Data Collection
Amazon Mechanical Turk Experiments
We conducted two rounds of AMT experiments. We first asked AMT workers to provide a reasonable continuation of a Twitter dialogue (i.e. generate the next response given the context of a conversation). Each survey contained 20 questions, including an attention check question. Workers were instructed to generate longer responses, in order to avoid simple one-word responses. In total, we obtained approximately 2,000 human responses.

Second, we filtered these human-generated responses for potentially offensive language, and combined them with approximately 1,000 responses from each of the above models into a single set of responses. We then asked AMT workers to rate the overall quality of each response on a scale of 1 (low quality) to 5 (high quality). Each user was asked to evaluate 4 responses from 50 different contexts. We included four additional attention-check questions, and a set of five contexts was given to each participant for assessment of inter-annotator agreement. We removed all users who either failed an attention check question or achieved a κ inter-annotator agreement score lower than 0.2 (Cohen, 1968). The remaining evaluators had a median κ score of 0.63, indicating moderate agreement. This is consistent with results from (Liu et al., 2016). Dataset statistics are provided in Table 1.

In initial experiments, we also asked humans to provide scores for topicality, informativeness, and whether the context required background information to be understandable. Note that we did not ask for fluency scores, as 3/4 of the responses were produced by humans (including the retrieval models). We found that scores for informativeness and background had low inter-annotator agreement (Table 6), and scores for topicality were highly correlated with the overall score (Pearson correlation of 0.72). Results on these auxiliary questions varied depending on the wording of the question. Thus, we continued our experiments by only asking for the overall score. We provide more details concerning the data collection in the supplemental material, as it may aid others in developing effective crowdsourcing experiments.

Preliminary AMT experiments
Before conducting the primary crowdsourcing experiments to collect the dataset in this paper, we ran a series of preliminary experiments to see how AMT workers responded to different questions. Unlike the primary study, where we asked a small number of overlapping questions to determine the κ score and filtered users based on the results, we conducted a study where all responses (40 in total from 10 contexts) were overlapping. We did this for 18 users in two trials, resulting in 153 pair-wise correlation scores per trial.

Measurement       κ score
Overall           0.63
Topicality        0.57
Informativeness   0.31
Background        0.05

Table 6: Median κ inter-annotator agreement scores for various questions asked in the survey.

In the first trial, we asked the following questions to the users, for each response:
1. How appropriate is the response overall? (overall, scale of 1-5)
2. How on-topic is the response? (topicality, scale of 1-5)
3. How specific is the response to some context? (specificity, scale of 1-5)
4. How much background information is required to understand the context? (background, scale of 1-5)

Note that we do not ask for fluency, as 3/4 of the responses for each context were written by a human (including retrieval models). We also provided the AMT workers with examples that have high topicality and low specificity, and examples with high specificity and low topicality. The background question was only asked once for each context.

We observed that both the overall scores and topicality had fairly high inter-annotator agreement (as shown in Table 6), but were strongly correlated with each other (i.e. participants would often put the same scores for topicality and overall score). Conversely, specificity and background had very low inter-annotator agreement.

To better visualize the data, we produce scatterplots showing the distribution of scores for different responses, for each of the four questions in our survey (Figure 6). We can see that the overall and topicality scores are clustered for each question, indicating high agreement. However, these clusters are most often in the same positions for each response, which indicates that they are highly correlated with each other. Specificity and background information, on the other hand, show far fewer clusters, indicating lower inter-annotator agreement. We conjectured that this was partially because the terms 'specificity' and 'background information', along with our descriptions of them, had a high cognitive load, and were difficult to understand in the context of our survey.

Figure 6: Scatter plots showing the distribution of scores (vertical axis) for different responses (horizontal axis), for each of the four questions in our survey. It can be seen that the overall and topicality scores are clustered for each question, indicating high agreement, while this is not the case for specificity or background information. Note that all scores are normalized on a per-user basis, based on the average score given by each user.

To test this hypothesis, we conducted a new survey where we tried to ask the questions for specificity and background in a more intuitive manner. We also changed the formulation of the background question to be a binary 0-1 decision of whether users understood the context. We asked the following questions:
1. How appropriate is the response overall? (overall, scale of 1-5)
2. How on-topic is the response? (topicality, scale of 1-5)
3. How common is the response? (informativeness, scale of 1-5)
4. Does the context make sense? (context, scale of 0-1)

We also clarified our description for the third question, including providing more intuitive examples. Interestingly, the inter-annotator agreement on informativeness (κ = 0.31) was much higher than that for specificity in the original survey. Thus, the formulation of questions in a crowdsourcing survey has a large impact on inter-annotator agreement. For the context question, we found that users either agreed highly (45 participants) or not at all (113 participants).

We also experimented with asking the overall score on a separate page, before asking questions 2-4, and found that this increased the κ agreement slightly. Similarly, excluding all scores where participants indicated they did not understand the context improved inter-annotator agreement slightly. Due to these observations, we decided to only ask users for their overall quality score for each response, as it is unclear how much additional information is provided by the other questions in the context of dialogue. We hope this information is useful for future crowdsourcing experiments in the dialogue domain.
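Agreement throughout this appendix is measured with Cohen's weighted κ. As a small illustration, scikit-learn computes it directly from two annotators' scores on the shared responses; the 'linear' weighting and the toy scores below are assumptions of this sketch, not details from our experiments.

from sklearn.metrics import cohen_kappa_score

# toy 1-5 scores from two workers on the same shared responses
worker_a = [5, 4, 2, 1, 3, 4, 5, 2]
worker_b = [4, 4, 2, 2, 3, 5, 5, 1]

kappa = cohen_kappa_score(worker_a, worker_b, weights="linear")
print(kappa)  # workers below kappa = 0.2 were removed from the study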
Appendix B: Metric Description

BLEU
BLEU (Papineni et al., 2002) analyzes the co-occurrences of n-grams in the ground truth and the proposed responses. It first computes an n-gram precision for the whole dataset:

P_n(r, r̂) = Σ_k min(h(k, r), h(k, r̂)) / Σ_k h(k, r̂)

where k indexes all possible n-grams of length n and h(k, r) is the number of n-grams k in r. Note that the min in this equation is calculating the number of co-occurrences of n-gram k between the ground truth response r and the proposed response r̂, as it computes the fewest appearances of k in either response. To avoid the drawbacks of using a precision score, namely that it favours shorter (candidate) sentences, the authors introduce a brevity penalty. BLEU-N, where N is the maximum length of n-grams considered, is defined as:

BLEU-N := b(r, r̂) exp( Σ_{n=1}^{N} β_n log P_n(r, r̂) )

where β_n is a weighting that is usually uniform, and b(·) is the brevity penalty. The most commonly used version of BLEU assigns N = 4. Modern versions of BLEU also use sentence-level smoothing, as the geometric mean often results in scores of 0 if there is no 4-gram overlap (Chen and Cherry, 2014). Note that BLEU is usually calculated at the corpus level, and was originally designed for use with multiple reference sentences.
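The definition above can be written out directly; the following is a hand-rolled sketch for a single reference, without the sentence-level smoothing or multi-reference support mentioned above.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision(reference, hypothesis, n):
    # P_n: clipped n-gram counts of the proposed response
    ref, hyp = ngram_counts(reference, n), ngram_counts(hypothesis, n)
    overlap = sum(min(count, ref[g]) for g, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

def bleu(reference, hypothesis, max_n=4):
    # brevity penalty b(r, r_hat)
    bp = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(hypothesis), 1))
    precisions = [precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:   # without smoothing the geometric mean collapses to 0
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split(), max_n=2))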
METEOR

The METEOR metric (Banerjee and Lavie, 2005) was introduced to address several weaknesses in BLEU. It creates an explicit alignment between the candidate and target responses. The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens, and then paraphrases. Given a set of alignments, the METEOR score is the harmonic mean of precision and recall between the proposed and ground truth sentence.

Given a set of alignments m, the METEOR score is the harmonic mean of precision P_m and recall R_m between the candidate and target sentence:

P_en = γ (ch / m)^θ    (3)
F_mean = P_m R_m / (α P_m + (1 − α) R_m)    (4)
P_m = |m| / Σ_k h_k(c_i)    (5)
R_m = |m| / Σ_k h_k(s_ij)    (6)
METEOR = (1 − P_en) F_mean    (7)

The penalty term P_en is based on the 'chunkiness' of the resolved matches. We use the default values for the hyperparameters α, γ, and θ.
ROUGE

ROUGE (Lin, 2004) is a set of evaluation metrics used for automatic summarization. We consider ROUGE-L, which is an F-measure based on the Longest Common Subsequence (LCS) between a candidate and target sentence. The LCS is a set of words which occur in two sentences in the same order; however, unlike n-grams the words do not have to be contiguous, i.e. there can be other words in between the words of the LCS. ROUGE-L is computed using an F-measure between the reference response and the proposed response:

R = max_j l(c_i, s_ij) / |s_ij|    (8)
P = max_j l(c_i, s_ij) / |c_i|    (9)
ROUGE-L(c_i, S_i) = (1 + β^2) R P / (R + β^2 P)    (10)

where l(c_i, s_ij) is the length of the LCS between the sentences. β is usually set to favour recall.
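A minimal sketch of ROUGE-L as defined above, for a single reference, is given below; the value of β here is a typical recall-favouring choice and is an assumption, not necessarily the value used in our experiments.

def lcs_length(a, b):
    # classic dynamic program for the longest common subsequence
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else \
                max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, candidate, beta=1.2):
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(reference), lcs / len(candidate)
    return ((1 + beta ** 2) * recall * precision) / (recall + beta ** 2 * precision)

# Figure 1 example again: no common words, so the score is 0
print(rouge_l("yeah the film about turing looks great".split(),
              "nah let 's do something active".split()))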
Appendix C: Latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED)

The VHRED model is an extension of the original hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a) with an additional component: a high-dimensional stochastic latent variable at every dialogue turn. The dialogue context is encoded into a vector representation using the utterance-level and context-level RNNs from our encoder. Conditioned on the summary vector at each dialogue turn, VHRED samples a multivariate Gaussian variable that is provided, along with the context summary vector, as input to the decoder RNN, which in turn generates the response word-by-word. We use representations from the VHRED model as it produces more diverse and coherent responses compared to its HRED counterpart.

The VHRED model is trained to maximize a lower bound on the log-likelihood of generating the next response:

L = log P_θ(w_1, …, w_N) ≥ Σ_{n=1}^{N} −KL[ Q_ψ(z_n | w_1, …, w_n) || P_θ(z_n | w_1, …, w_{n−1}) ] + E_{Q_ψ(z_n | w_1, …, w_n)}[ log P_θ(w_n | z_n, w_1, …, w_{n−1}) ]

Maximizing the likelihood of generating the next utterance in a dialogue is not only a convenient way of training the encoder parameters; it is also an objective that is consistent with learning useful representations of the dialogue utterances. Two context vectors produced by the VHRED encoder are similar if the contexts induce a similar distribution over subsequent responses; this is consistent with the formulation of the evaluation model, which assigns high scores to responses that have similar vector representations to the context. VHRED is also closely related to the skip-thought-vector model (Kiros et al., 2015), which has been shown to learn useful representations of sentences for many tasks, including semantic relatedness and paraphrase detection. The skip-thought-vector model takes as input a single sentence and predicts the previous sentence and next sentence. On the other hand, VHRED takes as input several consecutive sentences and predicts the next sentence. This makes it particularly suitable for learning long-term context representations.

Appendix D: Experiments & results

Hyperparameters

When evaluating our model, we conduct early stopping on an external validation set to obtain the best parameter setting. We similarly choose our hyperparameters (PCA dimension n, L2 regularization penalty γ, learning rate a, and batch size b) based on validation set results. Our best ADEM model used a batch size of b = 32. For ADEM with tweet2vec embeddings, we did a similar hyperparameter search, and used n = 150 and b = 16.

Additional Results

New results on (Liu et al., 2016) data

In order to ensure that the correlations between word-overlap metrics and human judgements were comparable across datasets, we standardized the processing of the evaluation dataset from (Liu et al., 2016). In particular, the original data from (Liu et al., 2016) has a token (either '<first speaker>', '<second speaker>', or '<third speaker>') at the beginning of each utterance. This is an artifact left over by the processing used as input to the hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a). Removing these tokens makes sense for establishing the ability of word-overlap models, as they are unrelated to the content of the tweets.

Metric    Spearman        Pearson
BLEU-1    -0.026 (0.80)   0.016 (0.87)
BLEU-2    0.065 (0.52)    0.080 (0.43)
BLEU-3    0.139 (0.17)    0.088 (0.39)
BLEU-4    0.139 (0.17)    0.092 (0.36)
ROUGE     -0.083 (0.41)   -0.010 (0.92)

Table 7: Correlations between word-overlap metrics and human judgements on the dataset from (Liu et al., 2016), after removing the speaker tokens at the beginning of each utterance. The correlations are even worse than estimated in the original paper, and none are significant.

We perform this processing, and report the updated results for word-overlap metrics in Table 7. Surprisingly, almost all significant correlation disappears, particularly for all forms of the BLEU score.
Thus, we can conclude that the word-overlap metrics were heavily relying on these tokens to form bigram matches between the model responses and reference responses.

Evaluation speed

An important property of evaluation models is speed. We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 8. When run on GPU, ADEM is able to evaluate responses in a reasonable amount of time (approximately 2.5 minutes). This includes the time for encoding the contexts, model responses, and reference responses into vectors with the hierarchical RNN, in addition to computing the PCA projection, but does not include pre-training with VHRED. For comparison, if run on a test set of 10,000 responses, ADEM would take approximately 45 minutes. This is significantly less time consuming than setting up human experiments at any scale. Note that we have not yet made any effort to optimize the speed of the ADEM model.

Metric        Wall time
ADEM (CPU)    2861s
ADEM (GPU)    168s

Table 8: Evaluation time on the test set.

Learning curves

To show that our learning procedure for ADEM really is necessary, and that the embeddings produced by VHRED are not sufficient to evaluate dialogue systems, we plot the Spearman and Pearson correlations on the test set as a function of the number of epochs in Figure 7. It is clear that, at the beginning of training, when the matrices M and N have been initialized to the identity, the model is incapable of accurately predicting human scores, and its correlation is approximately 0.

Figure 7: Plots showing the Spearman correlation (a) and Pearson correlation (b) on the test set as ADEM trains. At the beginning of training, the model does not correlate with human judgements.

Failure analysis

We now conduct a failure analysis of the ADEM model. In particular, we look at two different cases: responses where both humans and (normalized) ROUGE or BLEU-2 score highly (a score of 4 out of 5 or greater) while ADEM scores poorly (2 out of 5 or lower), and the converse, where ADEM scores the response highly while humans and either ROUGE or BLEU-2 score it poorly. We randomly sample (i.e. without cherry picking) three examples of each case, which are shown in Tables 9-10.

From Table 9, the cases where ADEM misses a good response, we can see that there are a variety of reasons for this cause of failure. In the first example, ADEM is not able to match the fact that the model response talks about sleep to the reference response or context. This is possibly because the utterance contains a significant amount of irrelevant information: indeed, the first two sentences are not related to either the context or reference response. In the second example, the model response does not seem particularly relevant to the context — despite this, the human scoring this example gave it a high score. This illustrates one drawback of human evaluations; they are quite subjective, and often have some noise. This makes it difficult to learn an effective ADEM model. Finally, ADEM is unable to score the third response highly, even though it is very closely related to the reference response.

We can observe from the first two examples in Table 10, where the ADEM model erroneously ranks the model responses highly, that ADEM is occasionally fooled into giving high scores for responses that are completely unrelated to the context. This may be because both of the utterances are short, and short utterances are ranked higher by humans in general since they are often more generic (as detailed in Section 5).
Learning curves
To show that our learning procedure for ADEM really is necessary, and that the embeddings produced by VHRED are not by themselves sufficient for evaluating dialogue systems, we plot the Spearman and Pearson correlations on the test set as a function of the number of training epochs in Figure 7. It is clear that, at the beginning of training, when the matrices M and N have been initialized to the identity, the model is incapable of accurately predicting human scores, and its correlation is approximately 0.

Figure 7: (a) Spearman correlation; (b) Pearson correlation. Spearman and Pearson correlations on the test set as ADEM trains; at the beginning of training, the model does not correlate with human judgements.

Failure analysis
We now conduct a failure analysis of the ADEM model. In particular, we look at two cases: responses where both humans and (normalized) ROUGE or BLEU-2 score highly (4 out of 5 or greater) while ADEM scores poorly (2 out of 5 or lower), and the converse, where ADEM scores the response highly while humans and either ROUGE or BLEU-2 score it poorly. We randomly sample (i.e. without cherry-picking) three examples of each case, shown in Tables 9 and 10.

From Table 9, the cases where ADEM misses a good response, we can see that there are a variety of reasons for this kind of failure. In the first example, ADEM is unable to match the fact that the model response talks about sleep to the reference response or the context. This is possibly because the utterance contains a significant amount of irrelevant information: indeed, the first two sentences are related to neither the context nor the reference response. In the second example, the model response does not seem particularly relevant to the context; despite this, the human scoring this example gave it 4/5. This illustrates one drawback of human evaluations: they are quite subjective and often noisy, which makes it difficult to learn an effective ADEM model. Finally, ADEM is unable to score the third response highly, even though it is very closely related to the reference response.

We can observe from the first two examples in Table 10, where the ADEM model erroneously ranks the model responses highly, that ADEM is occasionally fooled into giving high scores to responses that are completely unrelated to the context. This may be because both of these utterances are short, and short utterances are ranked higher by humans in general, since they are often more generic (as detailed in Section 5). In the third example, the response actually seems somewhat reasonable given the context; this may be an instance where the human evaluator provided a score that was too low.

Table 9, Example 1.
Context: what theme do you guys want next on tumblr? we've had mariskamommymoments what do you want to see next? → im sorry. hope you feel better soon! → it will wear off. just hate feeling like this → im sure it will! just relax and take your time → i'm okay. just overly tired
Reference response: maybe you need a bit more sleep
Model response: i'm really impressed. first guy to said that p what's time? sleep late its not good. i'm worried
Human: 4 | |BLEU-2|: 2.53 | |ROUGE|: 5.0 | |ADEM|: 1.01

Table 9, Example 2.
Context: some pm syria - the editor of syrian government daily tishrin said she had been sacked over remarks to al <url> → replaced by business editor! sounds like the states lol
Reference response: msm is very similar the world over! tied to the government.
Model response: i'm not sure if i'm going to be able to see the <unk>. i'm not sure if i should be able to see it.
Human: 4 | |BLEU-2|: 2.53 | |ROUGE|: 4.75 | |ADEM|: 1.22

Table 9, Example 3.
Context: wonder how long it will take wayne rooney to get himself into trouble on twitter. odds? → probably. a long time. because i used the address in the paper to find the page and it doesn't exist! → here you go
Reference response: thanks. i wonder why it didn't work!
Model response: thanks let me see if this one works :p
Human: 5 | |BLEU-2|: 2.53 | |ROUGE|: 4.24 | |ADEM|: 1.53

Table 9: Examples where a human and either BLEU-2 or ROUGE (after normalization) score the model response highly (4/5 or greater), while the ADEM model scores it poorly (2/5 or lower). These examples are drawn randomly (i.e. no cherry-picking). The bars around |metric| indicate that the metric scores have been normalized.

Table 10, Example 1.
Context: rage slightly dissipated. wouldn't have bothered restoring my phone but i need it to moan at tomorrow. → speaking of moaning. i'm actually going to email that chap that letter right now. → good plan i'm going to do a little weeblog about it too.
Reference response: all nice and measured, of course.
Model response: some. some unfortunately
Human: 2 | |BLEU-2|: 2.53 | |ROUGE|: 1.57 | |ADEM|: 4.38

Table 10, Example 2.
Context: high school flings college relationships → it seems like the other way around from wat i've seen
Reference response: word. i've seen a little of both. more of the college though
Model response: king james
Human: 1 | |BLEU-2|: 2.53 | |ROUGE|: 1.57 | |ADEM|: 5.0

Table 10, Example 3.
Context: is it getting light outside? i swear it looks blue. → time to go to sleepppp..
Reference response: for you, i'm staying up
Model response: i'm going to the beach.
Human: 1 | |BLEU-2|: 2.53 | |ROUGE|: 1.57 | |ADEM|: 5.0

Table 10: Examples where a human and either BLEU-2 or ROUGE (after normalization) score the model response low (2/5 or lower), while the ADEM model scores it highly (4/5 or greater). These examples are drawn randomly (i.e. no cherry-picking). The bars around |metric| indicate that the metric scores have been normalized.

Data efficiency
How much data is required to train ADEM? We conduct an experiment in which we train ADEM on different amounts of training data, from 5% to 100%; the results are shown in Table 11. We observe that ADEM is very data-efficient: it reaches a Spearman correlation of 0.4 using only half of the available training data (1,000 labelled examples), and it correlates significantly with human judgements even when trained on only 5% of the original training data (100 labelled examples).

Table 11: ADEM correlations when trained on different amounts of data (Spearman correlation of 0.414 when trained on 100% of the data).

Improvement over word-overlap metrics
Next, we analyze more precisely how ADEM outperforms traditional word-overlap metrics such as BLEU-2 and ROUGE. We first normalize the metric scores to have the same mean and variance as the human scores, clipping the resulting scores to the range [1, 5] (raw scores of 0 are assigned a normalized score of 1). We indicate normalization with vertical bars around the metric.
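A minimal sketch of this normalization is given below, under the assumption that the rescaling is a simple mean-and-variance match computed over the evaluation set; the exact procedure in the released code may differ in details, so treat this as illustrative.

# Sketch of the |metric| normalization described above: rescale raw metric
# scores to the mean and standard deviation of the human scores, clip the
# result to [1, 5], and assign raw scores of exactly 0 a normalized score of 1.
import numpy as np

def normalize_metric(raw_scores, human_scores):
    raw = np.asarray(raw_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    raw_std = raw.std() if raw.std() > 0 else 1.0   # guard against constant scores
    rescaled = (raw - raw.mean()) / raw_std * human.std() + human.mean()
    rescaled = np.clip(rescaled, 1.0, 5.0)
    rescaled[raw == 0] = 1.0   # raw scores of 0 map to the minimum score
    return rescaled

With scores normalized this way, |BLEU-2|, |ROUGE|, and |ADEM| can all be read on the same 1-5 scale as the human judgements.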
We then select all of the good responses that were given low scores by the word-overlap metrics (i.e. responses which humans scored as 4 or higher, and which |BLEU-2| and |ROUGE| scored as 2 or lower). The results are summarized in Table 12: of the 237 responses that humans scored 4 or higher, most of them (147/237) were ranked very poorly by both BLEU-2 and ROUGE. This quantitatively demonstrates what we argued qualitatively in Figure 1: a major failure of word-overlap metrics is their inability to recognize reasonable responses that share no words with the reference response. In 60 of these cases, |ADEM| is able to correctly assign a score greater than 4. For comparison, there are only 42 responses where humans give a score of 4 and |ADEM| gives a score less than 2, and only 14 of these are assigned a score greater than 4 by either |BLEU-2| or |ROUGE|.

Metric scores                               Count
Human ≥ 4                                   237 out of 616
  and |BLEU-2| < 2 and |ROUGE| < 2          146 out of 237
  and |ADEM| > 4                            60 out of 146
Human ≥ 4 and |ADEM| < 2                    42
  and (|BLEU-2| > 4 or |ROUGE| > 4)         14 out of 42

Table 12: In 60/146 cases, ADEM scores good responses (human score ≥ 4) highly when word-overlap metrics fail. The bars around |metric| indicate that the metric scores have been normalized.

To provide further insight, we give specific examples of responses that are scored highly (> 4) by both humans and |ADEM|, and poorly (< 2) by both |BLEU-2| and |ROUGE|, in Table 14. We draw 3 responses randomly (i.e. no cherry-picking) from the 60 test set responses that meet this criterion. We can observe that ADEM is able to recognize short responses that are appropriate to the context, without word-overlap with the reference response. This is the case even when the model and reference responses have very little semantic similarity, as in the first and third examples in Table 14.

Finally, we examine the behaviour of ADEM when there is a discrepancy between the lengths of the reference and model responses. Liu et al. (2016) show that word-overlap metrics such as BLEU-1, BLEU-2, and METEOR exhibit a bias in this scenario: they tend to assign higher scores to responses that are closer in length to the reference response. Humans do not exhibit this bias; in other words, the quality of a response as judged by a human is roughly independent of its length. Table 13 shows that ADEM likewise does not exhibit a bias towards similar-length responses. Note that, for our dataset, BLEU-2 almost exclusively assigns scores near 0 regardless of the length difference, resulting in a large (non-significant) p-value for that comparison.

Table 13: Effect of differences in response length on the normalized score, where ∆w is the absolute difference in the number of words between the model response and the reference response. The mean |ROUGE| score is 0.042 for responses with small ∆w (n = 312) versus 0.031 for large ∆w (n = 304), whereas the |ADEM| means show no comparable gap.
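The length-bias check behind Table 13 can be sketched as follows. The word-count threshold used to split the two groups and the choice of significance test are not stated in this copy, so both are placeholder assumptions (a threshold of 6 words and Welch's t-test are used purely for illustration).

# Sketch of the length-bias analysis: split responses by the absolute
# word-count difference (delta_w) between model and reference response,
# then compare the mean normalized metric score across the two groups.
import numpy as np
from scipy.stats import ttest_ind

def length_bias(metric_scores, model_responses, reference_responses, threshold=6):
    scores = np.asarray(metric_scores, dtype=float)
    delta_w = np.array([abs(len(m.split()) - len(r.split()))
                        for m, r in zip(model_responses, reference_responses)])
    close = scores[delta_w <= threshold]   # responses close in length to the reference
    far = scores[delta_w > threshold]      # responses far from the reference length
    _, p_value = ttest_ind(close, far, equal_var=False)   # Welch's t-test (assumed)
    return close.mean(), far.mean(), p_value

A length-insensitive metric should produce similar group means and a large p-value, which is the pattern Table 13 reports for ADEM.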
Table 14, Example 1.
Context: i'd recommend <url> - or build buy an htpc and put <url> on it. → you're the some nd person this week that's recommended roku to me.
Reference response: an htpc with xmbc is what i run. but i've decked out my setup. i've got <number> tb of data on my home server
Model response: because it's brilliant
Human: 5 | |BLEU-2|: 1.0 | |ROUGE|: 1.0 | |ADEM|: 4.726

Table 14, Example 2.
Context: imma be an auntie this weekend. i guess i have to go albany. herewego → u supposed to been here → i come off nd on. → never tell me smh
Reference response: lol you some timing
Model response: haha, anyway, how're you?
Human: 5 | |BLEU-2|: 1.0 | |ROUGE|: 1.0 | |ADEM|: 4.201

Table 14, Example 3.
Context: my son thinks she is plain. and the girl that plays her sister. seekhelp4him? → send him this. he'll thank you. <url>
Reference response: you are too kind for words.
Model response: i will do
Human: 5 | |BLEU-2|: 1.0 | |ROUGE|: 1.0 | |ADEM|: 5.0

Table 14: Examples where both the human and ADEM score the model response highly, while BLEU-2 and ROUGE do not. These examples are drawn randomly (i.e. no cherry-picking) from the examples where ADEM outperforms BLEU-2 and ROUGE (as defined in the text). ADEM is able to correctly assign high scores to short responses that have no word-overlap with the reference response. The bars around |metric| indicate that the metric scores have been normalized.