Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings
Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, Nanyun Peng
Sarik Ghazarian
Information Sciences Institute, University of Southern California
[email protected]

Johnny Tian-Zheng Wei
College of Natural Sciences, University of Massachusetts Amherst
[email protected]

Aram Galstyan
Information Sciences Institute, University of Southern California
[email protected]

Nanyun Peng
Information Sciences Institute, University of Southern California
[email protected]
Abstract
Despite advances in open-domain dialogue systems, automatic evaluation of such systems is still a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there could be many valid responses for a given context that share no common words with reference responses. A recent work proposed the Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER) to combine a learning-based metric, which predicts the relatedness between a generated response and a given query, with a reference-based metric; it showed a high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.
Introduction

Recent advances in open-domain dialogue systems (i.e., chatbots) highlight the difficulties in automatically evaluating them. This kind of evaluation inherits a characteristic challenge of NLG evaluation: given a context, there might be a diverse range of acceptable responses (Gatt and Krahmer, 2018).

Metrics based on n-gram overlap such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), originally designed for evaluating machine translation and summarization, have been adopted to evaluate dialogue systems (Sordoni et al., 2015; Li et al., 2016; Su et al., 2018). However, Liu et al. (2016) found a weak segment-level correlation between these metrics and human judgments of response quality.
Dialogue Context
Speaker 1: Hey! What are you doing here?
Speaker 2: I'm just shopping.

Query: What are you shopping for?
Generated Response: Some new clothes.
Reference Response: I want buy gift for my mom!
Table 1: An example of a zero BLEU score for an acceptable generated response in a multi-turn dialogue system.

As shown in Table 1, high-quality responses can have low or even no n-gram overlap with a reference response, showing that these metrics are not suitable for dialogue evaluation (Novikova et al., 2017; Lowe et al., 2017).

Due to the lack of strong automatic evaluation metrics, many researchers resort primarily to human evaluation for assessing their dialogue systems' performance (Shang et al., 2015; Sordoni et al., 2015; Shao et al., 2017). There are two main problems with human annotation: 1) it is time-consuming and expensive, and 2) it does not facilitate comparisons across research papers. For certain research questions that involve hyper-parameter tuning or architecture searches, the amount of human annotation required makes such studies infeasible (Britz et al., 2017; Melis et al., 2018). Therefore, developing reliable automatic evaluation metrics for open-domain dialogue systems is imperative.

The Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER) (Tao et al., 2018) stands out from recent work in automatic dialogue evaluation, relying minimally on human-annotated datasets of response quality for training. RUBER evaluates responses with a blending of scores from two metrics:
Figure 1: An illustration of changes applied to RUBER's unreferenced metric's architecture. Red dotted double arrows show three main changes. The leftmost section substitutes word2vec embeddings with BERT embeddings. The middle section replaces Bi-RNNs with simple pooling strategies to get sentence representations. The rightmost section switches the ranking loss function to an MLP classifier with a cross-entropy loss function.

• an Unreferenced metric, which computes the relevancy of a response to a given query, inspired by Grice's (1975) theory that the quality of a response is determined by its relatedness and appropriateness, among other properties. This model is trained with negative sampling.

• a Referenced metric, which determines the similarity between generated and reference responses using word embeddings.

Both metrics strongly depend on learned word embeddings. We propose to explore the use of contextualized embeddings, specifically BERT embeddings (Devlin et al., 2018), in composing evaluation metrics. Our contributions in this work are as follows:

• We explore the efficiency of contextualized word embeddings for training unreferenced models for open-domain dialogue system evaluation.

• We explore different network architectures and objective functions to better utilize contextualized word embeddings, and show their positive effects.
Proposed Evaluation Metrics

We conduct the research under the RUBER metric's referenced and unreferenced framework, where we replace its static word embeddings with pretrained BERT contextualized embeddings and compare the performances. We identify three points of variation with two options each in the unreferenced component of RUBER. The main changes are in the word embeddings, the sentence representation, and the training objective, which will be explained in detail in the following sections. Our experiment follows a 2x2x2 factorial design.
Unreferenced Metric

The unreferenced metric predicts how much a generated response is related to a given query. Figure 1 presents RUBER's unreferenced metric overlaid with our proposed changes in three parts of the architecture. The changes are illustrated by red dotted double arrows and concern the word embeddings, the sentence representation, and the loss function.
Word Embeddings

Static and contextualized embeddings are two different types of word embeddings that we explored.

• Word2vec. Recent works on learnable evaluation metrics use simple word embeddings such as word2vec and GloVe as input to their models (Tao et al., 2018; Lowe et al., 2017; Kannan and Vinyals, 2017). Since these static embeddings have a fixed, context-independent representation for each word, they cannot capture the rich semantics of words in context.

• BERT. Contextualized word embeddings have recently been shown to be beneficial in many NLP tasks (Devlin et al., 2018; Radford et al., 2018; Peters et al., 2018; Liu et al., 2019). A notable contextualized embedding model, BERT (Devlin et al., 2018), has been shown to perform competitively among other contextualized embeddings, so we explore the effect of BERT embeddings on the open-domain dialogue system evaluation task. Specifically, we substitute the word2vec embeddings with BERT embeddings in RUBER's unreferenced score, as shown in the leftmost section of Figure 1.
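The paper does not show its implementation; as a rough sketch (ours, using the HuggingFace transformers library and the bert-base-uncased checkpoint rather than the authors' TensorFlow code), contextual token embeddings can be extracted as follows:

```python
# Sketch (ours): extracting 768-d contextual token embeddings with the
# HuggingFace transformers library.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def bert_token_embeddings(sentence: str) -> torch.Tensor:
    """Return one contextual vector per word-piece token: (num_tokens, 768)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = bert(**inputs)
    return output.last_hidden_state.squeeze(0)
```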
Sentence Representation

This module composes a single vector representation for a query and for a response.

• Bi-RNN. In the RUBER model, Bidirectional Recurrent Neural Networks (Bi-RNNs) are trained for this purpose.

• Pooling. We explore the effect of replacing the Bi-RNNs with simple pooling strategies on top of the words' BERT embeddings (middle dotted section in Figure 1). The intuition is that BERT embeddings are pre-trained on bidirectional transformers and already encode complete information about a word's context; therefore, another layer of Bi-RNNs could simply blow up the number of parameters with no real gain.
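The pooling operation itself is just an element-wise max or mean over the token vectors; a minimal sketch (ours):

```python
import torch

def pool_tokens(token_embeddings: torch.Tensor, strategy: str = "max") -> torch.Tensor:
    """Collapse a (num_tokens, dim) matrix of BERT token embeddings
    into a single sentence vector by element-wise pooling."""
    if strategy == "max":
        return token_embeddings.max(dim=0).values
    elif strategy == "mean":
        return token_embeddings.mean(dim=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")
```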
MLP Network and Objective Functions

A Multilayer Perceptron Network (MLP) is the last section of RUBER's unreferenced model. It is trained by applying a negative sampling technique that adds random responses for each query to the training dataset.

• Ranking loss. The objective is to maximize the difference between the relatedness scores predicted for positive pairs and for the randomly added pairs. We refer to this objective function as a ranking loss function. The sigmoid function used in the last layer of the MLP assigns a score to each pair of query and response, which indicates how related the response is to the given query.

• Cross entropy loss. We explore the efficiency of a simpler loss function such as cross entropy. In fact, we cast unreferenced score prediction as a binary classification problem and replace the baseline trained MLP with an MLP classifier (right dotted section in Figure 1). Since we do not have a human-labeled dataset, we use a negative sampling strategy to add randomly selected responses to queries in the training dataset. We assign label 1 to original pairs of queries and responses and 0 to the negative samples. The output of the softmax function in the last layer of the MLP classifier indicates the relatedness score for each pair of query and response.
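The paper's models are trained in TensorFlow; the PyTorch sketch below (ours) illustrates the classifier variant, with the 256/512/128 tanh layers taken from the training details reported later. Feeding the concatenated query and response vectors to the MLP is our simplification of the input features.

```python
import torch
import torch.nn as nn

class UnreferencedClassifier(nn.Module):
    """Binary classifier over (query, response) sentence vectors (sketch)."""

    def __init__(self, sent_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * sent_dim, 256), nn.Tanh(),
            nn.Linear(256, 512), nn.Tanh(),
            nn.Linear(512, 128), nn.Tanh(),
            nn.Linear(128, 2),  # logits for {negative sample, true pair}
        )

    def forward(self, query_vec: torch.Tensor, response_vec: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([query_vec, response_vec], dim=-1))

# Train with cross-entropy on the logits (label 1 = original pair,
# label 0 = negative sample); at test time, the softmax probability of
# class 1 serves as the unreferenced relatedness score.
loss_fn = nn.CrossEntropyLoss()
```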
Figure 2: BERT-based referenced metric. Static word2vec embeddings are replaced with BERT embeddings (red dotted section).
Referenced Metric

The referenced metric computes the similarity between generated and reference responses. RUBER achieves this by applying pooling strategies to static word embeddings to obtain sentence embeddings for both the generated and the reference response. In our metric, we replace the word2vec embeddings with BERT embeddings (red dotted section in Figure 2) to explore the effect of contextualized embeddings on the referenced score. We refer to this metric as the BERT-based referenced metric.
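The referenced score itself is a cosine similarity between the two pooled sentence vectors; a one-line sketch (ours):

```python
import torch.nn.functional as F

def referenced_score(generated_vec, reference_vec):
    """Cosine similarity between pooled sentence embeddings of the
    generated response and the reference response."""
    return F.cosine_similarity(generated_vec, reference_vec, dim=-1)
```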
Dataset

We used the DailyDialog dataset (Li et al., 2017), which contains high-quality multi-turn conversations about daily life covering various topics, to train our dialogue system as well as the evaluation metrics. The dataset (available at http://yanran.li/dailydialog) includes almost 13k multi-turn dialogues between two parties, split into 42,000/3,700/3,900 query-response pairs for train/test/validation sets. We divided these sets into two parts: the first part for training the dialogue system and the second part for training the unreferenced metric.

Query | Response | Human ratings
Can I try this one on? | Yes, of course. | 5, 5, 5
This is the Bell Captain's Desk. May I help you? | No, it was nothing to leave. | 1, 2, 1
Do you have some experiences to share with me? I want to have a try. | Actually, it good to say. Thanks a lot. | 3, 2, 2

Table 2: Examples of query-response pairs, each rated by three AMT workers with scores from 1 (not an appropriate response) to 5 (a completely appropriate response).

Generated Responses

We used the first part of the train/test/validation sets, with overall 20,000/1,900/1,800 query-response pairs, to train an attention-based sequence-to-sequence (seq2seq) model (Bahdanau et al., 2014) and generate responses for evaluation. We used the OpenNMT (Klein et al., 2017) toolkit to train the model. The encoder and decoder are Bi-LSTMs with 2 layers, each containing 500-dimensional hidden units. We used 300-dimensional pretrained word2vec embeddings as our word embeddings. The model was trained using the SGD optimizer with a learning rate of 1. We used random sampling with temperature control and set the temperature value to 0.01 empirically to obtain grammatical and diverse responses.
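For concreteness, random sampling with temperature control rescales the decoder's output logits before sampling; a minimal sketch (ours, not OpenNMT internals):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Sample a token id from logits divided by a temperature. A temperature
    near 0 approaches greedy decoding; larger values increase diversity."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```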
Human Judgments

We collected human annotations on the generated responses in order to compute the correlation between human judgments and automatic evaluation metrics. Human annotations were collected from Amazon Mechanical Turk (AMT). AMT workers were given a set of query-response pairs and asked to rate each pair based on the appropriateness of the response for the given query, on a scale of 1-5 (not appropriate to very appropriate). Each survey included 5 query-response pairs plus an extra pair for attention checking. We removed all pairs that were rated by workers who failed to correctly answer the attention-check tests. Each pair was annotated by 3 individual turkers. Table 2 shows three query-response pairs rated by three AMT workers. In total, 300 utterance pairs were rated, contributed by 106 unique workers.
Word Embeddings

To compare how the word embeddings affect the evaluation metric, which is the main focus of this paper, we used word2vec as static embeddings, trained on about 100 billion words of the Google News Corpus. These 300-dimensional word embeddings cover almost 3 million words and phrases. We applied these pretrained embeddings as input to the dialogue generation model and to the referenced and unreferenced metrics.

To explore the effects of contextualized embeddings on the evaluation metrics, we used the BERT base model with 768-dimensional vectors, pretrained on the Books Corpus and English Wikipedia with 3,300M words (Devlin et al., 2018).
Training the Unreferenced Model

We used the second part of the DailyDialog dataset, composed of 22,000/1,800/2,100 train/test/validation pairs, to train and tune the unreferenced model, which is implemented in TensorFlow. For the sentence encoder, we used 2 layers of bidirectional gated recurrent units (Bi-GRU) with 128-dimensional hidden units. We used three layers for the MLP, with 256-, 512- and 128-dimensional hidden units and tanh as the activation function, for computing both the ranking loss and the cross-entropy loss. We used the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 10^-4 and applied learning rate decay when no improvement was observed on the validation data for five consecutive epochs. We applied an early stopping mechanism and stopped the training process after observing 20 epochs with no reduction in the loss value.

Results

We first present the unreferenced metrics' performance. Then, we present results on the full RUBER framework, which combines the unreferenced and referenced metrics. To evaluate the performance of our metrics, we calculated the Pearson and Spearman correlations between the learned metric scores and human judgments on the 300 query-response pairs collected from AMT. The Pearson coefficient measures the linear correlation between two ordinal variables, while the Spearman coefficient measures any monotonic relationship. The third measure we used is cosine similarity, which computes how similar the scores produced by the learned metrics are to the human scores.

Table 3: Correlations and similarity values between relatedness scores predicted by different unreferenced models and human judgments. The first row is RUBER's unreferenced model. (Columns: Embedding, Representation, Objective, Pearson (p-value), Spearman (p-value), Cosine Similarity.)
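These three measures can be computed directly with SciPy and NumPy; a sketch (ours) of the validation step:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def agreement_with_humans(metric_scores, human_scores):
    """Pearson and Spearman correlations plus cosine similarity between a
    metric's scores and the (averaged) human ratings."""
    pearson, pearson_p = pearsonr(metric_scores, human_scores)
    spearman, spearman_p = spearmanr(metric_scores, human_scores)
    a = np.asarray(metric_scores, dtype=float)
    b = np.asarray(human_scores, dtype=float)
    cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return pearson, spearman, cosine
```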
Unreferenced Metrics Results

This section analyzes the performance of the unreferenced metrics, which are trained with various word embeddings, sentence representations, and objective functions. The results in the upper section of Table 3 are all based on word2vec embeddings, while the lower section is based on BERT embeddings. The first row of Table 3 corresponds to RUBER's unreferenced model, and the five following rows are our exploration of different unreferenced models based on word2vec embeddings, for fair comparison with the BERT embedding-based ones. Table 3 demonstrates that unreferenced metrics based on BERT embeddings have higher correlation and similarity with human scores. Contextualized embeddings have been found to carry richer information, and the inclusion of these vectors in the unreferenced metric generally leads to better performance (Liu et al., 2019).

Comparing different sentence encoding strategies (Bi-RNN vs. pooling) while keeping the other variations constant, we observe that pooling of BERT embeddings yields better performance. This may be because BERT embeddings are pretrained on deep bidirectional transformers, so pooling mechanisms suffice to assign rich representations to sentences. In contrast, the models based on word2vec embeddings benefit from a Bi-RNN-based sentence encoder. Across settings, max pooling always outperforms mean pooling. Regarding the choice of objective functions, the ranking loss generally performs better for models based on word2vec embeddings, while the best model with BERT embeddings is obtained using the cross-entropy loss. We consider this an interesting observation and leave further investigation for future research.
Unreferenced + Referenced Metrics Results

This section analyzes the performance of integrating variants of the unreferenced metrics into the full RUBER framework, which is the combination of the unreferenced and referenced metrics. We only considered the best unreferenced models from Table 3. As shown in Table 4, across different settings, the max combination of the referenced and unreferenced metrics yields the best performance. We see that metrics based on BERT embeddings have higher Pearson and Spearman correlations with human scores than RUBER (the first row of Table 4), which is based on word2vec embeddings.

In comparison with the purely unreferenced metrics (Table 3), correlations decreased across the board. This suggests that the addition of the referenced component is not beneficial, contradicting RUBER's findings (Tao et al., 2018). We hypothesize that this could be due to data and/or language differences, and leave further investigation for future work.

Table 4: Correlation and similarity values between automatic evaluation metrics (combinations of referenced and unreferenced metrics) and human annotations for 300 query-response pairs annotated by AMT workers. The "Pooling" column shows the combination type of the referenced and unreferenced metrics. (Columns: Model, Unreferenced (Embedding, Representation, Objective), Referenced (Embedding), Pooling, Pearson, Spearman, Cosine Similarity; the first row is RUBER with word2vec, Bi-RNN, Ranking, word2vec, min.)
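The "Pooling" column refers to how the two scores are blended into a single evaluation score; a sketch (ours) of the min/max/mean combinations compared in Table 4:

```python
def blend_scores(unreferenced: float, referenced: float, strategy: str = "max") -> float:
    """Combine the unreferenced and referenced scores into a single
    RUBER-style evaluation score."""
    if strategy == "min":
        return min(unreferenced, referenced)
    if strategy == "max":
        return max(unreferenced, referenced)
    return (unreferenced + referenced) / 2.0  # mean
```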
Related Work

Given the impressive development of open-domain dialogue systems, automatic evaluation metrics are particularly desirable for easily comparing the quality of several models.
For some language generation tasks, such as machine translation and text summarization, n-gram overlap metrics have a high correlation with human evaluation. BLEU and METEOR are primarily used for evaluating the quality of translated sentences, based on computing n-gram precisions and the harmonic mean of precision and recall, respectively (Papineni et al., 2002; Banerjee and Lavie, 2005). ROUGE computes an F-measure based on the longest common subsequence and is widely used for evaluating text summarization (Lin, 2004). The main drawback of these n-gram overlap metrics, which makes them inapplicable to dialogue system evaluation, is that they do not consider the semantic similarity between sentences (Liu et al., 2016; Novikova et al., 2017; Lowe et al., 2017). Such word-overlap metrics are incompatible with the nature of language generation, which allows the same concept to appear in different sentences that share no common n-grams while conveying the same meaning.

Besides these heuristic metrics, researchers have recently tried to develop trainable metrics for automatically assessing the quality of generated responses. Lowe et al. (2017) trained a hierarchical neural network model called the Automatic Dialogue Evaluation Model (ADEM) to predict the appropriateness score of dialogue responses. For this purpose, they collected a training dataset by asking humans for informativeness scores for various responses to a given context. Although ADEM predicts scores that are highly correlated with human judgments at both the sentence and system level, collecting the required human annotations is itself an effortful and laborious task.

Kannan and Vinyals (2017) followed the GAN model's structure and trained a discriminator that tries to distinguish the model's generated responses from human responses. Even though they found the discriminator useful for automatic evaluation, they noted that it cannot completely address the evaluation challenges in dialogue systems.

RUBER is another learnable metric, which considers both relevancy and similarity in the evaluation process (Tao et al., 2018). The referenced metric of RUBER measures the similarity between the vectors of the generated and reference responses, computed by pooling word embeddings, while the unreferenced metric uses negative sampling to learn the relevancy score of a generated response to a given query. Unlike the ADEM score, which is trained on a human-annotated dataset, RUBER is not limited by any human annotation; in fact, training with negative samples makes RUBER more general. Clearly, both the referenced and unreferenced metrics depend on the information carried by the word embeddings. In this work, we show that contextualized embeddings, which include much more information about words and their context, can improve the accuracy of evaluation metrics.
Recently, there has been significant progress in word embedding methods. Unlike previous static word embeddings such as word2vec (https://code.google.com/archive/p/word2vec/), which map words to constant embeddings, contextualized embeddings such as ELMo, OpenAI GPT, and BERT treat a word's embedding as a function of the context in which the word appears (McCann et al., 2017; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). ELMo learns word vectors from a deep language model pretrained on a large text corpus (Peters et al., 2018). OpenAI GPT uses transformers to learn a language model and to fine-tune it for specific natural language understanding tasks (Radford et al., 2018). BERT learns word representations by jointly conditioning on both left and right context when training all levels of deep bidirectional transformers (Devlin et al., 2018). In this paper, we show that besides the positive effects of contextualized embeddings on many NLP tasks, including question answering, sentiment analysis, and semantic similarity, BERT embeddings also have the potential to help evaluate open-domain dialogue systems more closely to how humans would.

Conclusion

In this paper, we explored applying contextualized word embeddings to the automatic evaluation of open-domain dialogue systems. The experiments showed that the unreferenced scores of the RUBER metric can be improved by using contextualized word embeddings, which provide richer representations of words and their context. In the future, we plan to extend this work to evaluate multi-turn dialogue systems, as well as to take other aspects, such as creativity and novelty, into consideration in our evaluation metrics.
Acknowledgments

We thank the anonymous reviewers for their constructive feedback, as well as the members of the PLUS lab for their useful discussion and feedback. This work is supported by Contract W911NF-15-1-0543 with the US Defense Advanced Research Projects Agency (DARPA).
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pages 65–72.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res., 61:65–170.

H. Paul Grice. 1975. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Speech Acts, volume 3 of Syntax and Semantics, pages 41–58. Academic Press, New York.

Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. CoRR, abs/1701.08198.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 110–119.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers, pages 986–995.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2122–2132. The Association for Computational Linguistics.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. CoRR, abs/1903.08855.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1116–1126. Association for Computational Linguistics.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6297–6308.

Sheila A. McIlraith and Kilian Q. Weinberger, editors. 2018. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations.

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2241–2252.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1577–1586.

Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2210–2219.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2015).