DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation
Sarik Ghazarian, Zixi Liu, Tuhin Chakrabarty, Xuezhe Ma, Aram Galstyan, Nanyun Peng
Information Sciences Institute, University of Southern California
Department of Computer Science, Columbia University
Department of Computer Science, University of California Los Angeles
{sarik, xuezhema, galstyan}@isi.edu, [email protected]@cs.columbia.edu, [email protected]
Abstract
Having engaging and informative conversations with users is the utmost goal for open-domain conversational systems. Recent advances in transformer-based language models and their applications to dialogue systems have succeeded in generating fluent and human-like responses. However, they still lack control over the generation process towards producing contentful responses and achieving engaging conversations. To address this goal, we present DiSCoL (Dialogue Systems through Conversational Line guided response generation). DiSCoL is an open-domain dialogue system that leverages conversational lines (briefly, convlines) as controllable and informative content-planning elements to guide the generation model to produce engaging and informative responses. The two primary modules in DiSCoL's pipeline are conditional generators trained for 1) predicting relevant and informative convlines for dialogue contexts and 2) generating high-quality responses conditioned on the predicted convlines. Users can also change the returned convlines to steer the conversation towards topics that are more interesting to them. Through automatic and human evaluations, we demonstrate the efficacy of convlines in producing engaging conversations.

Introduction

Over the past decade, users have actively engaged with dialogue systems to fulfill a wide range of requirements. On one hand, task-oriented dialogue systems have assisted users in accomplishing specific tasks such as finding apartments and restaurants or even booking movie tickets (Gustafson et al., 2000; Gruenstein and Seneff, 2007; Li et al., 2017). On the other hand, open-domain dialogue systems have been extensively leveraged for psychotherapy counseling, entertainment, and even teaching foreign languages to users (Zhou et al., 2020; Oh et al., 2017; Sarosa et al., 2020).
[Figure 1 content:
Dialogue Context: what do you think about Game of Thrones?
DialoGPT: "I like Game of Thrones. I have not seen the latest season. I have seen the first two seasons."
DiSCoL, inferred convlines (game of thrones, game of lions, love the game, favorite show, favorite character, show lol): "I love the game of thrones! My favorite show lol! Game of lions is also my favorite show, who is your favorite character?"
DiSCoL, manipulated convlines (game of thrones, game of lions, adventure genre, favorite show, many genres, show lol): "Game of Thrones is my favorite show. I like many genres of TV, but I think I like the Adventure genre the most. How about you? Do you have a favorite show or do you like the adventure genre?"]
Figure 1: A dialogue context and three responses generated by DialoGPT and our proposed DiSCoL system using originally inferred and manipulated convlines, respectively. DiSCoL leverages convlines (depicted in colored boxes) to guide the generation model to encapsulate those informative contents. Our demo enables the user to edit or remove the inferred convlines (shown in blue for edits and red for removal) to guide the conversation in their desired directions.

Figure 2: A snapshot of the proposed DiSCoL system

These models do not provide users the possibility to control the generated content and guide the conversation in their desired direction. To alleviate this issue of generating informative and controllable responses, we propose DiSCoL, an open-domain dialogue system with the intervention of convlines as primary elements that add control for generating informative and content-rich responses. Convlines are abstract representations of utterances in dialogues that can be used as content-planning elements to form the high-level content of an utterance and guide the generator to incorporate these informative units into the generation (see colored boxes in Figure 1). Content planning has also been beneficial in the story generation task. Such abstract representations, known as storylines or story plots, have successfully guided language models to produce more coherent and fluent stories (Yao et al., 2019; Goldfarb-Tarrant et al., 2019; Fan et al., 2019; Goldfarb-Tarrant et al., 2020; Rashkin et al., 2020).

DiSCoL is composed of four main neural-network-based modules (see Figure 3). The first two modules are designed to extract entities and topics of the dialogue context. The third module is a fine-tuned conditional generator that learns to take the dialogue context and the previously extracted information and predict convlines, which are then leveraged in the response generator module.
Similar to the convline generator, the response generator is a conditional auto-regressive language model that generates a response conditioned on the dialogue context and its convlines, entities, and topics extracted by the previous modules. The middle block of Figure 1 exhibits the generated response for the inferred convlines shown in green boxes. In the interactive setting of our demo, a snapshot of which is shown in Figure 2, we provide the facility for the user to manipulate the predicted convlines and direct the conversation towards their topics of interest. The last block in Figure 1 depicts the removed and edited convlines (red and blue boxes) that led the generator to generate a slightly different response by taking the applied adjustments into account.

We validate DiSCoL on the Topical-Chat dataset (Gopalakrishnan et al., 2019) using both human and automatic evaluations. Our results demonstrate the superiority of DiSCoL over DialoGPT in terms of generating higher-quality responses, thus indicating the usefulness of convlines as dialogue control mechanisms for generating more engaging responses. We release the source code and trained models to facilitate future dialogue research (Github link: https://github.com/PlusLabNLP/Dialogue_System_Hackathon).

DiSCoL System

The architecture of our proposed DiSCoL demo system and its modules are depicted in Figure 3. A user converses with the system by writing an utterance as input. This utterance passes through all the modules, and in each module new information, such as its extracted entities, topics, and convlines, is augmented.

[Figure 3 content: pipeline from the user utterance "what do you think about Game of Thrones?" through the topic classifier (General Entertainment), entity extractor (Game of Thrones), convline generator (game of thrones, game of lions, love the game, favorite show, favorite character, show lol), and response generator ("I love the game of thrones! My favorite show lol! Game of lions is also my favorite show, who is your favorite character?").]

Figure 3: Architecture of the DiSCoL system

The last module, the response generator, incorporates all this information to generate a response as the output of the system. In this section, we explain each module in detail.
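The four-module flow described above can be sketched as a simple function chain. This is a hedged illustration of the pipeline's data flow only, not the authors' actual code: every function name and body here is a hypothetical placeholder standing in for the neural modules.

```python
# Hypothetical sketch of DiSCoL's four-module pipeline. All names and the
# toy logic inside each function are illustrative placeholders, NOT the
# actual BERT/BART modules; only the data flow mirrors the paper.

def extract_entities(utterance):
    # Placeholder for the BERT-based entity extractor.
    return [tok for tok in utterance.split() if tok.istitle()]

def classify_topic(utterance):
    # Placeholder for the BERT-based topic classifier.
    return "General"

def generate_convlines(utterance, entities, topics):
    # Placeholder for the BART-based convline generator.
    return ["favorite show"]

def generate_response(utterance, convlines, topics):
    # Placeholder for the BART-based response generator.
    return "I love it! It is my favorite show."

def discol_pipeline(utterance, user_edits=None):
    entities = extract_entities(utterance)
    topics = classify_topic(utterance)
    convlines = generate_convlines(utterance, entities, topics)
    if user_edits is not None:      # the demo lets users edit/remove convlines
        convlines = user_edits(convlines)
    return generate_response(utterance, convlines, topics)

print(discol_pipeline("what do you think about Game of Thrones?"))
```

The `user_edits` hook reflects the demo's key interactive feature: convlines are surfaced to the user between the third and fourth module, so edits apply before response generation.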
Entity Extractor

One of the principal components in conversational systems is the set of entities that both interlocutors are interested in conversing about. It is crucial that the system can identify the main entities in the dialogue context and try to continue the conversation by providing more relevant information, or even by expressing its opinions and impressions regarding them. Therefore, in DiSCoL we take the user's utterance as the dialogue context and extract its entities. This task is known as named entity recognition (NER), where each token in the text is classified into one of a set of predefined classes such as person, organization, location, or other.

Toward this goal, we leverage the BERT model (Devlin et al., 2019) fine-tuned on the CoNLL-2003 dataset (Sang and De Meulder, 2003), a well-known corpus for the NER task. (We use the fine-tuned BERT model provided by Huggingface: https://github.com/huggingface/transformers.) We detokenize the output of the fine-tuned BERT model to recover the original form of the entity tokens, and we disregard the predicted entity classes since in our case they do not provide additional benefit. As shown in Figure 3, all entities with labels other than O are returned by the entity extractor module.

Topic Classifier

Knowing the topic that the user is enthusiastic to discuss would be beneficial for the dialogue system to generate utterances about that specific topic. The blue box in Figure 3 represents the topic classifier, which takes the user's utterances and predicts the most relevant topics from a predefined set. These topics are later used for predicting convlines and, consequently, generating responses. Due to the proven effectiveness of the BERT model (Devlin et al., 2019) and its wide applicability in many classification tasks, we incorporate it into the topic classifier module of DiSCoL. We fine-tune BERT on pairs of utterances and their aligned topics with the main goal of minimizing the cross-entropy loss.
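The entity extractor's post-processing step (keep every non-O span, drop the class labels, detokenize WordPiece pieces) can be illustrated with plain BIO-tag decoding. This is a minimal sketch of that post-processing, assuming BIO-tagged, WordPiece-tokenized model output; the tokens and tags below are illustrative, not real model output.

```python
# Illustrative post-processing for the entity extractor: given per-token BIO
# tags from a fine-tuned NER model (e.g. BERT on CoNLL-2003), merge WordPiece
# continuations, keep every span whose label is not "O", and discard the class
# labels, as DiSCoL does.

def bio_to_entities(tokens, tags):
    """Collect surface forms of all non-O spans from BIO-tagged tokens."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tok.startswith("##"):          # re-attach WordPiece continuations
            if current:
                current[-1] += tok[2:]
            continue
        if tag == "O":
            if current:
                entities.append(" ".join(current))
                current = []
        elif tag.startswith("B-"):        # a new entity span begins
            if current:
                entities.append(" ".join(current))
            current = [tok]
        else:                             # "I-" continues the open span
            current.append(tok)
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["what", "about", "Game", "of", "Throne", "##s", "?"]
tags   = ["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"]
print(bio_to_entities(tokens, tags))  # ['Game of Thrones']
```

Note that the class portion of each tag ("MISC" here) is read only to detect span boundaries; the returned entities carry no class, matching the module's behavior.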
Convline Generator

DiSCoL's main contribution is the convline generator module, depicted as the purple box in Figure 3. Convlines are abstract representations, or content plans, of utterances throughout the conversation. Such representations, known as storylines or story plots in the story generation context, have recently demonstrated their efficiency in generating higher-quality stories (Yao et al., 2019; Fan et al., 2019; Goldfarb-Tarrant et al., 2020; Rashkin et al., 2020). Story generation models leverage a plan-and-write framework that succeeds in generating fluent and informative stories through the intervention of storylines as an intermediate step. In this work, we follow the same idea but in the context of conversational systems. In particular, we aim to show that the controlled generation of high-quality utterances, by planning in advance and leveraging useful abstract-level convlines, can be beneficial for dialogue systems as well.

To compose the convlines, the main component of the convline generator module, we extract sequences of important words in each utterance from existing human-human conversational data. We pursue the YAKE (Campos et al., 2018) method, which relies on the text's statistical features to extract the most important keywords of an utterance. It has shown its superiority over other state-of-the-art unsupervised approaches such as TF-IDF and RAKE (Rose et al., 2010).

In order to train the convline generator, we extract pairs (u_i, r_i) as consecutive pairs of dialogue context utterances and their corresponding ground-truth responses from the human-human conversational data. For each dialogue context utterance (u_i), we extract its entities (e_i) and topics (t_i) using the entity extractor and topic classifier modules. Each response (r_i) is replaced by its convlines (c_i) obtained by the YAKE algorithm.
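The keyword-extraction idea can be made concrete with a much-simplified, pure-Python stand-in: score terms by a statistical feature (here, just position and frequency; the real YAKE uses a richer feature set), build 1- to 3-gram candidates, drop duplicates, and keep the lowest-scoring (most salient) phrases. Everything below, including the toy scoring function and stopword list, is an assumption for illustration and not the YAKE algorithm itself.

```python
# A deliberately simplified, illustrative stand-in for YAKE-style convline
# extraction (the actual system uses the YAKE library): score terms, build
# 1-3 gram candidates, dedup, and return the lowest-scoring phrases.

from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "are", "do", "you", "i", "my",
             "what", "about", "think", "like", "to", "and"}

def simple_convlines(utterance, top=3, max_n=3):
    words = [w.strip("?!.,").lower() for w in utterance.split()]
    freq = Counter(words)

    # Toy scoring: lower is better; frequent words early in the text win.
    def term_score(w, pos):
        return (pos + 1) / (freq[w] ** 2)

    candidates = {}
    for n in range(max_n, 0, -1):            # prefer longer n-grams first
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            # Crude dedup: skip phrases contained in an already-kept phrase.
            if any(phrase in seen for seen in candidates):
                continue
            candidates[phrase] = sum(term_score(w, i + j)
                                     for j, w in enumerate(gram)) / n
    return sorted(candidates, key=candidates.get)[:top]

print(simple_convlines("what do you think about Game of Thrones?"))
```

Note the ordering mirrors the paper's extraction scheme: 3-grams are collected first, then 2-grams and 1-grams not already covered by a longer keyword.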
The constructed input data are in (u_i, e_i, t_i, c_i) format. The convline generator is a conditional model that generates the most probable convlines given the provided dialogue context utterance and its entities and topics. To this end, we apply BART (Lewis et al., 2019), a state-of-the-art pre-trained sequence-to-sequence generative model. It combines a bidirectional encoder like that of BERT (Devlin et al., 2019) to encode the input with a GPT-like (Radford et al., 2018) auto-regressive decoder to generate convlines as the output. The top block in Figure 4 encapsulates the training process of the convline module. We fine-tune BART on the constructed training data with the objective of minimizing the negative log likelihood shown in Equation 1:

L_line_gen = - sum_{i=1}^{n} log P(c_i | u_i, t_i, e_i)    (1)

During inference, the fine-tuned BART model takes the user's utterance plus its inferred entities and topics and predicts the most probable convlines, as depicted in the bottom block of Figure 4. We use top-k sampling (Fan et al., 2019) with k = 5 and a temperature of . for the generation.

Response Generator

The last module in the DiSCoL pipeline is the response generator, which is identical to the convline generator except for the type of inputs and outputs. The response generator takes the dialogue context utterance, its convlines, and its topics as inputs and generates a response conditioned on those data:

L_resp_gen = - sum_{i=1}^{n} log P(r_i | u_i, t_i, c_i)    (2)

During training, we provide utterances, their topics, and the convlines extracted by YAKE to the BART model and fine-tune this pre-trained conditional generator. As shown in Equation 2, the training objective is to maximize the probability of generating the ground-truth responses given their context utterances, topics, and convlines. During inference, the generator attempts to produce the most probable responses that incorporate the convlines returned by the convline generator module.
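Conditioning a seq2seq model on several fields requires serializing them into a single source sequence. The sketch below shows one plausible way to build (source, target) training pairs for the convline generator; the separator tokens and field order are assumptions for illustration, as the paper does not specify the exact serialization.

```python
# Hedged sketch of serializing (u_i, t_i, e_i) -> c_i training examples for a
# seq2seq model like BART. The "<sep>" and ";" delimiters and the field order
# are illustrative assumptions, not the authors' documented format.

def make_convline_example(utterance, entities, topics, convlines):
    source = " <sep> ".join([utterance,
                             " ; ".join(topics),
                             " ; ".join(entities)])
    target = " ; ".join(convlines)  # YAKE keywords of the ground-truth response
    return source, target

src, tgt = make_convline_example(
    utterance="what do you think about Game of Thrones?",
    entities=["Game of Thrones"],
    topics=["Entertainment"],
    convlines=["favorite show", "adventure genre"],
)
print(src)
print(tgt)
```

The response generator's examples would be built the same way, swapping the entity field for convlines on the source side and using the ground-truth response r_i as the target, matching Equation 2.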
Dataset

We test our system on the Topical-Chat dataset (Gopalakrishnan et al., 2019), which includes knowledge-grounded human-human conversations covering a set of 8 different topics. This dataset was collected by employing Amazon Mechanical Turk (AMT) workers who were provided with specific entities and some external knowledge (Wikipedia lead sections, Washington Post articles, or some Reddit fun facts) to chat about. Therefore, each utterance in a conversation is based either on the provided knowledge sources or on the worker's personal knowledge. Overall, 261 popular entities spanning 8 topics (Fashion, Sports, Books, Politics, etc.) were selected for the dataset collection. We add a General topic for utterances (e.g., greetings) that do not include any specific content, such as "hi, how are you today?".

Topic Annotation

Although each utterance in the Topical-Chat dataset comes from either external or personal knowledge, it lacks specified topics. These topics are necessary for DiSCoL's modules. We manually match all 261 entities in the external knowledge to one of the topics in the predefined set, and thereby easily label all utterances about those entities with their matched topics. This covers about 78% of the overall 188,378 utterances (the easy_set). As an example, the utterance "Do you know Tom Brady" is about the "Tom Brady" entity, which indicates the "Sports" topic. The remaining challenging utterances are based on personal knowledge, where entities are not directly specified. We pursue the following context-based heuristics to label such challenging_set utterances with their relevant topics. If the utterance's neighbors are from the easy_set and share the same entity, we assign that entity's topic to the utterance; if the neighbors contain different entities, we label the given utterance with both neighbors' topics. If the previous rules do not apply to an utterance in the challenging_set, we use the most frequent topic in the dialogue as its topic.

In parallel to the above heuristics, and in order to improve the quality of the assigned topics, we apply a keyword-based classifier that classifies challenging_set utterances with appropriate topics. The keyword-based classifier retrieves the entity most similar to each utterance's keywords from the overall 261 entities, using their BERT embeddings. The manually matched topics for the retrieved entity are assigned to the utterance. We only keep the 5,323 challenging_set utterances whose labels from the context-based heuristics and the keyword-based classifier agree (see statistics in Table 1). We fine-tune the BERT model as the topic classifier for 10 epochs and obtain an accuracy of 85.55 on the validation set.

[Figure 4: Training (top) and inference (bottom) of the convline generator: BART encodes each utterance with its topics t_i and entities (e.g., "Are you an NFL fan?" / Sports / NFL; "Nice. Do you like Shakespeare?" / Books / Shakespeare; "I've never see Pokemon, and I don't think I ever will." / General Entertainment / Pokemon) and decodes YAKE-extracted convlines c_i such as "bears fan".]

Table 1: Statistics of different groups of utterances (uttrs.) in the Topical-Chat dataset.

Uttr. | Easy_set | Challenging_set | General_uttr.
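The context-based heuristics for labeling challenging_set utterances can be sketched directly. This is an illustrative reading of the three rules (inherit a shared neighbor entity's topic; union both neighbors' topics when entities differ; fall back to the dialogue's most frequent topic); the data shapes and the "General" default are assumptions, not the authors' code.

```python
# Illustrative sketch of the context-based topic-labeling heuristics for
# challenging_set utterances. Dialogue representation is an assumption:
# a list of dicts with 'topics' (list of str) and 'entity' (str or None).

from collections import Counter

def label_utterance(idx, dialog):
    """Assign topics to the (unlabeled) utterance at position idx."""
    neighbors = [dialog[j] for j in (idx - 1, idx + 1)
                 if 0 <= j < len(dialog) and dialog[j]["topics"]]
    if neighbors:
        entities = {n["entity"] for n in neighbors}
        if len(entities) == 1:            # neighbors share an entity: inherit
            return list(neighbors[0]["topics"])
        topics = []                        # different entities: take both topics
        for n in neighbors:
            topics.extend(n["topics"])
        return sorted(set(topics))
    # Fallback: most frequent topic anywhere in the dialogue.
    counts = Counter(t for u in dialog for t in u["topics"])
    return [counts.most_common(1)[0][0]] if counts else ["General"]

dialog = [
    {"topics": ["Sports"], "entity": "Tom Brady"},
    {"topics": [], "entity": None},        # challenging_set utterance
    {"topics": ["Sports"], "entity": "Tom Brady"},
]
print(label_utterance(1, dialog))  # ['Sports']
```

In the paper's pipeline, a label produced this way is kept only if the independent keyword-based classifier assigns the same topic.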
Convline Extraction

Convlines are necessary components in the training of the DiSCoL system. We leverage YAKE (Campos et al., 2018) for retrieving the discourse keywords that represent convlines. YAKE assigns an importance score to tokens in a text following an unsupervised approach built on features extracted from the text. In this model, a set of features is computed for each term in the text. Subsequently, a list of candidates (n-grams of tokens) is created. Next, the Levenshtein distance is used to remove duplicate keywords. At the end, the aggregation of token scores in each keyword represents the keyword's score. Keywords with lower scores are returned as the text's salient convlines. Using YAKE, we generate contiguous 1-, 2-, and 3-gram candidate convlines. We extract 3-gram convlines first, followed by 2-grams and 1-grams that are not included in the previously returned keywords.

We fine-tune BART-large for both the convline and response generator models for 3 epochs and checkpoint the best epoch based on validation perplexity. We compare the performance of the DiSCoL system against DialoGPT (Zhang et al., 2019), one of the strongest recent baselines, which has shown its efficiency in generating consistent and relevant responses.

Table 2: Statistics and inter-annotator agreements of AMT evaluations on DiSCoL and DialoGPT performances.

Dialogue Contexts | Annotators | Kappa | Pearson
100 | 33 | 0.44 | 0.5

Figure 5: Automatic evaluations (diversity, BLEU, relevancy, engagement scores) on responses generated by the DiSCoL and DialoGPT systems
Evaluation

To explore the efficiency of our proposed controlled response generation, we apply both automatic and human evaluations.
Due to the multi-faceted nature of dialogue quality, it is necessary to evaluate from different aspects (See et al., 2019; Mehri and Eskenazi, 2020). To this end, we compare the quality of DiSCoL- and DialoGPT-generated responses by computing different metrics. (We fine-tune the BART model using https://github.com/pytorch/fairseq.) We conduct automatic evaluations and compute evaluation metrics on 23,530 consecutive utterance pairs (dialogue context utterances and their ground-truth responses) from the Topical-Chat test set. The measured metrics are averaged over all utterance pairs within the test set. We compute BLEU-3 (Papineni et al., 2002) to evaluate the similarity of generated responses to ground-truth responses based on 3-gram overlaps. Due to the one-to-many nature of open-domain dialogue and the imperfection of such word-overlap metrics (Liu et al., 2016; Ghazarian et al., 2019; Mehri and Eskenazi, 2020), we also focus on three main aspects as better indications of system performance: diversity, relevancy, and engagingness.

Figure 6: Human evaluations on responses generated by the DiSCoL and DialoGPT systems

Diversity measures the percentage of distinct tokens generated by each model; Li et al. (2015) proposed distinct-2, which computes the number of distinct bigrams divided by the total number of generated words. Relevancy uses both the dialogue context utterance and the generated response to assess how relevant the response is to the given utterance (Tao et al., 2018; Ghazarian et al., 2019); we use the contextualized Ruber metric for this purpose (Ghazarian et al., 2019). Finally, since open-domain dialogue systems must produce responses that are both relevant and interesting to satisfy users (Ghazarian et al., 2020), we further validate the systems based on the engagingness of responses. We compute engagingness as the probability score of the engaging class predicted by the engagement classifier of Ghazarian et al. (2020).
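The distinct-2 diversity metric described above is simple enough to state exactly: the number of unique bigrams across all generated responses divided by the total number of generated words. A minimal reference implementation:

```python
# Distinct-2 (Li et al., 2015): unique bigrams over all generated responses
# divided by the total number of generated tokens. Whitespace tokenization
# here is a simplifying assumption.

def distinct_2(responses):
    bigrams, total_tokens = set(), 0
    for resp in responses:
        tokens = resp.split()
        total_tokens += len(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return len(bigrams) / total_tokens if total_tokens else 0.0

responses = ["i like it", "i like the show", "i like it"]
print(distinct_2(responses))  # 4 unique bigrams / 10 tokens = 0.4
```

Repetitive systems collapse many responses onto the same few bigrams, driving the score down, which is why the metric rewards lexical variety across the whole test set rather than within a single response.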
We extend our evaluations by running AMT experiments to collect human judgments of the systems' quality. We randomly select 100 dialogue context utterances from the Topical-Chat test set. For each given dialogue context utterance, we ask three AMT workers to rate the responses generated by DiSCoL and DialoGPT while keeping the systems anonymous. Participants rate the relevancy, engagingness, and overall quality of each response on a 5-point Likert scale (1 indicating an irrelevant/not engaging and low-quality response). The statistics of the AMT experiment are shown in Table 2.
Automatic Evaluation. Figure 5 depicts the average diversity, BLEU, relevancy, and engagingness scores from the automatic evaluation metrics over all responses generated by the DiSCoL and DialoGPT systems. DiSCoL's strength is noticeable from its higher BLEU score and its more diverse, relevant, and engaging responses. Overall, diversity is low due to the limited number of distinct topics in the Topical-Chat dataset. The BLEU metric is low for both systems, which shows its inadequacy for open-domain evaluation, where a response can be entirely appropriate and at the same time not similar to the ground-truth response.
Human Evaluation.
The bars in Figure 6 show the average human annotations for the different qualities of the generated utterances. Each response's score is the mean of three annotators' ratings. According to Figure 6, annotators rate responses generated by DiSCoL higher in terms of relevancy, engagingness, and overall quality. This is evidence of the positive impact of incorporating convlines to guide the dialogue system towards generating controllable, relevant, and contentful responses that encourage the user to converse for a longer time.
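The aggregation step described above (each response's score is the mean of its three annotators' 5-point ratings, then averaged per system) can be written out directly. The ratings below are made-up illustrative values, not the study's data.

```python
# Sketch of the human-evaluation aggregation: per-response mean of three
# Likert ratings, then a per-system average over all rated responses.
# The sample ratings are illustrative, NOT the paper's collected data.

def aggregate(ratings_per_response):
    """ratings_per_response: list of lists, one inner list per response."""
    per_response = [sum(r) / len(r) for r in ratings_per_response]
    return sum(per_response) / len(per_response)

ratings = [[4, 5, 4], [3, 4, 4], [5, 5, 4]]
print(aggregate(ratings))
```

With equally sized inner lists this equals a grand mean, but keeping the per-response step matters when annotator counts vary, and the per-response means are also the quantities used for inter-annotator agreement statistics such as those in Table 2.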
Conclusion

We have introduced DiSCoL, which leverages convlines as an intermediate step towards generating more informative and controllable responses in dialogues. These convlines are predicted and subsequently leveraged in the response generation process. Additionally, DiSCoL allows users to manipulate convlines towards their favorite conversational direction. Our findings show that, in contrast to other transformer-based dialogue systems that lack content planning, our system takes advantage of such a principled structure to hold better and more engaging conversations with users.
Ethics
Through all phases of the conducted research and the developed DiSCoL system, all co-authors agreed to and adhered to the ACM Code of Ethics. Our effort was to ensure that we stuck to the conscience of the profession and considered the Code's principles. We certify that this system and all the presented evaluations are compatible with the provided code.
DiSCoL System’s Development
The main contribution of our proposed DiSCoL system is to augment controllable response generation with the intervention of convlines, which leads the generation towards producing more relevant and interesting responses. Indeed, DiSCoL provides an opportunity for users to manipulate the convlines and guide the system to continue the conversation in the user's favorite direction. All of DiSCoL's modules leverage pre-trained large language models such as BART (Lewis et al., 2019) and fine-tune them on the recently proposed Topical-Chat dataset (Gopalakrishnan et al., 2019). One potential harm DiSCoL could cause is the possibility of generating improper responses conditioned on inferred convlines with abusive content. Since the convline and response generators are BART models fine-tuned on human-human conversations that do not contain profanity or inappropriate content (Gopalakrishnan et al., 2019), the convlines, which are the important informative units of the utterances, should be free of bias and obscene content. However, there is still a possibility of dual-use attacks: fine-tuning the generators on conversations augmented with offensive language could teach them to generate such inappropriate content. The identification of such attacks, which could occur in almost all learnable models, and ways to overcome them constitute a distinct and large research area that is out of this paper's scope.
DiSCoL System’s Evaluation
Alongside the automatic evaluation demonstrating the efficiency of controllable generation using convlines, we further collected human annotations by conducting Amazon Mechanical Turk (AMT) experiments. We provided the different systems' responses for given utterances while keeping the systems anonymous, and asked users to rate the responses by considering the different aspects explained in the AMT surveys. We estimated the average time users would spend on each survey and compensated them fairly according to an hourly wage.

We protected the privacy of all AMT workers who participated in the experiments. Our experiments did not require knowing users' personal information; therefore their personal information, including gender, ethnicity, etc., is not revealed. This removes the need for IRB approval. Our system's target is the NLP open-domain conversational AI community. Its main goal is to achieve engaging conversations with the incorporation of convlines and to increase the user's ability to control the generation process. Like other proposed dialogue systems, we anticipate specific failure modes, particularly for novel conversations on new topics. Lifelong learning in dialogue systems, which is not the focus of this work, is a research area that attempts to enhance conversational systems' ability to deal with such novel scenarios.
References
Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Mário Jorge, Célia Nunes, and Adam Jatowt. 2018. YAKE! Collection-independent automatic keyword extractor. In European Conference on Information Retrieval, pages 806–810. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL-HLT).

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In Association for Computational Linguistics (ACL).

Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Methods for Optimizing and Evaluating Neural Language Generation (NeuralGen workshop of NAACL-HLT).

Sarik Ghazarian, Ralph M. Weischedel, Aram Galstyan, and Nanyun Peng. 2020. Predictive engagement: An efficient metric for automatic evaluation of open-domain dialogue systems. In AAAI, pages 7789–7796.

Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. Content planning for neural story generation with Aristotelian rescoring. In Empirical Methods in Natural Language Processing (EMNLP).

Seraphina Goldfarb-Tarrant, Haining Feng, and Nanyun Peng. 2019. Plan, write, and revise: An interactive system for open-domain story generation. In , volume 4, pages 89–97.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In INTERSPEECH.

Alexander Gruenstein and Stephanie Seneff. 2007. Releasing a multimodal dialogue system into the wild: User support mechanisms. In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 111–119.

Joakim Gustafson, Linda Bell, Jonas Beskow, Johan Boye, Rolf Carlson, Jens Edlund, Björn Granström, David House, and Mats Wirén. 2000. Adapt: a multimodal conversational dialogue system in an apartment domain. In The Sixth International Conference on Spoken Language Processing (ICSLP), Beijing, China, pages 134–137.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Association for Computational Linguistics (ACL).

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. In North American Chapter of the Association for Computational Linguistics (NAACL-HLT).

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In International Joint Conference on Natural Language Processing (IJCNLP).

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Empirical Methods in Natural Language Processing (EMNLP).

Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised evaluation of interactive dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue.

Kyo-Joong Oh, Dongkun Lee, Byungsoo Ko, and Ho-Jin Choi. 2017. A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation. In , pages 371–375. IEEE.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

M. Sarosa, M. Kusumawardani, A. Suyono, and M. H. Wijaya. 2020. Developing a social media-based chatbot for English learning. In IOP Conference Series: Materials Science and Engineering, page 012074. IOP Publishing.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In North American Chapter of the Association for Computational Linguistics (NAACL-HLT).

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the AAAI Conference on Artificial Intelligence.

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.