Modeling Topical Relevance for Multi-Turn Dialogue Generation
Hainan Zhang∗, Yanyan Lan†, Liang Pang, Hongshen Chen, Zhuoye Ding, Dawei Yin
CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, CAS
University of Chinese Academy of Sciences
JD.com, Beijing, China
Baidu.com, Beijing, China
{zhanghainan6,chenhongshen,dingzhuoye}@jd.com, {lanyanyan,pangliang}@ict.ac.cn, [email protected]
∗ This work was done when the first author was a Ph.D. student at ICT, CAS.
† Corresponding Author.

Abstract
Topic drift is a common phenomenon in multi-turn dialogue. Therefore, an ideal dialogue generation model should be able to capture the topic information of each context, detect the relevant contexts, and produce appropriate responses accordingly. However, existing models usually use word or sentence level similarities to detect the relevant contexts, which fail to well capture the topical level relevance. In this paper, we propose a new model, named STAR-BTM, to tackle this problem. Firstly, the Biterm Topic Model is pre-trained on the whole training dataset. Then, the topic level attention weights are computed based on the topic representation of each context. Finally, the attention weights and the topic distribution are utilized in the decoding process to generate the corresponding responses. Experimental results on both Chinese customer services data and English Ubuntu dialogue data show that STAR-BTM significantly outperforms several state-of-the-art methods, in terms of both metric-based and human evaluations.
1 Introduction

Multi-turn dialogue generation is widely used in many natural language processing (NLP) applications, such as customer services, mobile assistants and chatbots. Given a conversation history containing several contexts, a dialogue generation model is required to automatically output an appropriate response. Therefore, how to fully understand and utilize these contexts is important for designing a good multi-turn dialogue generation model.

Different from single-turn dialogue generation, people usually model multi-turn dialogue generation in a hierarchical way. A typical example is the Hierarchical Recurrent Encoder-Decoder (HRED) model [Serban et al., 2016; Sordoni et al., 2015]. In the encoding phase, a recurrent neural network (RNN) based encoder is first used to encode each context as a sentence-level vector, and then a hierarchical RNN is utilized to encode these context vectors into a history representation. In the decoding phase, another RNN decoder generates the response based on the history representation. The parameters of both encoder and decoder are learned by maximizing the averaged likelihood of the training data. However, the desired response is usually only dependent on some relevant contexts, instead of all the contexts. Recently, some works have been proposed to model the relevant contexts by using similarity measures. For example, Tian et al. [2017] calculate the cosine similarity of the sentence embeddings between the current context and the history contexts as the attention weights, Xing et al. [2018] introduce word and sentence level attention mechanisms to HRED, and Zhang et al. [2019] utilize the sentence level self-attention mechanism to detect the relevant contexts. However, these similarities are defined on either the word or the sentence level, which cannot well tackle the topic drift problem in multi-turn dialogue generation.

context1: 你好，在吗？ (Hello?)
context2: 有什么问题我可以帮您呢？ (What can I do for you?)
context3: 商品降价了，我要申请保价 (The product price has dropped. I want a low-price.)
context4: 好的，这边帮您申请，商品已经收到了吧？ (Ok, I will apply for you. Have you received the product?)
context5: 东西收到了发票不在一起吗？ (I have received the product, but the invoice was not included?)
context6: 开具电子发票不会随货寄出 (The electronic invoice will not be shipped with the goods.)
current context: 是发我邮箱吗？ (Is it sent to my email?)
response: 是的，请您提供邮箱地址，电子发票24小时寄出。 (Yes, please provide your email address, we will send the electronic invoice in 24 hours.)

Table 1: An example from the customer services dataset. The word color indicates the relevant topic words in the contexts and response, showing the topic-drift phenomenon.

Here we give an example conversation, as shown in Table 1. The contexts are of three different topics. The (context1, context2) pair talks about 'greeting', the (context3, context4) pair talks about 'low-price', and the (context5, ..., response) pair talks about 'invoice'. In this case, using all the contexts indiscriminately will obviously introduce noise into the decoding process, which will hurt the performance of the multi-turn dialogue generation model. If we use word level similarities to locate the relevant contexts, the current context and context4 in the example will be associated because 'send' and 'receive' are highly similar words, which is clearly wrong.
If we use sentence level similarities to locate the relevant contexts, it may still involve the falsely relevant context4 into consideration.

We argue that context relevance should be computed at the topic level, to better tackle the topic drift problem in multi-turn dialogue generation. From both linguistic and cognitive perspectives, a topic is a high level cluster of knowledge, which can describe the relationship between sentences in the context, and plays an important role in human dialogue for directing the focus of attention. In this paper, we propose a new model, namely STAR-BTM, to model the Short-text Topic-level Attention Relevance with the Biterm Topic Model (BTM) [Yan et al., 2013]. Specifically, we first pre-train the BTM model on the whole training data, where every customer-server pair in the context is split out as a short document. Then, we use the BTM to obtain each sentence's topic distribution and calculate the topic distribution similarity between the current context and each history context as the relevance attention. Finally, we utilize the relevance attention and the topic distribution to conduct the decoding process. The BTM model and the text generation model are jointly learned to improve their performance in this process.

In our experiments, we use two public datasets to evaluate our proposed models, i.e., the Chinese customer services dataset and the English Ubuntu dialogue corpus. The experimental results show that STAR-BTM generates more informative and suitable responses than traditional HRED models and their attention variants, in terms of both metric-based evaluation and human evaluation. Besides, we show the relevant attention words, indicating that STAR-BTM obtains results coherent with human understanding.

2 Related Work

Recently, multi-turn dialogue generation has gained more attention in both the research community and industry, compared with single-turn dialogue generation [Li et al., 2017; Mou et al., 2017; Zhang et al., 2018a; Zhang et al., 2018b]. One of the reasons is that it is closely related to real applications, such as chatbots and customer service. More importantly, multi-turn dialogue generation needs to consider more information and constraints [Chen et al., 2018; Zhang et al., 2018c; Zhang et al., 2019; Wu et al., 2017; Zhou et al., 2016], which brings more challenges for the researchers in this area. To better model the historical information, Serban et al. [2016] propose the HRED model, which uses a hierarchical encoder-decoder framework to model all the contexts' information. With the widespread use of HRED, more and more variant models have been proposed. For example, Serban et al. [2017b; 2017a] propose Variable HRED (VHRED) and MrRNN, which utilize latent variables as intermediate states to generate diverse responses.

However, it is unreasonable to use all the contexts indiscriminately for the multi-turn dialogue generation task, since the responses are usually only associated with a portion of the previous contexts. Therefore, some researchers try to use similarity measures to define the relevance of the contexts. Tian et al. [2017] propose a weighted sequence (WSeq) attention model for HRED, which uses the cosine similarity as the attention weight to measure the correlation of the contexts. But this model only uses an unsupervised sentence level representation, which fails to capture some detailed semantic information. Recently, Xing et al. [2018] introduced the traditional attention mechanism
[Bahdanau et al., 2015] into HRED, named the hierarchical recurrent attention network (HRAN). In this model, the attention weight is calculated based on the current state, the sentence level representation and the word level representation. However, the word level attention may introduce some noisy relevant contexts. Chen et al. [2018] propose to introduce the memory network into the VHRED model, so that the model can remember the context information. Theoretically, it can retrieve some relevant information from the memory in the decoding phase; however, it is not clear whether and how the system accurately extracts the relevant contexts. Zhang et al. [2019] proposed to use sentence level self-attention to model the long distance dependency of contexts, so as to detect the relevant contexts for multi-turn dialogue generation. Though it has the ability to tackle the position bias problem, the sentence level self-attention is still limited in capturing the topic level relevant contexts.

The motivation of this paper is to detect the topic level attention relevance for multi-turn dialogue generation. It is a more proper way to deal with the topic drift problem, as compared with the traditional word or sentence level methods. Some previous works [Xing et al., 2017; Xing et al., 2018] have been proposed to use topic models in dialogue generation. They mainly use the topic model to provide some topic related words for generation, while our work focuses on detecting the topic level relevant contexts.
3 STAR-BTM Model

In this section, we describe our Short-text Topic Attention Relevance with Biterm Topic Model (STAR-BTM) in detail, with the architecture shown in Figure 1. STAR-BTM consists of three modules, i.e., the pre-trained BTM model, the topic level attention module and the joint learning decoder. Firstly, we pre-train the BTM model on the whole training data, to obtain the topic word distribution of each context. Secondly, the topic level attention is calculated as the similarity between the topic distributions of the current context and each history context. After that, the attention weights are multiplied with the hierarchical hidden states in HRED to obtain the history representation. Finally, the history representation and the topic distribution of the current context are concatenated to decode the response step by step.
Figure 1: The architecture of STAR-BTM.

From the architecture, we can see that STAR-BTM introduces the short text topic model into the HRED model, to incorporate the topic level relevant contexts into the decoding process. It is clear that the topic level distribution can provide more specific topic information than only using the word and sentence level representations. What is more, the topic model firstly 'sees' the whole data globally through pre-training, and is then fine-tuned by joint learning with the generation model.
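To make the data flow concrete, the following minimal sketch composes the three modules in the order described above. The callables word_rnn, context_rnn, btm, topic_attention and decoder are hypothetical placeholders for the components detailed in the following subsections, not the authors' released code.

def star_btm_respond(contexts, word_rnn, context_rnn, btm,
                     topic_attention, decoder):
    # 1. Hierarchical encoding: sentence vectors h^(i), then history states s_1..s_N.
    h = [word_rnn(c) for c in contexts]
    s = context_rnn(h)
    # 2. The pre-trained BTM supplies a topic for every context sentence.
    t = [btm.infer(c) for c in contexts]
    # 3. Topic level attention over the history states (Sec. 3.2), then
    #    decoding conditioned on the history vector and the current topic.
    alpha = topic_attention(t, contexts)
    S_N = sum(a * s_i for a, s_i in zip(alpha, s))
    return decoder.generate(S_N, t[-1])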
3.1 Pre-trained BTM Model

We use the BTM model pre-trained on the whole training data to obtain the topic distributions. The pre-trained model can be viewed as background knowledge, which supplies additional information for the current dialogue session. As in real human dialogue, background knowledge about potential topics helps the model detect the actual focus of attention.

BTM [Yan et al., 2013] is a widely used topic model especially designed for short text, which is briefly introduced as follows. For each co-occurring biterm $b=(w_i, w_j)$ of words $w_i$ and $w_j$, the joint probability of $b$ is written as:
$$P(b) = \sum_t P(t)P(w_i|t)P(w_j|t),$$
where $t$ stands for a topic. To infer the topics of a document, BTM assumes that the topic proportions of a document equal the expectation of the topic proportions of the biterms generated from the document. Then we have
$$P(t|d) = \sum_b P(t|b)P(b|d), \quad (1)$$
where $d$ is a document. Both $P(t|b)$ and $P(b|d)$ can be calculated via Bayes' formula as follows:
$$P(t|b) = \frac{P(t)P(w_i|t)P(w_j|t)}{\sum_t P(t)P(w_i|t)P(w_j|t)}, \qquad P(b|d) = \frac{n_d(b)}{\sum_b n_d(b)},$$
where $n_d(b)$ is the frequency of the biterm $b$ in the document $d$. The parameter inference is based on Gibbs sampling.

Now we introduce how we apply BTM in our work. Firstly, we split the whole training data $D = \{(C, Y) = (c_1, \dots, c_N, Y)\}$ into context pairs, i.e., $D = \{(c_1, c_2), (c_2, c_3), \dots, (c_N, Y)\}$. In the training process, we treat each context pair as one document for BTM. This is reasonable because each pair can be viewed as a single-turn dialogue, and the input and output of a single-turn dialogue are usually of the same topic. After running Gibbs sampling, we obtain the word distribution of each topic $P(w_i|t)$ and the topic distribution $P(t)$. In the inference process, given each sentence $c_i$ in $D$, the topic of $c_i$ is computed by $t_i = \arg\max_t P(t|c_i)$ using Equation 1.

The BTM model is more suitable for the dialogue generation task than traditional topic models such as the Latent Dirichlet Allocation (LDA) model. That is because dialogue utterances are short texts with omitted information, which makes LDA unreliable. BTM uses word co-occurrence as the core evidence to determine the topic: it depends only on local co-occurrence information, breaks the document boundary, and uses the information of the entire corpus instead of a single document to overcome the sparsity problem in short text topic modeling.
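As an illustration of Equation 1, the sketch below infers the topic distribution of a short document from a pre-trained BTM. The arrays phi (for $P(w|t)$) and theta (for $P(t)$) are assumed to come from Gibbs sampling; all variable names are ours, not from the BTM implementation.

import numpy as np
from itertools import combinations

def infer_topic_distribution(token_ids, phi, theta):
    """Return P(t|d) for a short document given as a list of word ids (Eq. 1)."""
    biterms = list(combinations(token_ids, 2))   # all word co-occurrence pairs
    if not biterms:                              # too short: fall back to the prior P(t)
        return theta.copy()
    p_t_d = np.zeros_like(theta)
    for wi, wj in biterms:
        joint = theta * phi[:, wi] * phi[:, wj]      # P(t)P(wi|t)P(wj|t)
        p_t_d += joint / joint.sum() / len(biterms)  # P(t|b) weighted by P(b|d)
    return p_t_d

# Hard topic assignment for a sentence c_i: t_i = argmax_t P(t|c_i).

Enumerating repeated biterms directly reproduces the frequency weighting $n_d(b)/\sum_b n_d(b)$, since each occurrence contributes one uniform term.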
3.2 Topic Level Attention

We define the context data as $C = \{c_1, \dots, c_N\}$, and each sentence in $C$ as $c_i = \{x^{(i)}_1, \dots, x^{(i)}_M\}$. Given the sentence $c_i$ as input, an RNN first maps the input sequence $c_i$ to a fixed dimension vector $h^{(i)}_M$ as follows:
$$h^{(i)}_k = f(h^{(i)}_{k-1}, w^{(i)}_k),$$
where $w^{(i)}_k$ is the word vector of $x^{(i)}_k$, and $h^{(i)}_k$ represents the hidden state vector of the RNN at time $k$, which combines $w^{(i)}_k$ and $h^{(i)}_{k-1}$. We thus obtain the state representation set of the contexts $\{h^{(1)}, \dots, h^{(N)}\}$.

Then we use a high-level RNN to take the context state representation set $\{h^{(1)}, \dots, h^{(N)}\}$ as input, and obtain the high-level context representation vectors $s_k$:
$$s_k = f(s_{k-1}, h^{(k)}),$$
where $h^{(k)}$ is the vector representation of the $k$-th sentence, and $s_k$ represents the state vector of the high-level RNN at time $k$, which combines $h^{(k)}$ and $s_{k-1}$. We obtain the output of the high-level RNN at each step: $\{s_1, \dots, s_N\}$.

Given the context data $C = \{c_1, \dots, c_N\}$, we obtain the topic of each sentence, $T = \{t_1, \dots, t_N\}$, through the pre-trained BTM model. We define the attention weights as:
$$\alpha_i = \frac{E(t_i, c_i) \cdot E(t_N, c_N)}{|E(t_i, c_i)| \cdot |E(t_N, c_N)|},$$
where $E(t_i, c_i)$ is the sum of the word distribution of topic $t_i$ and the projected word distribution of context $c_i$, which is defined as the product of the word distribution of topic $t_i$ and the one-hot representation of context $c_i$.

Finally, we obtain the normalized attention weights $\alpha'_i$ and the context vector $S_N$ as:
$$\alpha'_i = \frac{\alpha_i}{\sum_{j=1}^N \alpha_j}, \qquad S_N = \sum_{i=1}^N \alpha'_i \times s_i.$$
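The attention computation above can be sketched as follows, assuming phi is the BTM topic-word matrix $P(w|t)$ of shape [K, V]; the names topics, bows and states are ours for illustration.

import numpy as np

def topic_level_attention(phi, topics, bows, states):
    """phi: [K, V]; topics: hard topic ids t_1..t_N; bows: one-hot context
    vectors, shape [N, V]; states: high-level RNN states s_1..s_N, [N, H]."""
    # E(t_i, c_i): topic word distribution plus the projected word distribution
    # of the context (elementwise product of phi[t_i] with the one-hot vector).
    E = np.stack([phi[t] + phi[t] * bow for t, bow in zip(topics, bows)])

    # Cosine similarity between each context and the current (last) context.
    e_cur = E[-1]
    alpha = E @ e_cur / (np.linalg.norm(E, axis=1) * np.linalg.norm(e_cur) + 1e-12)

    # Normalize the weights and form the history representation S_N.
    alpha = alpha / alpha.sum()
    S_N = (alpha[:, None] * states).sum(axis=0)
    return alpha, S_N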
3.3 Joint Learning Decoder

We conduct another RNN as the decoder to generate the response $Y = \{y_1, \dots, y_M\}$. Given the context vector $S_N$, the topic distribution $t_N$ of the current context, and the previously generated words $y_1, \dots, y_{i-1}$, the decoder predicts the probability of the next word $y_i$ by converting the joint probability into conditional probabilities through the chain rule of probability:
$$P_g(Y|C, T) = \prod_{i=1}^{M} P(y_i \mid y_1, \dots, y_{i-1}, S_N, t_N).$$
We use the topic distribution of the current context in the decoder because it supplies topic information to generate a more relevant response.

Given a set of training data $\mathcal{D} = \{(C, T, Y)\}$, STAR-BTM assumes that the data are conditionally independent samples from the probability $P_g$, and minimizes the following negative log likelihood objective:
$$\mathcal{L} = -\sum_{(C,T,Y) \in \mathcal{D}} \log P_g(Y|C, T),$$
where $C$ is the context, $T$ is the topic distribution of $C$, and $Y$ is the ground-truth response.
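A minimal sketch of this objective for one response is given below, using teacher forcing, with the decoder input concatenating the previous word embedding, $S_N$ and the topic distribution. The names f_dec, W_out, emb and bos_id are hypothetical stand-ins, not the authors' implementation.

import numpy as np

def decode_nll(f_dec, W_out, emb, S_N, t_N, target_ids, bos_id):
    """Accumulate -log P_g(Y|C,T) for one response under teacher forcing."""
    h = np.zeros(W_out.shape[0])                 # decoder hidden state
    prev = emb[bos_id]                           # start-of-sequence embedding
    nll = 0.0
    for y in target_ids:
        # Every step is conditioned on [previous word; history S_N; topic t_N].
        x = np.concatenate([prev, S_N, t_N])
        h = f_dec(x, h)                          # one RNN decoder step
        logits = h @ W_out                       # project to the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                     # softmax
        nll -= np.log(probs[y] + 1e-12)          # -log P(y_i | y_<i, S_N, t_N)
        prev = emb[y]
    return nll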
4 Experiments

In this section, we conduct experiments on the Chinese customer service dataset and the English Ubuntu conversation dataset to verify the effectiveness of our proposed method.

4.1 Experimental Settings

We first introduce the experimental settings, including datasets, baselines, parameter settings, and evaluation measures.

Datasets
We utilize two public multi-turn dialogue datasets in our experiments, which are widely used in the evaluation of the multi-turn dialogue generation task. The Chinese customer service dataset, named JDC, consists of 515,686 history-response pairs published by the JD contest (http://jddc.jd.com/). We randomly divided the corpus into training, validation and test sets, containing 500,000, 7,843 and 7,843 pairs, respectively. The Ubuntu conversation dataset, called Ubuntu [Lowe et al., 2015], is extracted from the Ubuntu Q&A forum. We utilize the official scripts (https://github.com/rkadlec/ubuntu-ranking-dataset-creator) for tokenizing, stemming and morphing, and remove duplicates and sentences whose length is less than 5 or greater than 50. Finally, we obtain 3,980,000, 10,000 and 10,000 history-response pairs for training, validation and testing, respectively.

Baseline Methods and Parameter Settings
We use seven baseline methods for comparison, including the traditional Seq2Seq [Sutskever et al., 2014], HRED [Serban et al., 2016], VHRED [Serban et al., 2017b], Weighted Sequence with Concat (WSeq) [Tian et al., 2017], the Hierarchical Recurrent Attention Network (HRAN) [Xing et al., 2018], the Hierarchical Hidden Variational Memory Network (HVMN) [Chen et al., 2018] and Relevant Context with Self-Attention (ReCoSa) [Zhang et al., 2019]. To fairly compare the topic-level attention model with the self-attention model, we also extend our STAR-BTM to the ReCoSa scenario, named ReCoSa-BTM, where the topic embedding is concatenated with the sentence representation.

For JDC, the Jieba tool is utilized for Chinese word segmentation, and the vocabulary size is set to 68,521. For Ubuntu, we set the vocabulary size to 15,000. To fairly compare our model with all baselines, the number of hidden nodes is set to 512 and the batch size to 32 for all models. The max sentence length is set to 50 and the max number of dialogue turns to 15. The number of topics in BTM is set to 8. We use Adam for gradient optimization, with the learning rate set to 0.0001. We run all models on a Tesla K80 GPU with TensorFlow.
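For reference, the hyper-parameters above can be summarized as the following illustrative configuration; the dictionary keys are ours, not names from the released code.

CONFIG = {
    "vocab_size": {"JDC": 68_521, "Ubuntu": 15_000},
    "hidden_size": 512,        # hidden nodes of all encoder/decoder RNNs
    "batch_size": 32,
    "max_sentence_len": 50,
    "max_dialogue_turns": 15,
    "num_topics": 8,           # K of the pre-trained BTM
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}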
Evaluation Measures

We use both quantitative evaluation and human judgements in our experiments. Specifically, we use the traditional indicators, i.e., PPL and BLEU [Xing et al., 2017], to evaluate the quality of the generated responses [Chen et al., 2018; Tian et al., 2017; Xing et al., 2018]. We also use the distinct value [Li et al., 2016a; Li et al., 2016b] to evaluate the degree of diversity of the generated responses; a small computation sketch follows at the end of this subsection. These measures are widely used in NLP and multi-turn dialogue generation tasks [Chen et al., 2018; Tian et al., 2017; Xing et al., 2018; Zhang et al., 2018c; Zhang et al., 2018a; Zhang et al., 2018b].

For human evaluation, given 300 randomly sampled contexts and the corresponding responses generated by all the models, we invited three annotators (all CS majored students) to compare the STAR-BTM model with the baseline methods, in terms of win and loss, based on the coherence of the generated response with respect to the contexts. In particular, the win tag indicates that the response generated by STAR-BTM is more relevant than that of the baseline model. In order to compare the informativeness of the responses generated by the models, we also require the annotators to label the informativeness of each model: if the response generated by STAR-BTM is more informative than the baseline's, the annotator labels 1, otherwise 0.
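As a reference for the diversity metric mentioned above, the sketch below computes distinct-n as the ratio of unique n-grams to all generated n-grams, which is our simplified reading of Li et al. [2016a].

def distinct_n(responses, n):
    """responses: list of token lists; returns #unique n-grams / #n-grams."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

# Example: distinct-1 and distinct-2 over two toy responses.
replies = [["yes", "i", "will", "apply", "for", "you"], ["yes", "yes"]]
print(distinct_n(replies, 1), distinct_n(replies, 2))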
4.2 Experimental Results

Experimental results on the two datasets are presented below.
Metric-based Evaluation
The metric-based evaluation results are shown in Table 2. From the results, we can see that the models which detect the relevant contexts, such as WSeq, HRAN, HVMN and ReCoSa, are superior to the traditional HRED baseline in terms of BLEU, PPL and distinct. This is mainly because these models further attend to the relevant context information rather than all the contexts in the optimization process. HRAN introduces the traditional attention mechanism to learn the important context sentences. HVMN utilizes the memory network to remember the context information. ReCoSa uses self-attention to detect the relevant contexts. But their effects are limited, since they do not consider the topical level relevance. Our proposed STAR-BTM and ReCoSa-BTM show good results. Taking the BLEU value on the JDC dataset as an example, STAR-BTM reaches 13.386, higher than all the baseline models. The distinct value of our model is also higher than that of the other baselines, indicating that our model can generate more diverse responses. We also conducted a significance test; the results show that the improvement of our model is significant on both the Chinese and English datasets. In summary, our STAR-BTM and ReCoSa-BTM models are able to generate higher quality and more diverse responses than the baselines. Our code is available at https://github.com/zhanghainan/STAR-BTM.

JDC Dataset
Model       PPL      BLEU     distinct-1  distinct-2
SEQ2SEQ     20.287   11.458   1.069       3.587
HRED        21.264   12.087   1.101       3.809
VHRED       22.287   11.501   1.174       3.695
WSeq        21.824   12.529   1.042       3.917
HRAN        20.573   12.278   -           -
HVMN        -        -        -           -
STAR-BTM    -        13.386   -           -
ReCoSa      17.282   13.797   1.135       6.590
ReCoSa-BTM  18.432   -        -           -

Ubuntu Dataset
Model       PPL      BLEU     distinct-1  distinct-2
SEQ2SEQ     104.899  0.4245   0.808       1.120
HRED        115.008  0.6051   1.045       2.724
VHRED       186.793  0.5229   1.342       2.887
WSeq        141.599  0.9074   1.024       2.878
HRAN        110.278  0.6117   1.399       3.075
HVMN        164.022  0.7549   1.607       3.245
STAR-BTM    -        -        -           -
ReCoSa      96.057   1.6485   1.718       3.768
ReCoSa-BTM  96.124   -        -           -

Table 2: The metric-based evaluation results (%).

JDC Dataset
model     STAR-BTM win (%)  loss (%)  inform. (%)  kappa
SEQ2SEQ   55.32             2.12      73.79        0.356
HRED      48.93             6.38      70.87        0.383
VHRED     48.94             8.51      69.98        0.392
WSeq      44.68             8.5       66.99        0.378
HRAN      34.04             10.64     60.19        0.401
HVMN      27.66             12.77     61.02        0.379
ReCoSa    25.34             20.71     55.63        0.358

Ubuntu Dataset
model     STAR-BTM win (%)  loss (%)  inform. (%)  kappa
SEQ2SEQ   51.46             3.88      72.60        0.398
HRED      48.54             6.80      71.23        0.410
VHRED     48.44             6.76      69.18        0.423
WSeq      40.78             6.80      67.80        0.415
HRAN      32.04             11.65     61.16        0.422
HVMN      25.24             13.59     60.27        0.414
ReCoSa    20.14             15.33     56.15        0.409

Table 3: The human evaluation results on JDC and Ubuntu.

Human Evaluation
The results of human evaluation are shown in Table 3. The percentages of win, loss and informativeness (inform.), as compared with the baseline models, are given to evaluate the quality and the informativeness of the responses generated by STAR-BTM; the kappa values [Fleiss, 1971] measure the agreement among the annotators. From the experimental results, the percentage of win is greater than that of loss, indicating that our STAR-BTM model is significantly better than the baseline methods. Taking JDC as an example, STAR-BTM obtains a preference gain (i.e., the win ratio minus the loss ratio) of 36.18%, 23.4%, 14.89% and 4.63%, respectively, as compared with WSeq, HRAN, HVMN and ReCoSa. In addition, the percentage of informativeness is more than 50 percent as compared with WSeq, HRAN, HVMN and ReCoSa, i.e., 66.99%, 60.19%, 61.02% and 55.63%, respectively.

Case Study
To facilitate a better understanding of our model, we present some examples in Table 4, and show the top 10 words of each topic in Table 5. From the results, we can see why the topic level attention model performs better than models only using word and sentence level representations. Taking example1 in Table 4 as an example, it is easy to generate common responses by using only the sentence level representation, such as 'What can I do for you?' and 'Yes'. However, our topic level attention model has the ability to generate more relevant and informative responses, such as 'Based on the submitted after-sales service form' and 'Yes, you need to apply for after-sales and select lack'. This is mainly because the topic level attention is able to associate some important information, such as '补发 (send a new one for a replacement)' and '售后 (after-sales)', by topic modeling, which is usually hard to capture with traditional word or sentence level similarities. These results indicate the advantage of modeling topic level relevance.

We also show the top 10 words of each topic from the BTM model on the two datasets, as shown in Table 5. Taking the JDC dataset as an example, we can see that the BTM model distinguishes eight topics, i.e., '配送 (shipping), 发票 (invoice), 退款 (refund), 售后 (after-sale), 催单 (reminder), 保价 (low-price), 缺货 (out-of-stock) and 感谢 (thanks)'. For each topic, the top 10 words represent the core information of the topic. Taking example1 in Table 4 as an example, since '补发 (send a new one for a replacement)' and '售后 (after-sales)' are the 15th and the second words of the same topic 4, respectively, the model can generate 'submitted after-sales service form' based on the topic level attention. In example2, the current context is about the 'gateway' with topic 'network', so the topic distribution can supply some additional topic information, such as 'restart', 'dhcp' and 'router'. In a word, our STAR-BTM and ReCoSa-BTM models can supply the critical topic information to improve the informativeness of the generated responses.

Example1
context1: 我收到货，一个镜子碎了 (I received the goods, a mirror was broken.)
context2: 可以申请售后 (You can apply for after-sale.)
context3: 可以安排补发吗？ (Can you send a new one for a replacement?)
response: 您已经申请售后了吗？ (Have you already applied for the after-sale?)
Seq2Seq: 有什么能帮到您的吗？ (What can I do for you?)
HRED: 亲爱哒，请您稍等哦，小红人这就帮您去查 (Dear, please wait, I will help you check.)
VHRED: 您好，为了更好的为您服务，请您提供下京东的订单号 (Hello, in order to better serve you, please provide the order number.)
WSeq: 可以的 (Yes)
HRAN: 这个订单是吗 (Is this order?)
HVMN: ORDERID 这个是吗？ (ORDERID, is this yours?)
ReCoSa: 可以的，为您申请呢 (Yes, I will apply for you!)
STAR-BTM: 以提交的售后服务单为准的 (Based on the submitted after-sales service form.)
ReCoSa-BTM: 可以的，需要申请售后换货，选择缺件 (Yes, you need to apply for after-sales and select lack.)

Example2
context1: I set up a dhcp server on my ubuntu machine. Client connect to it and get a dhcp address. However, those client cann't connect the network
context2: Have you set the default gateway up in the dhcpd config? Same subnet on all machine?
context3: Same subnet. How do I configure a default gateway?
response: Option router 192.168.129.10 in your dhcpd.conf on the dhcp server
Seq2Seq: I don't know
HRED: I don't know what I have to do with the same time
VHRED: I have a lot of UNK
WSeq: I don't know, but I don't know what I do that
HRAN: You can use the ubuntu server
HVMN: Subnet will work
ReCoSa: How about the dhcp server?
STAR-BTM: Try restart it
ReCoSa-BTM: Dhcp server with router

Table 4: The generated responses from the STAR-BTM model and the baselines on the JDC and Ubuntu datasets.

Topic  Top 10 words in the JDC dataset
1      订单 order, 配送 delivery, 请 please, 商品 item, 时间 time, 站点 site, 联系 contact, 电话 phone, 亲爱 dear, 地址 address
2      发票 invoice, 地址 address, 订单 order, 修改 modification, 电子 electronic, 开具 issue, 需要 need, 电话 phone, 号 number, 姓名 name
3      工作日 work day, 退款 refund, 订单 order, 账 account, 取消 cancellation, 申请 application, 支付 payment, 成功 success, 商品 goods, 请 please
4      申请 apply, 售后 after-sale, 点击 click, 端 end, 提交 submit, 客户服务 customer service, 审核 review, 链接 link, 返修 rework, 补发 replacement
5      订单 order, 站点 site, 时间 time, 日期 date, 下单 order, ORDERID, 编号 number, 催促 urging, 信息 information, 订单号 order number
6      商品 product, 金额 money, 保价 low-price, 姓名 name, 手机 mobile, 申请 apply, 快照 snapshot, 订单 order, 查询 inquiry, 请 please
7      查询 inquire, 帮 help, 调货 delivery, 问题 problem, 处理 deal, 缺货 out of stock, 订单号 order number, 提供 offer, 采购 purchase, 请 please
8      ！, 帮到 help, 谢谢 thank, 支持 support, 感谢您 thank you, 评价 evaluation, 客气 kind, 妹子 I, 请 please, 祝您 wish you

Topic  Top 10 words in the Ubuntu dataset
1      import, each, not, old, noth, would, than, of, thinic, retri
2      cover, adhoc, version, each, retri, alt, benefit, would, ubuntu, apt preferec
3      from, cover, alt, or, consid, ed, link, we, window, minut
4      run, desktop, cover, kick, distribut, browser, old, show, laptop, ars
5      each, show, instead, from, irc, over, saw, rpm, mockup, out
6      not, libxt-dev, big, a, by, reason, aha, cover, interest, !
7      896, on, system, cover, restart, not, urgent, violat, overst, ping
8      kxb, but, charg, alway, polici, f, my, aha, ugh, zealous

Table 5: The top 10 words of each topic from the BTM model on the JDC and Ubuntu datasets.

5 Conclusion

In this paper, we propose a new multi-turn dialogue generation model, namely STAR-BTM. The motivation comes from the fact that topic drift is a common phenomenon in multi-turn dialogue, while existing models usually use word or sentence level similarities to detect the relevant contexts, which ignore the topic level relevance. Our core idea is to utilize topic models to detect the relevant context information and generate a suitable response accordingly. Specifically, STAR-BTM first pre-trains a Biterm Topic Model on the whole training data, and then incorporates the topic level attention weights into the decoding process for generation. We conduct extensive experiments on both the Chinese customer services dataset and the English Ubuntu dialogue dataset. The experimental results show that our model significantly outperforms existing HRED models and their attention variants. Therefore, we conclude that topic-level information can be useful for improving the quality of multi-turn dialogue generation, given a proper topic model such as BTM.

In future work, we plan to further investigate the proposed STAR-BTM model. For example, some personal information can be introduced to supply more relevant information for personalized modeling. In addition, knowledge such as concerned entities can be considered in the relevant contexts to further improve the quality of the generated responses.
Acknowledgments
This work was supported by the Beijing Academy of Artificial Intelligence (BAAI) under Grants No. BAAI2019ZD0306 and BAAI2020ZJ0303, the National Natural Science Foundation of China (NSFC) under Grants No. 61722211, 61773362, 61872338 and 61906180, the Youth Innovation Promotion Association CAS under Grants No. 20144310 and 2016102, the National Key R&D Program of China under Grant No. 2016QY02D0405, the Lenovo-CAS Joint Lab Youth Scientist Project, the Foundation and Frontier Research Key Program of Chongqing Science and Technology Commission (No. cstc2017jcyjBX0059), and the Tencent AI Lab Rhino-Bird Focused Research Program (No. JR202033).

References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[Chen et al., 2018] Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. Hierarchical variational memory network for dialogue generation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1653–1662, 2018.

[Fleiss, 1971] Joseph L. Fleiss. Measuring nominal scale agreement among many raters. American Psychological Association, 1971.

[Li et al., 2016a] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In NAACL, 2016.

[Li et al., 2016b] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP, 2016.

[Li et al., 2017] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.

[Lowe et al., 2015] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. Computer Science, 2015.

[Mou et al., 2017] Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In ACL, 2017.

[Serban et al., 2016] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[Serban et al., 2017a] Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Serban et al., 2017b] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Sordoni et al., 2015] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 553–562, 2015.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.

[Tian et al., 2017] Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. How to make context more useful? An empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–236, 2017.

[Wu et al., 2017] Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.

[Xing et al., 2017] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic aware neural response generation. In AAAI, pages 3351–3357, 2017.

[Xing et al., 2018] Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou. Hierarchical recurrent attention network for response generation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Yan et al., 2013] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, pages 1445–1456. ACM, 2013.

[Zhang et al., 2018a] Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu, and Xueqi Cheng. Reinforcing coherence for sequence to sequence model in dialogue generation. In International Joint Conference on Artificial Intelligence, pages 4567–4573, 2018.

[Zhang et al., 2018b] Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu, and Xueqi Cheng. Tailored sequence to sequence models to different conversation scenarios. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1479–1488, 2018.

[Zhang et al., 2018c] Weinan Zhang, Yiming Cui, Yifa Wang, Qingfu Zhu, Lingzhi Li, Lianqiang Zhou, and Ting Liu. Context-sensitive generation of open-domain conversational responses. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2437–2447, 2018.

[Zhang et al., 2019] Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. ReCoSa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3721–3730, 2019.

[Zhou et al., 2016] Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. Multi-view response selection for human-computer conversation. In EMNLP, 2016.