Entity-Consistent End-to-end Task-Oriented Dialogue System with KB Retriever
Libo Qin, Yijia Liu, Wanxiang Che, Haoyang Wen, Yangming Li, Ting Liu
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
{lbqin, yjliu, car, hywen, yangmingli, tliu}@ir.hit.edu.cn
∗ Corresponding author.

Abstract
Querying the knowledge base (KB) has long been a challenge in the end-to-end task-oriented dialogue system. Previous sequence-to-sequence (Seq2Seq) dialogue generation work treats the KB query as an attention over the entire KB, without the guarantee that the generated entities are consistent with each other. In this paper, we propose a novel framework which queries the KB in two steps to improve the consistency of generated entities. In the first step, inspired by the observation that a response can usually be supported by a single KB row, we introduce a KB retrieval component which explicitly returns the most relevant KB row given a dialogue history. The retrieval result is further used to filter the irrelevant entities in a Seq2Seq response generation model to improve the consistency among the output entities. In the second step, we further perform the attention mechanism to address the most correlated KB column. Two methods are proposed to make the training feasible without labeled retrieval data, which include distant supervision and the Gumbel-Softmax technique. Experiments on two publicly available task-oriented dialogue datasets show the effectiveness of our model by outperforming the baseline systems and producing entity-consistent responses.
1 Introduction

Task-oriented dialogue systems, which help users achieve specific goals with natural language, are attracting more and more research attention. With the success of sequence-to-sequence (Seq2Seq) models in text generation (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015a; Nallapati et al., 2016b,a), several works tried to model the task-oriented dialogue as the Seq2Seq generation of a response from the dialogue
| Address | Distance | POI type | POI | Traffic info |
| 638 Amherst St | 3 miles | grocery store | Sigona Farmers Market | car collision nearby |
| 269 Alger Dr | 1 miles | coffee or tea place | Cafe Venetia | car collision nearby |
| 5672 barringer street | 5 miles | certain address | 5672 barringer street | no traffic |
| 200 Alester Ave | 2 miles | gas station | Valero | road block nearby |
| 899 Ames Ct | 5 miles | hospital | Stanford Childrens Health | moderate traffic |
| 481 Amaranta Ave | 1 miles | parking garage | Palo Alto Garage R | moderate traffic |
| 145 Amherst St | 1 miles | coffee or tea place | Teavana | road block nearby |
| 409 Bollard St | 5 miles | grocery store | Willows Market | no traffic |

Driver: Address to the gas station.
Car: Valero is located at 200 Alester Ave.
Driver: OK, please give me directions via a route that avoids all heavy traffic.
Car: Since there is a road block nearby, I found another route for you and I sent it on your screen.
Driver: Awesome, thank you.

Figure 1: An example of a task-oriented dialogue that incorporates a knowledge base (KB). The fourth row in the KB supports the second turn of the dialogue. A dialogue system will produce a response with conflicting entities if it includes the POI in the fourth row and the address in the fifth row, like "Valero is located at 899 Ames Ct".
history (Eric and Manning, 2017; Eric et al., 2017; Madotto et al., 2018). This kind of modeling scheme frees the task-oriented dialogue system from the manually designed pipeline modules and the heavy annotation labor for these modules. Different from typical text generation, successful conversations for a task-oriented dialogue system heavily depend on accurate knowledge base (KB) queries. Taking the dialogue in Figure 1 as an example, to answer the driver's query on the gas station, the dialogue system is required to retrieve entities like "200 Alester Ave" and "Valero". For task-oriented systems based on Seq2Seq generation, there is a trend in recent studies towards modeling the KB query as an attention network over the entire KB entity representations, hoping to learn a model that pays more attention to the relevant entities (Eric et al., 2017; Madotto et al., 2018; Reddy et al., 2018; Wen et al., 2018).

Though achieving good end-to-end dialogue generation with the over-the-entire-KB attention mechanism, these methods do not guarantee generation consistency regarding KB entities and sometimes yield responses with conflicting entities, like "Valero is located at 899 Ames Ct" for the gas station query (as shown in Figure 1). In fact, the correct address for Valero is 200 Alester Ave. A consistent response is relatively easy to achieve for conventional pipeline systems because they query the KB by issuing API calls (Bordes and Weston, 2017; Wen et al., 2017b,a), and the returned entities, which typically come from a single KB row, are consistently related to the object (like the "gas station") that serves the user's request. This indicates that a response can usually be supported by a single KB row. It is promising to incorporate this observation into the Seq2Seq dialogue generation model, since it encourages KB-relevant generation and prevents the model from producing responses with conflicting entities.

To achieve entity-consistent generation in the Seq2Seq task-oriented dialogue system, we propose a novel framework which queries the KB in two steps. In the first step, we introduce a retrieval module, the KB-retriever, to explicitly query the KB. Inspired by the observation that a single KB row usually supports a response, given the dialogue history and a set of KB rows, the KB-retriever uses a memory network (Sukhbaatar et al., 2015) to select the most relevant row. The retrieval result is then fed into a Seq2Seq dialogue generation model to filter the irrelevant KB entities and improve the consistency within the generated entities. In the second step, we further perform an attention mechanism to address the most correlated KB column. Finally, we adopt the copy mechanism to incorporate the retrieved KB entity.

Since dialogue datasets are not typically annotated with retrieval results, training the KB-retriever is non-trivial. To make the training feasible, we propose two methods: 1) we use a set of heuristics to derive the training data and train the retriever in a distantly supervised fashion; 2) we use Gumbel-Softmax (Jang et al., 2017) as an approximation of the non-differentiable selecting process and train the retriever along with the Seq2Seq dialogue generation model. Experiments on two publicly available datasets (Camrest (Wen et al., 2017b) and InCar Assistant (Eric et al., 2017)) confirm the effectiveness of the KB-retriever. The retrievers trained with distant supervision and with the Gumbel-Softmax technique both outperform the compared systems in automatic and human evaluations. Analysis empirically verifies our assumption that more than 80% of responses in the datasets can be supported by a single KB row, and that better retrieval results lead to better task-oriented dialogue generation performance.
2 Definition

In this section, we describe the input and output of the end-to-end task-oriented dialogue system and the definition of Seq2Seq task-oriented dialogue generation.
2.1 Dialogue History

Given a dialogue between a user ($u$) and a system ($s$), we follow Eric et al. (2017) and represent the $k$-turn dialogue utterances as $(u_1, s_1), (u_2, s_2), \ldots, (u_k, s_k)$. At the $i$-th turn of the dialogue, we aggregate the dialogue context, which consists of the tokens of $(u_1, s_1, \ldots, s_{i-1}, u_i)$, and use $x = (x_1, x_2, \ldots, x_m)$ to denote the whole dialogue history word by word, where $m$ is the number of tokens in the dialogue history.

2.2 Knowledge Base

In this paper, we assume access to a relational-database-like KB $B$, which consists of $|\mathcal{R}|$ rows and $|\mathcal{C}|$ columns. The value of the entity in the $j$-th row and the $i$-th column is denoted as $v_{j,i}$.

2.3 Seq2Seq Dialogue Generation

We define Seq2Seq task-oriented dialogue generation as finding the most likely response $y$ given the input dialogue history $x$ and KB $B$. Formally, the probability of a response is defined as
$$p(y \mid x, B) = \prod_{t=1}^{n} p(y_t \mid y_1, \ldots, y_{t-1}, x, B),$$
where $y_t$ represents an output token.

3 Our Framework

In this section, we describe our framework for end-to-end task-oriented dialogues. The architecture of our framework is demonstrated in Figure 2 and consists of two major components: a memory network-based retriever and the Seq2Seq dialogue generation with the KB-retriever. Our framework first uses the KB-retriever to select the most relevant KB row and further filters the irrelevant entities in a Seq2Seq response generation model to improve the consistency among the output entities. While decoding, we further perform the attention mechanism to choose the most probable KB column. We present the details of our framework in the following sections.

Figure 2: The workflow of our Seq2Seq task-oriented dialogue generation model with KB-retriever. For simplification, we draw the single-hop memory network instead of the multiple-hop one we use in our model. (The figure shows the memory network-based retriever performing KB row selection over the KB, the flattened and expanded retrieval results, and the decoder combining the vocabulary distribution with the retrieval-constrained KB distribution obtained from KB column selection.)
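Before moving to the model details, the following minimal Python sketch (ours, purely illustrative; all names and values are hypothetical) makes the notation of Sections 2.1-2.3 concrete: the dialogue history as a token sequence, the KB as rows of attribute-value cells, and the response probability factorized token by token.

```python
import numpy as np

# Dialogue history x: the tokens of (u1, s1, ..., s_{i-1}, u_i), word by word.
history = "address to the gas station".split()

# KB B: |R| rows and |C| columns; each cell holds an entity value v_{j,i}.
kb = [
    {"Address": "200 Alester Ave", "Distance": "2 miles",
     "POI type": "gas station", "POI": "Valero",
     "Traffic info": "road block nearby"},
    # ... remaining rows
]

def sequence_log_prob(step_distributions, token_ids):
    """log p(y|x,B) = sum_t log p(y_t | y_<t, x, B), one distribution per step."""
    return sum(np.log(dist[tok]) for dist, tok in zip(step_distributions, token_ids))

# Toy usage: two decoding steps over a 5-word vocabulary.
dists = [np.full(5, 0.2), np.array([0.7, 0.1, 0.1, 0.05, 0.05])]
print(sequence_log_prob(dists, [3, 0]))
```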
3.1 Encoder

In our encoder, we adopt a bidirectional LSTM (BiLSTM; Hochreiter and Schmidhuber, 1997) to encode the dialogue history $x$, which captures temporal relationships within the sequence. The encoder first maps the tokens in $x$ to vectors with an embedding function $\phi^{\text{emb}}$, and then the BiLSTM reads the vectors forwardly and backwardly to produce context-sensitive hidden states $(\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_m)$ by repeatedly applying the recurrence $\mathbf{h}_i = \text{BiLSTM}(\phi^{\text{emb}}(x_i), \mathbf{h}_{i-1})$.

3.2 Vanilla Attention-based Decoder

Here, we follow Eric et al. (2017) and adopt an attention-based decoder to generate the response word by word. An LSTM is also used to represent the partially generated output sequence $(y_1, y_2, \ldots, y_{t-1})$ as $(\tilde{\mathbf{h}}_1, \tilde{\mathbf{h}}_2, \ldots, \tilde{\mathbf{h}}_t)$. For the generation of the next token $y_t$, the model first calculates an attentive representation $\tilde{\mathbf{h}}'_t$ of the dialogue history as
$$u^t_i = W_2 \tanh(W_1 [\mathbf{h}_i, \tilde{\mathbf{h}}_t]), \quad a^t_i = \text{softmax}(u^t_i), \quad \tilde{\mathbf{h}}'_t = \sum_{i=1}^{m} a^t_i \cdot \mathbf{h}_i.$$
Then, the concatenation of the hidden representation of the partially outputted sequence $\tilde{\mathbf{h}}_t$ and the attentive dialogue history representation $\tilde{\mathbf{h}}'_t$ is projected to the vocabulary space $\mathcal{V}$ by $U$ as
$$o_t = U [\tilde{\mathbf{h}}_t, \tilde{\mathbf{h}}'_t],$$
to calculate the score (logit) for the next token generation. The probability of the next token $y_t$ is finally calculated as
$$p(y_t \mid y_1, \ldots, y_{t-1}, x, B) = \text{softmax}(o_t).$$
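As a concrete reference, here is a minimal PyTorch sketch of the encoder and one step of this attention-based decoder. It is our illustration of the equations above, not the authors' code; all dimensions, module names, and the toy inputs are assumptions.

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One decoding step of the vanilla attention-based decoder (Section 3.2)."""
    def __init__(self, enc_dim, dec_dim, attn_dim, vocab_size):
        super().__init__()
        self.W1 = nn.Linear(enc_dim + dec_dim, attn_dim)   # inner projection W_1[h_i, h~_t]
        self.W2 = nn.Linear(attn_dim, 1)                   # scoring vector W_2
        self.U = nn.Linear(enc_dim + dec_dim, vocab_size)  # projection U to vocabulary space

    def forward(self, enc_states, dec_state):
        # enc_states: (m, enc_dim) BiLSTM states h_1..h_m; dec_state: (dec_dim,) h~_t.
        m = enc_states.size(0)
        tiled = dec_state.expand(m, -1)                    # repeat h~_t for each h_i
        u = self.W2(torch.tanh(self.W1(torch.cat([enc_states, tiled], dim=-1)))).squeeze(-1)
        a = torch.softmax(u, dim=0)                        # attention weights a_i^t
        ctx = (a.unsqueeze(-1) * enc_states).sum(dim=0)    # attentive summary h~'_t
        logits = self.U(torch.cat([dec_state, ctx], dim=-1))  # logits o_t
        return torch.softmax(logits, dim=-1)               # p(y_t | y_<t, x, B)

# Toy usage: embed and encode a 5-token history with a BiLSTM, then decode one step.
emb = nn.Embedding(100, 16)
enc = nn.LSTM(16, 8, bidirectional=True, batch_first=True)  # outputs 2*8 = 16 dims
enc_out, _ = enc(emb(torch.randint(0, 100, (1, 5))))
step = AttnDecoderStep(enc_dim=16, dec_dim=12, attn_dim=10, vocab_size=100)
print(step(enc_out.squeeze(0), torch.zeros(12)).shape)  # torch.Size([100])
```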
3.3 Entity-Consistency Augmented Decoder

As shown in Section 3.2, the generation of tokens is based only on the dialogue history attention, which makes the model ignorant of the KB entities. In this section, we present how to query the KB explicitly in two steps to improve entity consistency. We first adopt the KB-retriever to select the most relevant KB row, and the generation of KB entities from the entity-augmented decoder is constrained to the entities within the most probable row, thus improving entity generation consistency. Next, we perform column attention to select the most probable KB column. Finally, we show how to use the copy mechanism to incorporate the retrieved entity while decoding.

3.3.1 KB Row Selection

In our framework, our KB-retriever takes the dialogue history and KB rows as inputs and selects the most relevant row. This selection process resembles the task of selecting one word from the inputs to answer questions (Sukhbaatar et al., 2015), and we use a memory network to model this process. In the following sections, we first describe how to represent the inputs, and then we describe our memory network-based retriever.
Dialogue History Representation: We encode the dialogue history by adopting the neural bag-of-words (BoW), following the original paper (Sukhbaatar et al., 2015). Each token in the dialogue history is mapped into a vector by another embedding function $\phi^{\text{emb}'}(x)$, and the dialogue history representation $q$ is computed as the sum of these vectors: $q = \sum_{i=1}^{m} \phi^{\text{emb}'}(x_i)$.

KB Row Representation: Each KB cell is represented by its cell-value embedding $c_{j,k} = \phi^{\text{value}}(v_{j,k})$, and the neural BoW is also used to represent a KB row $r_j$ as $r_j = \sum_{k=1}^{|\mathcal{C}|} c_{j,k}$.
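The two bag-of-words representations are simple sums of embeddings. A minimal numpy sketch, with hypothetical embedding tables standing in for $\phi^{\text{emb}'}$ and $\phi^{\text{value}}$, looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8
# Hypothetical embedding tables, keyed by token / entity-value string.
emb_hist = {w: rng.normal(size=EMB_DIM)
            for w in ["address", "to", "the", "gas", "station"]}
emb_value = {v: rng.normal(size=EMB_DIM)
             for v in ["200 Alester Ave", "2 miles", "gas station",
                       "Valero", "road block nearby"]}

# q = sum_i phi_emb'(x_i): neural bag-of-words over the dialogue history.
history = ["address", "to", "the", "gas", "station"]
q = np.sum([emb_hist[w] for w in history], axis=0)

# r_j = sum_k c_{j,k}: bag-of-words over the cell-value embeddings of row j.
row = ["200 Alester Ave", "2 miles", "gas station", "Valero", "road block nearby"]
r_j = np.sum([emb_value[v] for v in row], axis=0)
print(q.shape, r_j.shape)  # (8,) (8,)
```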
Memory Network-Based Retriever: We model the KB retrieval process as selecting the row that most likely supports the response generation. The memory network (Sukhbaatar et al., 2015) has been shown to be effective at modeling this kind of selection. For an $n$-hop memory network, the model keeps a set of input matrices $\{R^1, R^2, \ldots, R^{n+1}\}$, where each $R^i$ is a stack of $|\mathcal{R}|$ inputs $(r^i_1, r^i_2, \ldots, r^i_{|\mathcal{R}|})$. The model also keeps the query $q^1$ as the input. A single-hop memory network computes the probability $a_j$ of selecting the $j$-th input as
$$\pi = \text{softmax}((q^1)^T R^1), \quad o^1 = \sum_i \pi_i r^1_i, \quad a = \text{softmax}(W^{\text{mem}}(o^1 + q^1)).$$
For the multi-hop case, layers of single-hop memory networks are stacked and the query of the $(i+1)$-th layer network is computed as
$$q^{i+1} = q^i + o^i,$$
and the output of the last layer is used as the output of the whole network. For more details about the memory network, please refer to the original paper (Sukhbaatar et al., 2015).

After getting $a$, we represent the retrieval results as a 0-1 matrix $T \in \{0,1\}^{|\mathcal{R}| \times |\mathcal{C}|}$, where each element in $T$ is calculated as
$$T_{j,*} = \mathbb{1}[\,j = \arg\max_i a_i\,]. \quad (1)$$
In the retrieval result, $T_{j,k}$ indicates whether the entity in the $j$-th row and the $k$-th column is relevant to the final generation of the response. In this paper, we further flatten $T$ into a 0-1 vector $t \in \{0,1\}^{|\mathcal{E}|}$ (where $|\mathcal{E}|$ equals $|\mathcal{R}| \times |\mathcal{C}|$) as our row retrieval result.

3.3.2 KB Column Selection

After getting the retrieved row result that indicates which KB row is most relevant to the generation, we perform column attention at decoding time to select the probable KB column. For KB column selection, following Eric et al. (2017), we use the decoder hidden states $(\tilde{\mathbf{h}}_1, \tilde{\mathbf{h}}_2, \ldots, \tilde{\mathbf{h}}_t)$ to compute an attention score with the embedding of the column attribute name. The attention score, expanded to one score per KB entity ($c \in \mathbb{R}^{|\mathcal{E}|}$), then becomes the logits for the column to be selected, calculated as
$$c_j = W'_2 \tanh(W'_1 [k_j, \tilde{\mathbf{h}}_t]),$$
where $c_j$ is the attention score of the $j$-th KB column, $k_j$ is the word embedding of the KB column name, and $W'_1$ and $W'_2$ are trainable parameters of the model.

3.3.3 Decoder with Retrieved Entity

After the row selection and the column selection, we define the final retrieved KB entity score as the element-wise product of the row retrieval result and the column selection score:
$$v_t = t \ast c, \quad (2)$$
where $v_t$ indicates the final retrieved KB entity score. Finally, we follow Eric et al. (2017) and use the copy mechanism to incorporate the retrieved entity, which can be defined as
$$o_t = U [\tilde{\mathbf{h}}_t, \tilde{\mathbf{h}}'_t] + v_t,$$
where $o_t$'s dimensionality is $|\mathcal{V}| + |\mathcal{E}|$. In $v_t$, the lower $|\mathcal{V}|$ entries are zero and the remaining $|\mathcal{E}|$ entries are the retrieved entity scores.
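Putting Sections 3.3.1-3.3.3 together, the following numpy sketch (ours; the weights are random stand-ins, the column scores would really come from the decoder state, and we simplify to one memory matrix per hop) traces the two-step scoring: multi-hop row selection, the hard selection of Equation 1, and the entity scores of Equation 2.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def retrieve_row(q, R_list, W_mem):
    """Multi-hop memory-network row selection (Section 3.3.1).
    q: (d,) query; R_list: one (num_rows, d) memory per hop; W_mem: (num_rows, d)."""
    for R in R_list:
        pi = softmax(R @ q)            # pi = softmax((q^i)^T R^i)
        o = pi @ R                     # o^i = sum_j pi_j r^i_j
        a = softmax(W_mem @ (o + q))   # selection distribution of this hop
        q = q + o                      # q^{i+1} = q^i + o^i
    return a                           # output of the last hop

rng = np.random.default_rng(0)
num_rows, num_cols, d = 8, 5, 16
R_list = [rng.normal(size=(num_rows, d)) for _ in range(3)]  # three hops
W_mem = rng.normal(size=(num_rows, d))
a = retrieve_row(rng.normal(size=d), R_list, W_mem)

# Equation (1): hard one-hot row selection, flattened to t in {0,1}^{|R||C|}.
T = np.zeros((num_rows, num_cols))
T[np.argmax(a), :] = 1.0
t = T.flatten()

# Section 3.3.2 column scores (random stand-ins for the decoder attention),
# expanded to one score per KB entity, combined by Equation (2): v_t = t * c.
c = np.tile(rng.normal(size=num_cols), num_rows)
v = t * c
print(v.nonzero()[0])  # only entities in the selected row keep nonzero scores
```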
4 Training the KB-Retriever

As mentioned in Section 3.3.1, we adopt the memory network to model our KB-retriever. However, in Seq2Seq dialogue generation, the training data does not include annotated KB row retrieval results, which makes supervised training of the KB-retriever impossible. To tackle this problem, we propose two training methods for our KB-row-retriever. 1) In the first method, inspired by the recent success of distant supervision in information extraction (Zeng et al., 2015; Mintz et al., 2009; Min et al., 2013; Xu et al., 2013), we take advantage of the similarity between the surface strings of KB entries and the reference response, and design a set of heuristics to extract training data for the KB-retriever. 2) In the second method, instead of training the KB-retriever as an independent component, we train it along with the training of the Seq2Seq dialogue generation. To make the retrieval process in Equation 1 differentiable, we use Gumbel-Softmax (Jang et al., 2017) as an approximation of the argmax during training.
4.1 Training with Distant Supervision

Although it is difficult to obtain annotated retrieval data for the KB-retriever, we can "guess" the most relevant KB row from the reference response, and then obtain weakly labeled data for the retriever. Intuitively, since a dialogue usually belongs to one topic, the KB row that contains the largest number of entities mentioned in the whole dialogue should support the current utterance. In our training with distant supervision, we further simplify this assumption and assume that one dialogue, which usually belongs to one topic, can be supported by the single most relevant KB row. This means that for a $k$-turn dialogue, we construct $k$ pairs of training instances for the retriever, and all the inputs $(u_1, s_1, \ldots, s_{i-1}, u_i \mid i \leq k)$ are associated with the same weakly labeled KB retrieval result $T^*$.

In this paper, we compute each row's similarity to the whole dialogue and choose the most similar row as $T^*$. We define the similarity of each row as the number of matched spans between the surface forms of the entities in the row and the dialogue. Taking the dialogue in Figure 1 as an example, the similarity of the 4th row equals 4, with "200 Alester Ave", "gas station", "Valero", and "road block nearby" matching the dialogue context, while the similarity of the 7th row equals 1, with only "road block nearby" matching.

In our model with the distantly supervised retriever, the retrieval results serve as the input for the Seq2Seq generation. During training of the Seq2Seq generation, we use the weakly labeled retrieval result $T^*$ as the input.
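Below is a minimal sketch of this weak-labeling heuristic, assuming entities match by exact (case-insensitive) surface form; the authors' span-matching heuristics may differ in detail.

```python
# Score each KB row by how many of its entity surface forms appear as spans
# in the dialogue, and pick the best row as the weak label T* (Section 4.1).
def row_similarity(row_values, dialogue_text):
    """Count entity values of one row that appear verbatim in the dialogue."""
    return sum(1 for v in row_values if v.lower() in dialogue_text.lower())

def weak_label(kb_rows, dialogue_text):
    """Return the index of the most similar row (the weakly labeled T* row)."""
    scores = [row_similarity(row, dialogue_text) for row in kb_rows]
    return max(range(len(kb_rows)), key=lambda j: scores[j])

kb_rows = [
    ["638 Amherst St", "3 miles", "grocery store", "Sigona Farmers Market",
     "car collision nearby"],
    ["200 Alester Ave", "2 miles", "gas station", "Valero", "road block nearby"],
]
dialogue = ("Address to the gas station. Valero is located at 200 Alester Ave. "
            "Since there is a road block nearby, I found another route for you.")
print(weak_label(kb_rows, dialogue))  # 1 -> the Valero row (4 matched spans)
```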
4.2 Training with Gumbel-Softmax

In addition to treating the row retrieval result as an input to the generation model and training the KB-row-retriever independently, we can train it along with the training of the Seq2Seq dialogue generation in an end-to-end fashion. The major difficulty of such a training scheme is that the discrete retrieval result is not differentiable, so the training signal from the generation model cannot be passed to the parameters of the retriever. The Gumbel-Softmax technique (Jang et al., 2017) has been shown to be an effective approximation to the discrete variable and has been proved to work in sentence representation. In this paper, we adopt the Gumbel-Softmax technique to train the KB-retriever. We use
$$T^{\text{approx}}_{j,*} = \frac{\exp((\log(a_j) + g_j)/\tau)}{\sum_i \exp((\log(a_i) + g_i)/\tau)}$$
as the approximation of $T$, where the $g_j$ are i.i.d. samples drawn from $\text{Gumbel}(0, 1)$, which we sample by drawing $u \sim \text{Uniform}(0, 1)$ and computing $g = -\log(-\log(u))$, and $\tau$ is a constant that controls the smoothness of the distribution. $T^{\text{approx}}_{j,*}$ replaces $T_{j,*}$ in Equation 1 and goes through the same flattening and expanding process to get $v^{\text{approx}}_t$, and the training signal from the Seq2Seq generation is passed via the logits
$$o^{\text{approx}}_t = U [\tilde{\mathbf{h}}_t, \tilde{\mathbf{h}}'_t] + v^{\text{approx}}_t.$$
To make training with Gumbel-Softmax more stable, we first initialize the parameters by pre-training the KB-retriever with distant supervision and then fine-tune our framework.
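A numpy sketch of the relaxation above is given below; in the actual model the same computation runs inside the automatic-differentiation graph so that gradients reach the retriever parameters.

```python
import numpy as np

def gumbel_softmax_rows(a, tau=1.0, rng=np.random.default_rng(0)):
    """Differentiable stand-in for the argmax row selection of Equation 1.
    a: row probabilities from the retriever; returns a soft one-hot vector."""
    u = rng.uniform(size=a.shape)
    g = -np.log(-np.log(u))            # i.i.d. Gumbel(0, 1) samples
    logits = (np.log(a) + g) / tau     # temperature tau controls smoothness
    e = np.exp(logits - logits.max())
    return e / e.sum()

a = np.array([0.05, 0.7, 0.15, 0.1])   # retriever's row distribution
print(gumbel_softmax_rows(a, tau=0.5)) # peaked, but still differentiable
```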
4.3 Experimental Settings

We choose the InCar Assistant dataset (Eric et al., 2017), which includes three distinct domains: the navigation, weather, and calendar domains. For the weather domain, we follow Wen et al. (2018) and separate the highest temperature, lowest temperature, and weather attribute into three different columns. For the calendar domain, some dialogues come without a KB or with an incomplete KB; in this case, we pad these incomplete KBs with a special token "-". Our framework is trained separately on these three domains, using the same train/validation/test splits as Eric et al. (2017). To justify the generalization of the proposed model, we also use the public CamRest dataset (Wen et al., 2017b) and partition it into training, validation, and testing sets in the ratio 3:1:1. In particular, we hired human experts to format the CamRest dataset by equipping every dialogue with the corresponding KB.

All hyper-parameters, including the dimensionality of the embeddings, the number of LSTM hidden units, the dropout rate, and the batch size, are selected according to the validation set. We use a three-hop memory network to model our KB-retriever, and apply L2 regularization to reduce overfitting. For training the retriever with distant supervision, we adopt the weight-tying trick (Liu and Perez, 2017). We use Adam (Kingma and Ba, 2014) to optimize the parameters in our model and adopt the suggested hyper-parameters for optimization. We adopt both automatic and human evaluations in our experiments.

4.4 Baseline Models

We compare our model with several baselines, including:

• Attn seq2seq (Luong et al., 2015b): a model with simple attention over the input context at each time step during decoding.
• Ptr-UNK (Gulcehre et al., 2016): a model which augments a sequence-to-sequence architecture with an attention-based copy mechanism over the encoder context.
• KV Net (Eric et al., 2017): a model with an augmented decoder which decodes over the concatenation of the vocabulary and the KB entities, allowing the model to generate entities.
• Mem2Seq (Madotto et al., 2018): a model that takes the dialogue history and KB entities as input and uses a pointer gate to control whether to generate a vocabulary word or to select an input as the output.
• DSR (Wen et al., 2018): a model that leverages a dialogue state representation to retrieve the KB implicitly and applies a copy mechanism to retrieve entities from the knowledge base while decoding.

On the InCar dataset, for Attn seq2seq, Ptr-UNK, and Mem2Seq, we adopt the reported results from Madotto et al. (2018). On the CamRest dataset, for Mem2Seq, we use their open-sourced code to obtain the results, while for DSR, we run their code on the same dataset to obtain the results.

5 Results

Following the prior works (Eric et al., 2017; Madotto et al., 2018; Wen et al., 2018), we adopt BLEU and Micro Entity F1 to evaluate model performance. We obtain the BLEU and Entity F1 scores on the whole InCar dataset by mixing all generated responses and evaluating them together. The dataset is available at: https://github.com/yizhen20133868/Retriever-Dialogue. The experimental results are illustrated in Table 1.

In the first block of Table 1, we show the Human, Rule-Based, and KV Net results (marked with *), which are reported from Eric et al. (2017). We argue that their results are not directly comparable because their work uses the entities in their canonicalized forms, which are not calculated based on real entity values. It is worth noticing that our framework with both training methods still outperforms KV Net on the InCar dataset on the whole-dataset BLEU and Entity F1 metrics, which demonstrates the effectiveness of our framework. We adopt the same pre-processed dataset as Madotto et al. (2018); the experimental results differ slightly from the performance reported by Wen et al. (2018) because of their different utterance tokenization and entity normalization.

| Model | InCar BLEU | InCar F1 | Navigate F1 | Weather F1 | Calendar F1 | CamRest BLEU | CamRest F1 |
| Human* (Eric et al., 2017) | 13.5 | 60.7 | 55.2 | 61.6 | 64.3 | - | - |
| Rule-Based* (Eric et al., 2017) | 6.6 | 43.8 | 40.4 | 39.5 | 61.3 | - | - |
| KV Net* (Eric et al., 2017) | 13.2 | 48.0 | 41.3 | 47.0 | 62.9 | - | - |
| Attn seq2seq (Luong et al., 2015b) | 9.3 | 11.9 | 10.8 | 25.6 | 23.4 | - | - |
| Ptr-UNK (Gulcehre et al., 2016) | 8.3 | 22.7 | 14.9 | 26.7 | 26.9 | - | - |
| Mem2Seq (Madotto et al., 2018) | 12.6 | 33.4 | 20.0 | 32.8 | 49.3 | 16.6 | 42.4 |
| DSR (Wen et al., 2018) | 12.7 | 51.9 | 52.0 | 50.4 | 52.1 | 18.3 | 53.6 |
| Ours w/ distant supervision | | | | | | | |
| Ours w/ Gumbel-Softmax | | | | | | | |

Table 1: Comparison of our model with baselines.
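Micro Entity F1 aggregates entity matches over the whole test set before computing precision and recall. The cited papers use their own evaluation scripts; the following is a common formulation, under the assumption that gold and predicted entity lists are already extracted per response:

```python
def micro_entity_f1(gold_entities, pred_entities):
    """Micro-averaged entity F1 over a corpus.
    gold_entities / pred_entities: list of entity lists, one per response."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_entities, pred_entities):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # entities generated and in the gold response
        fp += len(pred_set - gold_set)   # spurious entities
        fn += len(gold_set - pred_set)   # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage: one correct entity, one conflicting entity -> F1 = 0.5.
print(micro_entity_f1([["Valero", "200 Alester Ave"]],
                      [["Valero", "899 Ames Ct"]]))
```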
In the second block of Table 1, we can see that our framework trained with either distant supervision or the Gumbel-Softmax technique beats all existing models on the two datasets. Our model outperforms each baseline on both the BLEU and F1 metrics. On the InCar dataset, our model with Gumbel-Softmax has the highest BLEU compared with the baselines, which shows that our framework can generate more fluent responses. In particular, our framework achieves a 2.5% improvement on the navigation domain, a 1.8% improvement on the weather domain, and a 3.5% improvement on the calendar domain on the F1 metric. This indicates the effectiveness of our KB-retriever module: our framework can retrieve more correct entities from the KB. On the CamRest dataset, the same trend of improvement is observed, which further shows the effectiveness of our framework.

Besides, we observe that the model trained with Gumbel-Softmax outperforms the one trained with the distant supervision method. We attribute this to the fact that the KB-retriever and the Seq2Seq module are fine-tuned in an end-to-end fashion, which can refine the KB-retriever and further promote the dialogue generation.
6 Analysis

6.1 The Proportion of Responses Supported by a Single KB Row

In this section, we verify our assumption by examining the proportion of responses that can be supported by a single row. We define a response as being supported by the most relevant KB row if all the responded entities are included in that row. We study the proportion of these responses over the test set. The number is 95% for the navigation domain, 90% for the CamRest dataset, and 80% for the weather domain. This confirms our assumption that most responses can be supported by a single relevant KB row.

Figure 3: Correlation between the number of KB rows and generation consistency on the navigation domain.
Correctly retrieving the supporting row should therefore be beneficial. We further study the weather domain to examine the remaining 20% of exceptions. Instead of being supported by multiple rows, most of these exceptions cannot be supported by any KB row. For example, there is one case whose reference response is "It's not rainy today", while the related KB entity is sunny. These cases provide challenges beyond the scope of this paper. If we consider this kind of case as being supported by a single row, the proportion in the weather domain is 99%.
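The support check itself is a simple set inclusion. A small sketch, under the assumption that entities are compared by surface form (not the authors' evaluation script):

```python
# Section 6.1 check: a response is "supported" by a KB row if every entity
# appearing in the response is contained in that row.
def supported_by_single_row(response_entities, kb_rows):
    return any(set(response_entities) <= set(row) for row in kb_rows)

kb_rows = [
    ["200 Alester Ave", "2 miles", "gas station", "Valero", "road block nearby"],
    ["899 Ames Ct", "5 miles", "hospital", "Stanford Childrens Health",
     "moderate traffic"],
]
print(supported_by_single_row(["Valero", "200 Alester Ave"], kb_rows))  # True
print(supported_by_single_row(["Valero", "899 Ames Ct"], kb_rows))      # False: conflict
```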
6.2 Generation Consistency

In this paper, we expect consistent generation from our model. To verify this, we compute the consistency recall of the utterances that have multiple entities. An utterance is considered consistent if it has multiple entities and these entities all belong to the same row, which we annotated with distant supervision.

The consistency results are shown in Table 2. From this table, we can see that incorporating the retriever in the dialogue generation improves consistency.
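A small sketch of this consistency recall, assuming the distantly annotated row is available for each utterance (names illustrative):

```python
# Over utterances with multiple entities, the fraction whose entities all lie
# in the row annotated by distant supervision (Section 6.2).
def consistency_recall(utterance_entities, annotated_rows):
    multi = [(ents, row) for ents, row in zip(utterance_entities, annotated_rows)
             if len(ents) > 1]
    consistent = sum(1 for ents, row in multi if set(ents) <= set(row))
    return consistent / len(multi) if multi else 0.0

rows = [["200 Alester Ave", "2 miles", "gas station", "Valero",
         "road block nearby"]] * 2
print(consistency_recall([["Valero", "200 Alester Ave"],
                          ["Valero", "899 Ames Ct"]], rows))  # 0.5
```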
6.3 Correlation Between the Number of KB Rows and Generation Consistency

To further explore the correlation between the number of KB rows and generation consistency, we conduct experiments with the distantly supervised retriever. We choose KBs with different numbers of rows, on a scale from 1 to 5, for the generation. From Figure 3, we can see a decrease in generation consistency as the number of KB rows increases. This indicates that irrelevant information harms the dialogue generation consistency.
6.4 Visualization

To gain more insight into how our retriever module influences the whole KB score distribution, we visualize the KB entity probability at the decoding position where we generate the entity 200 Alester Ave. From the example (Figure 4), we can see that the fourth row and the Address column have the highest probabilities for generating 200 Alester Ave, which verifies the effectiveness of first selecting the most relevant KB row and then selecting the most relevant KB column.
6.5 Human Evaluation

We provide a human evaluation of our framework and the compared models, based on responses to distinct dialogue histories. We hire several human experts and ask them to judge the quality of the responses according to correctness, fluency, and humanlikeness on a scale from 1 to 5. In each judgment, the expert is presented with the dialogue history, an output of a system with the name anonymized, and the gold response.

The evaluation results are illustrated in Table 2. Our framework outperforms the other baseline models on all metrics according to Table 2. The most significant improvement is in correctness, indicating that our model can retrieve accurate entities from the KB and generate the informative content that users want to know.
7 Related Work

Sequence-to-sequence (Seq2Seq) models in text generation (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015a; Nallapati et al., 2016b,a) have gained popularity and have been applied to open-domain dialogue (Vinyals and Le, 2015).
| Model | Cons. | Cor. | Flu. | Hum. |
| Copy Net | 21.2 | 4.14 | 4.40 | 4.36 |
| Mem2Seq | 38.1 | 4.29 | 4.29 | 4.27 |
| DSR | 70.3 | 4.59 | 4.71 | 4.65 |
| Ours w/ distant supervision | 65.8 | 4.53 | 4.71 | 4.64 |
| Ours w/ Gumbel-Softmax | | | | |

Table 2: The generation consistency and human evaluation on the navigation domain. Cons. represents Consistency; Cor. represents Correctness; Flu. represents Fluency; Hum. represents Humanlikeness.
Figure 4: KB score distribution. The distribution is shown for the timestep when generating the entity 200 Alester Ave for the response "Valero is located at 200 Alester Ave". (The visualized KB is the one from Figure 1, with the row "200 Alester Ave, 2 miles, gas station, Valero, road block nearby" receiving the highest scores.)

8 Conclusion

In this paper, we propose a novel framework to improve entity consistency by querying the KB in two steps. In the first step, inspired by the observation that a response can usually be supported by a single KB row, we introduce the KB-retriever to return the most relevant KB row, which is used to filter out irrelevant KB entities and encourage consistent generation. In the second step, we further perform an attention mechanism to select the most relevant KB column. Experimental results show the effectiveness of our method. Extensive analysis further confirms the observation and reveals the correlation between the success of the KB query and the success of task-oriented dialogue generation.
Acknowledgments
We thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011 and 61772153.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proc. of ICLR.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proc. of ACL.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proc. of SIGDial.

Mihail Eric and Christopher Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proc. of EACL.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proc. of ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In ICLR.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Bing Liu and Ian Lane. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Interspeech 2017.

Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proc. of ACL.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proc. of ACL.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proc. of ACL.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proc. of ACL.

Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016a. Sequence-to-sequence RNNs for text summarization.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016b. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. of SIGNLL.

Dinesh Raghu, Nikhil Gupta, and Mausam. 2019. Disentangling language and knowledge in task-oriented dialogs. In Proc. of NAACL.

Revanth Reddy, Danish Contractor, Dinesh Raghu, and Sachindra Joshi. 2018. Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. of AAAI.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In NIPS.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Haoyang Wen, Yijia Liu, Wanxiang Che, Libo Qin, and Ting Liu. 2018. Sequence-to-sequence learning for task-oriented dialogue with dialogue state representation. In Proc. of COLING.

Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve Young. 2017a. Latent intention dialogue models. In ICML.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017b. A network-based end-to-end trainable task-oriented dialogue system. In Proc. of EACL.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. arXiv preprint arXiv:1901.04713.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In Proc. of ACL.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proc. of EMNLP.