Learning to Select External Knowledge with Multi-Scale Negative Sampling

Huang He*, Hua Lu*†, Siqi Bao, Fan Wang, Hua Wu, Zhengyu Niu, Haifeng Wang
Baidu Inc., China
{hehuang, v_luhua01, baosiqi, wang.fan, wu_hua, niuzhengyu, wanghaifeng}@baidu.com

Abstract
Track-1 of DSTC9 aims to effectively answer user requests or questions during task-oriented dialogues, which are out of the scope of APIs/DB. By leveraging external knowledge resources, relevant information can be retrieved and encoded into the response generation for these out-of-API-coverage queries. In this work, we have explored several advanced techniques to enhance the utilization of external knowledge and boost the quality of response generation, including schema guided knowledge decision, negatives enhanced knowledge selection, and knowledge grounded response generation. To evaluate the performance of our proposed method, comprehensive experiments have been carried out on the publicly available dataset. Our approach was ranked best in the human evaluation of DSTC9 Track-1.

Introduction

Task-oriented dialogue agents have been widely used in our daily lives, such as digital personal assistants, customer service bots, and so on. Given that these agents generally rely on pre-defined APIs to provide services, they cannot handle user queries beyond the APIs' coverage. For instance, booking a seat at a restaurant can be handled with a pre-defined API. However, requesting the noise level at this restaurant might go beyond the API's coverage and lead to the failure of the system. Under these circumstances, users need to find the information themselves by visiting the website description, customer reviews, or FAQs. In fact, for most out-of-API-coverage queries, relevant information might already exist in external resources.

To tackle the above problem, external knowledge is extracted and employed to boost the capacity of task-oriented dialogue systems (Kim et al. 2020). To this end, an augmented dataset with external knowledge access is constructed based on MultiWOZ 2.1 (Eric et al. 2020). The conversation in MultiWOZ 2.1 is about touristic information seeking between a tourist and a clerk, confined by pre-defined APIs.
In the augmented dataset, out-of-API-coverage utterances with external knowledge access are inserted accordingly into the original conversation. (* First two authors contributed equally to this work.)

Distinct from open-domain conversations (Dinan et al. 2018; Lian et al. 2019; Fan et al. 2020), task-oriented conversation needs to deliver information accurately to satisfy the user's needs. Therefore, it faces more stringent requirements on knowledge selection and utilization. First, the system has to pick out the most accurate knowledge snippet from the large external database, not just relevant ones. For example, to respond to the user with the opening hours of a particular museum, references from other museums are not very useful. Second, to generate an accurate and coherent response, the system needs to carry out elaborate processing and reasoning over the retrieved knowledge snippet and the dialogue context.

In DSTC9 Track-1, retrieval-augmented response generation is split into three successive tasks. First, given the dialogue context, the system decides whether to trigger external knowledge access or not. Second, for the turns requiring external knowledge, the system selects the most appropriate knowledge snippets. Third, the system generates the responses given the dialogue context and selected knowledge. In this work, to enhance task-oriented dialogue generation, a complete solution is introduced for the three tasks. In Task1, we propose the schema guided knowledge decision, which takes the functions of APIs and external knowledge into consideration. In Task2, we introduce the negatives enhanced knowledge selection, which includes multi-scale negatives to increase the training difficulty and boost the selection performance. In Task3, to obtain coherent and accurate responses, we leverage powerful pre-trained models for knowledge grounded response generation.
To evaluate the performance of our proposed solution, comprehensive experiments have been carried out on the publicly available dataset. Our approach was ranked best in the human evaluation of DSTC9 Track-1.
Methodology

In this section, we discuss the following strategies in detail: schema guided knowledge decision, negatives enhanced knowledge selection, and knowledge grounded response generation.
Schema Guided Knowledge Decision

To determine whether to seek external knowledge or not, the most straightforward way is to rely on the dialogue context (Kim et al. 2020). However, such an approach typically captures the frequent keywords or semantic patterns in the context, which might lead to biased decisions and suffer from poor performance on unseen domain/locale conversations. In practice, the system is supposed to know the functions of APIs and external knowledge before making a choice between them. For the sake of a comprehensive decision, we take the dialogue context as well as the API/knowledge functions into consideration.

Figure 1: Some schema descriptions from MultiWOZ 2.2.
  service_name: hotel; description: hotel reservations and vacation stays
    Slots: hotel-pricerange (price budget of the hotel)
    Intents: find_hotel (search for a hotel to stay in)
  service_name: attraction; description: find touristy stuff to do around you
    Slots: attraction-area (area to search for attractions)
    Intents: find_attraction (search for places to see for leisure)

However, it is challenging to encode the structured APIs into the decision making process. Inspired by recent progress in schema description (Shah et al. 2019; Eric et al. 2020; Rastogi et al. 2020), we employ natural language descriptions to represent the functions of APIs. Some schema descriptions from MultiWOZ 2.2 (Zang, Rastogi, and Chen 2020) are illustrated in Figure 1, with supported slots and intents listed under each service/API. The schema descriptions are denoted as S = {s_1, s_2, ..., s_m}, where s_i is one slot/intent description from MultiWOZ 2.2. The external knowledge snippets are represented as K = {k_1, k_2, ..., k_n}, where k_i is the i-th knowledge snippet. The dialogue context is referred to as C_t = {u_1, u_2, ..., u_t}, where u_i is the i-th utterance in a multi-turn conversation and t is the current time step. In the schema guided knowledge decision, we estimate the probability p_decision(l_x = 1 | C_t, x), where x can be one schema description s_i or one knowledge snippet k_i.
l_x stands for the label of choosing x or not given the dialogue context. The input is fed into the transformer network in the following format: [CLS] C_t [SEP] x [SEP], and the hidden embedding of [CLS] in the last layer is used to estimate the above probability.

During the training process, mixed samples from schema descriptions and knowledge snippets are combined and learned to minimize the following loss:

L_decision = − Σ_i log p_decision(l_{x_i} = 1 | C_t, x_i) − Σ_j log p_decision(l_{x_j^−} = 0 | C_t, x_j^−)    (1)

For the conversational turns that rely on API services, positive samples x_i are collected from the corresponding slot/intent descriptions, and negative samples x_j^− include other schema descriptions and knowledge snippets. For the conversational turns that require external knowledge access, positive samples x_i are collected from the corresponding knowledge snippets, and negative samples x_j^− include other snippets and schema descriptions.

During inference, the knowledge-seeking turn is determined as follows. If the condition

max_{k_i ∈ K} p_decision(l_{k_i} = 1 | C_t, k_i) ≥ max_{s_i ∈ S} p_decision(l_{s_i} = 1 | C_t, s_i)

is met, the system consults the external knowledge snippets for subsequent response generation. Otherwise, the system keeps relying on the API services for response generation. Notably, besides knowledge-seeking turn detection, the estimated probability max_{k_i ∈ K} p_decision(l_{k_i} = 1 | C_t, k_i) can also be reused for knowledge selection.

Negatives Enhanced Knowledge Selection

Once the system determines to trigger external knowledge access, the next step is to select the appropriate knowledge snippet. In this section, we elaborate on the relevance estimation between each knowledge snippet and the dialogue context, p_selection(l_{k_i} = 1 | C_t, k_i). Usually, the relevance function is trained to separate positive samples from randomly selected negative samples.
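Both the decision rule and the relevance estimation reduce to scoring (context, candidate) pairs with p(l = 1 | C_t, x). A minimal sketch of the decision-time logic follows, with the transformer scorer abstracted behind a `score` callback; the toy word-overlap scorer is purely illustrative and is not the paper's model:

```python
from typing import Callable, List, Tuple

def knowledge_seeking_decision(
    score: Callable[[str, str], float],
    context: str,
    schema_descriptions: List[str],
    knowledge_snippets: List[str],
) -> Tuple[bool, str]:
    """Trigger knowledge access iff the best knowledge snippet outscores
    the best schema description; score(C_t, x) abstracts the [CLS]
    probability p_decision(l_x = 1 | C_t, x) of the transformer scorer."""
    best_k = max(knowledge_snippets, key=lambda k: score(context, k))
    best_s = max(schema_descriptions, key=lambda s: score(context, s))
    if score(context, best_k) >= score(context, best_s):
        return True, best_k   # consult external knowledge
    return False, best_s      # keep relying on API services

def overlap_score(context: str, candidate: str) -> float:
    """Toy relevance score (word overlap), a stand-in for the model."""
    return float(len(set(context.split()) & set(candidate.split())))
```

The same scorer interface serves selection, where the best-scoring snippet is simply returned.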
Given that the space of negative samples is extremely large, random sampling might lead to coarse-grained class separation, which is insufficient for fine-grained knowledge selection. Recently, it has been recognized that the selection of negative samples is crucial for boosting the capacity of retrieval systems (Henderson et al. 2017; Karpukhin et al. 2020). In this task, we include multi-scale negatives to strengthen the ability of fine-grained relevance estimation. The training objective is to minimize the following loss:

L_selection = − log p_selection(l_{k_i} = 1 | C_t, k_i) − Σ_j log p_selection(l_{k_{i,j}^−} = 0 | C_t, k_{i,j}^−)    (2)

The negative samples k_{i,j}^− are collected at distinct scales to increase the training difficulty: (1) Random: one knowledge snippet is randomly selected from the whole set K; (2) In-Domain: one knowledge snippet is randomly selected from those within the same domain as the positive sample k_i; (3) In-Entity: one knowledge snippet is randomly selected from those belonging to the same entity as the positive sample; (4) Cross-Entity: one knowledge snippet is randomly selected from those belonging to an entity mentioned in the dialogue context. The training difficulty increases along with the refinement of the negative sample's granularity. During training, the ratio of positive to negative training samples is 1:4. During inference, the optimal knowledge snippet is selected as follows:

k* = argmax_{k_i ∈ K} p_selection(l_{k_i} = 1 | C_t, k_i)    (3)
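The four sampling scales above can be sketched as follows. The metadata fields (`domain`, `entity`) and the fallback to the full pool when a scale is empty are assumptions of this sketch; the paper only specifies the four scales and the 1:4 ratio:

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Snippet:
    domain: str   # e.g. "hotel"
    entity: str   # e.g. "Alpha Inn and Suites"
    text: str

def multi_scale_negatives(
    positive: Snippet,
    pool: List[Snippet],
    context_entities: List[str],
    rng: random.Random,
) -> List[Snippet]:
    """Collect one negative per scale (Random, In-Domain, In-Entity,
    Cross-Entity), giving the 1:4 positive/negative training ratio."""
    candidates = [s for s in pool if s is not positive]
    in_domain = [s for s in candidates if s.domain == positive.domain]
    in_entity = [s for s in candidates if s.entity == positive.entity]
    cross_entity = [s for s in candidates
                    if s.entity in context_entities
                    and s.entity != positive.entity]
    negatives = []
    for scale in (candidates, in_domain, in_entity, cross_entity):
        # Fall back to the full candidate pool when a scale is empty.
        negatives.append(rng.choice(scale if scale else candidates))
    return negatives
```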
Figure 2: Knowledge grounded response generation. The input sequence ([CLS] knowledge [SEP] context [SEP] response [SEP]) is processed by stacked transformer blocks. Orange lines denote bi-directional attention, and blue lines denote uni-directional attention.
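The attention pattern in Figure 2 can be sketched as a mask over token positions, assuming the common prefix-LM layout in which source tokens (knowledge plus context) attend bidirectionally while response tokens attend causally:

```python
from typing import List

def attention_mask(n_src: int, n_resp: int) -> List[List[int]]:
    """Prefix-LM mask: source positions (knowledge + context, n_src
    tokens) attend bidirectionally to each other; response positions
    attend to all source tokens and only to earlier (and current)
    response tokens, enabling auto-regressive decoding.
    mask[i][j] == 1 means position i may attend to position j."""
    n = n_src + n_resp
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_src:
                mask[i][j] = 1 if j < n_src else 0   # bi-directional over source
            else:
                mask[i][j] = 1 if j <= i else 0      # causal over response
    return mask
```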
Knowledge Grounded Response Generation

There are two basic requirements for knowledge grounded response generation. First, the generated response needs to be coherent with the dialogue context, maintaining a smooth conversation flow. Second, the response needs to express the information accurately, without deviating from the original knowledge snippet. Although there are several options for producing the response, it is not easy to meet both requirements simultaneously. For instance, an open-domain chatbot is capable of producing a coherent response without external knowledge, but incapable of providing the necessary information to the user. Another way is to forward the retrieved knowledge snippet directly as the reply, resulting in an incoherent conversation flow. One compromise solution is to employ machine comprehension techniques to extract spans from the knowledge snippet as the response. However, this still might fail when the accurate answer is not contained within the surface contents. As such, it is challenging to utilize knowledge accurately and generate high-quality responses.

In this paper, we leverage powerful pre-trained models for knowledge grounded response generation. The network infrastructure is sketched in Figure 2. The backbone of the pre-training network consists of transformer blocks. The input to the network is the sum of the following four representations (Devlin et al. 2019; Bao et al. 2020a).
• Token Embedding. Following the conventional pre-processing, the input text is tokenized into byte-pair-encoding (BPE) tokens (Sennrich, Haddow, and Birch 2016).
• Segment Embedding. To better differentiate the input information, distinct segment embeddings are assigned to the knowledge snippet, dialogue context, and response.
• Role Embedding. As the multi-turn conversation is interactive, role embeddings are employed to distinguish the utterances from different speakers.
• Position Embedding. To obtain better extensibility on the input length, relative position embeddings are adopted in this generation network.

As for the self-attention mechanism, bi-directional attention is enabled for better natural language understanding, and uni-directional attention is employed for auto-regressive
response generation. The training objective is to minimize the negative log-likelihood (NLL) loss:

L_generation = − E log p_generation(r | C_t, k)    (4)

where r refers to the target response. During training, we utilize the golden knowledge snippet k̃ for response generation. During inference, we rely on the knowledge snippet k* retrieved with Equation (3).

Experiments

In DSTC9 Track-1, one augmented dataset (Kim et al. 2020) is constructed based on MultiWOZ 2.1 (Eric et al. 2020). This dataset is about touristic information seeking between a tourist and a clerk. Besides the conventional API-related utterances, out-of-API-coverage utterances are inserted with external knowledge access. The detailed statistics of the augmented dataset are summarized in Table 1. To evaluate the generalization ability of task-oriented dialogue systems, some unseen conversations from new domains or locales are included in the test set.

Table 1: Dataset statistics of DSTC9 Track-1.

DSTC9 Track-1 focuses on the conversational utterances that require external knowledge access. The evaluation covers the following three successive tasks.
• Task1: Knowledge-seeking Turn Detection. The model needs to decide whether to trigger external knowledge access or not, given the dialogue context. The evaluation metrics of this task include precision, recall, and F1.
• Task2: Knowledge Selection. For the conversational turns that require external knowledge access, the model needs to select appropriate knowledge snippets. The evaluation metrics of this task include MRR@5 (Voorhees 1999), Recall@1, and Recall@5.
• Task3: Response Generation. The model needs to produce the responses given the dialogue context and selected knowledge snippets. The automatic evaluation metrics of this task include BLEU-1/2/3/4 (Papineni et al. 2002), METEOR (Denkowski and Lavie 2014), and ROUGE-1/2/L (Lin 2004).

In the experiments, we leverage large-scale pre-training models to boost the performance of the three tasks.
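The per-turn Task2 metrics can be computed as below; the official evaluation typically averages these values over all knowledge-seeking turns (an assumption of this sketch, since the scoring script itself is not reproduced here):

```python
from typing import Sequence

def mrr_at_k(ranked: Sequence[str], gold: str, k: int = 5) -> float:
    """Reciprocal rank of the gold snippet within the top-k candidates;
    0.0 if the gold snippet is not ranked in the top k."""
    top = list(ranked[:k])
    return 1.0 / (top.index(gold) + 1) if gold in top else 0.0

def recall_at_k(ranked: Sequence[str], gold: str, k: int) -> float:
    """1.0 if the gold snippet appears in the top-k candidates, else 0.0."""
    return 1.0 if gold in list(ranked[:k]) else 0.0
```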
The models are pre-trained with 684M (context, response) training samples extracted from Reddit. The vocabulary contains 8k BPE subwords constructed with the SentencePiece library. The pre-training process is carried out via curriculum learning
Table 2: Experimental results on the validation set, with the highest value written in bold.
as in PLATO-2 (Bao et al. 2020b). In the first stage, a generation model is trained to minimize the negative log-likelihood (NLL) loss. In the second stage, an evaluation model is further trained to minimize the sentence-order prediction (SOP) loss (Lan et al. 2019). The evaluation model is used for the fine-tuning of Task1 and Task2, and the generation model is employed for the fine-tuning of Task3. All the models have 32 transformer blocks and 32 attention heads, with a hidden embedding dimension of 2048.

Table 3: Experimental results on the test set, with the highest value written in bold.
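The SOP objective used in the second stage can be sketched as a pair-construction step in the style of ALBERT: keep the two segments in order (label 1) or swap them (label 0). The 0.5 swap probability and the exact segmentation are assumptions of this sketch:

```python
import random
from typing import Tuple

def sop_example(context: str, response: str,
                rng: random.Random) -> Tuple[Tuple[str, str], int]:
    """Build one sentence-order prediction (SOP) training example:
    keep the (context, response) pair in order with label 1, or swap
    the two segments with label 0, each with probability 0.5. The
    evaluation model is trained to predict the label."""
    if rng.random() < 0.5:
        return (context, response), 1
    return (response, context), 0
```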
During the competition, each team is allowed to submit at most five entries. The methods used in each entry are described as follows:
• Entry 0. In Task1, the knowledge-seeking turn detection is estimated based on the dialogue context. In Task2, the knowledge selection is enhanced with multi-scale negatives training. In Task3, the response is generated using beam search, with a beam size of 5.
• Entry 1. In Task1, the proposed schema guided knowledge decision is adopted for knowledge-seeking turn detection. The settings of Task2 and Task3 are the same as entry 0.
• Entry 2. Model ensemble is carried out for both Task1 and Task2. In Task3, the response is generated using beam search, with a beam size of 5.
• Entry 3. The model ensemble for Task1 and Task2 is the same as entry 2. In Task3, the response is generated using beam search, with a beam size of 3.
• Entry 4. The model ensemble for Task1 and Task2 is the same as entry 2. In Task3, the response is extracted directly from the retrieved knowledge snippet.

The Task1 model of entry 0 can be represented as SOP-32L-Context in short, where SOP-32L refers to the pre-trained 32L evaluation model optimized with the SOP loss, and the knowledge-seeking turn detection is estimated based on the dialogue context. Similarly, the Task1 model of entry 1 is referred to as SOP-32L-Schema, and the Task2 model of entries 0-1 as SOP-32L-Selection. To carry out model ensemble, extra pre-trained models are employed, including SOP-24L, NSP-24L, BERT-base (Devlin et al. 2019), and ALBERT-xlarge (Lan et al. 2019). For Task1 of entries 2-4, the following component approaches are used for ensemble via majority voting: SOP-32L-Context, SOP-32L-Schema, SOP-24L-Context, SOP-24L-Schema, NSP-24L-Context, ALBERT-xlarge-Schema, and BERT-base-Schema.
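The two ensemble rules (majority voting for Task1, probability averaging for Task2) can be sketched as follows; the tie-breaking behavior (a tie does not trigger knowledge access) is an assumption of this sketch:

```python
from typing import List, Sequence

def majority_vote(decisions: Sequence[bool]) -> bool:
    """Task1 ensemble: trigger knowledge access iff a strict majority
    of the component models vote to trigger it."""
    return sum(decisions) * 2 > len(decisions)

def average_selection(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Task2 ensemble: average each snippet's selection probability
    across the component models (one inner list per model)."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]
```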
For Task2 of entries 2-4, the following approaches are combined to calculate the average selection probability: SOP-32L-Selection, SOP-32L-Schema, SOP-24L-Selection, NSP-24L-Selection, and ALBERT-xlarge-Schema. (The 24L models have 24 transformer blocks and 16 attention heads, with a hidden embedding dimension of 1024. Besides SOP, next-sentence-prediction (NSP) is another commonly used pre-training loss function. BERT and ALBERT are included for the sake of distribution diversity. These component models are selected through experiments on the validation set, with the objective of maximizing the knowledge selection metric Recall@1.)

Among these five entries, entry 1 is the full version of our proposed solution. The comparison between entry 0 and entry 1 reflects the difference between the schema guided knowledge decision and the conventional dialogue context-based decision. The comparison between entry 1 and entry 2 exhibits the improvements brought by model ensemble in Task1 and Task2. The last three entries examine distinct strategies for response generation.

The experimental results on the validation set and the test set are summarized in Table 2 and Table 3, with the highest value written in bold. The official baseline method is based on GPT-2 (Radford et al. 2019), using the dialogue context for turn detection and top-p sampling for response generation (Holtzman et al. 2019). There are 24 teams participating in the competition, and our team id is 19. Several interesting phenomena can be observed from these results. 1) The dialogue context-based knowledge decision works well on the validation set. In comparison, the schema guided knowledge decision obtains superior performance on the test set, demonstrating better generalization on unseen conversations from new domains or locales. 2) The obvious gap between the baseline and the proposed method indicates that the negatives enhanced strategy brings a significant improvement in knowledge selection. 3) As for response generation, although directly selecting the knowledge snippet as the reply ensures information accuracy, it achieves inferior performance compared with generation based methods.

Besides the objective evaluation, human evaluation has been carried out for the final ranking. For the top 12 teams, the entry with the best objective evaluation results is selected for the final human evaluation. Two metrics are considered in the evaluation: appropriateness and accuracy. Appropriateness measures how well the system response connects naturally to the conversation. Accuracy assesses how accurate the system's response is given the reference knowledge. The evaluation score ranges from 1 to 5; the higher, the better. The final evaluation results on the test set are summarized in Table 4. Our proposed approach obtains 1st place in the final ranking. The golden responses are also enrolled in the final evaluation. The small gap between our approach and the ground truth suggests that the system is able to provide high-quality and human-like services via external knowledge access. (Our source code and trained models will be released at https://github.com/PaddlePaddle/Knover.)

Table 4: Final human evaluation on the test set, with the highest value written in bold.

Figure 3: Cherry-picked examples.
  Context: "I need a 4-star hotel in the financial district of San Francisco" ... "Not at the moment, I need to know if they speak French."
  Knowledge: Entity: Hilton San Francisco Financial District; Q: What type of language is spoken at the hotel? A: Languages spoken are english, spanish and chinese.
  Response: "The hotel does not speak French. Would you like to make a reservation?"

  Context: "What is the zip code for the Alpha Inn and Suites in Marina?" ... "Thanks. Does its front desk open also at late night? I would need to check after midnight."
  Knowledge: Entity: Alpha Inn and Suites; Q: Does the front desk at Alpha Inn and Suites open for 24 hours? A: The frontdesk is not open for 24 hours at the Alpha Inn and Suites.
  Response: "The Alpha Inn and Suites does not have 24 hour front desk service. Would you like me to make a reservation for you?"

Figure 4: Examples with issues.
  Context: "are there any landmark public markets in the San Francisco area" ... "What are the opening hours like at Ferry Building Marketplace?"
  Knowledge: Entity: Ferry Building Marketplace; Q: What are the business hours of Ferry Building Marketplace? A: Ferry Building Marketplace is open from 10:00 am to 6:00 pm every day. On Saturday it is open from 9:00 am to 6:00 pm. On Sunday it is open from 11:00 am to 5:00 pm.
  Response: "The Ferry Building Marketplace is open from 10:00 am to 6:00 pm every day. On Sunday it is open from 11:00 am to 5:00 pm. Anything else I can do for you?"

  Context: "What type of car is it? Can I have the contact phone number?" ... "Will I need to pay a cash tip, or is that included in the fee?"
  Knowledge: Entity: Taxi; Q: What types of payments do you accept? A: We take cash, Visa, Master Card and major debit cards.
  Response: "Cash is not required, Visa, Master Card and major debit cards are accepted. Is there anything else I can help you with?"
Case Analysis

To further analyze the quality of the proposed approach, several good and bad cases are provided in Figure 3 and Figure 4. From the cherry-picked examples shown in Figure 3, it can be observed that the model is able to select the most appropriate knowledge snippet from the large-scale external database and generate high-quality knowledge grounded responses. In the upper case, the accurate answer to the user's query is not included in the surface contents of the knowledge snippet. The model still generates an accurate reply, exhibiting the ability of natural language inference to some extent.

For the examples in Figure 4, the issues mainly come from two aspects. First, the model fails to select the most appropriate knowledge snippet due to the deficiency of training samples for some patterns. This issue might be alleviated through more advanced retrieval techniques or response generation conditioned on multiple knowledge snippets. Second, for complicated knowledge snippets with multiple segments, the model sometimes omits a minor part and fails to generate complete information. This issue might be alleviated through the combination of extraction and generation in the near future.
Related Work

Related work is discussed on task-oriented dialogue systems and knowledge grounded response generation.

Task-oriented dialogue systems interact with users in natural language and help them accomplish tasks, such as setting an alarm clock, booking a taxi, or reserving a table. Conventional systems (Young et al. 2013; Henderson, Thomson, and Williams 2014; Wen et al. 2015) adopt a modular architecture, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG) modules. Recently, some end-to-end neural models (Wen et al. 2017; Li et al. 2017; Ham et al. 2020) have been introduced for task-oriented dialogue systems. Regardless of the modular or end-to-end architecture, these systems need to operate within the scope of pre-defined APIs and cannot handle out-of-range queries. Considering that relevant information about these queries might already exist on the internet, a new way is paved to incorporate external knowledge into task-oriented conversation modeling (Kim et al. 2020).

To improve the informativeness of social conversations, some approaches (Dinan et al. 2018; Lian et al. 2019; Fan et al. 2020) have explored knowledge grounded dialogue generation, where relevant knowledge segments are retrieved and encoded into a memory network. As there exists a one-to-many mapping phenomenon in open-domain social conversations (Zhao, Zhao, and Eskenazi 2017; Kim, Ahn, and Kim 2019; Bao et al. 2020a), multiple knowledge segments might be appropriate for producing coherent responses. In comparison, task-oriented conversation modeling needs to deliver precise information to satisfy the user's needs and therefore faces more stringent requirements on knowledge selection and utilization. In this work, we have explored several advanced techniques to enhance task-oriented dialogue generation via external knowledge access.
Conclusion

To boost the capacity of task-oriented dialogue systems, we have explored several advanced techniques in this work, including schema guided knowledge decision, negatives enhanced knowledge selection, and knowledge grounded response generation. Comprehensive experiments have been carried out on the publicly available dataset. Experimental results demonstrate that the schema guided knowledge decision achieves better generalization on unseen conversations, and the negatives enhanced knowledge selection brings significant improvements. More coherent and accurate knowledge grounded responses are generated by leveraging powerful pre-trained models. Compared with other state-of-the-art methods, our approach obtains superior performance and ranks 1st in the final evaluation.
Acknowledgments
We would like to thank the reviewers for their constructive suggestions; Jingzhou He and Tingting Li for the help on resource coordination; and Wenquan Wu and Han Zhou for the helpful discussions. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).
References
Bao, S.; He, H.; Wang, F.; Wu, H.; and Wang, H. 2020a. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 85–96.

Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H.; Wu, W.; Guo, Z.; Liu, Z.; and Xu, X. 2020b. PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. arXiv preprint arXiv:2006.16779.

Denkowski, M.; and Lavie, A. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the 9th Workshop on Statistical Machine Translation, 376–380.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.

Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and Weston, J. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. In International Conference on Learning Representations.

Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; and Hakkani-Tur, D. 2020. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of the 12th Language Resources and Evaluation Conference, 422–428.

Fan, A.; Gardent, C.; Braud, C.; and Bordes, A. 2020. Augmenting Transformers with KNN-Based Composite Memory for Dialogue. arXiv preprint arXiv:2004.12744.

Ham, D.; Lee, J.-G.; Jang, Y.; and Kim, K.-E. 2020. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 583–592.

Henderson, M.; Al-Rfou, R.; Strope, B.; Sung, Y.-H.; Lukács, L.; Guo, R.; Kumar, S.; Miklos, B.; and Kurzweil, R. 2017. Efficient Natural Language Response Suggestion for Smart Reply. arXiv preprint arXiv:1705.00652.

Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The Second Dialog State Tracking Challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 263–272.

Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2019. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations.

Karpukhin, V.; Oğuz, B.; Min, S.; Wu, L.; Edunov, S.; Chen, D.; and Yih, W.-t. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6769–6781.

Kim, B.; Ahn, J.; and Kim, G. 2019. Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. In International Conference on Learning Representations.

Kim, S.; Eric, M.; Gopalakrishnan, K.; Hedayatnia, B.; Liu, Y.; and Hakkani-Tur, D. 2020. Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 278–289.

Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.

Li, X.; Chen, Y.-N.; Li, L.; Gao, J.; and Celikyilmaz, A. 2017. End-to-End Task-Completion Neural Dialogue Systems. In Proceedings of the 8th International Joint Conference on Natural Language Processing, 733–743.

Lian, R.; Xie, M.; Wang, F.; Peng, J.; and Wu, H. 2019. Learning to Select Knowledge for Response Generation in Dialog Systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 5081–5087.

Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI.

Rastogi, A.; Zang, X.; Sunkara, S.; Gupta, R.; and Khaitan, P. 2020. Schema-Guided Dialogue State Tracking Task at DSTC8. arXiv preprint arXiv:2002.01359.

Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–1725.

Shah, D.; Gupta, R.; Fayazi, A.; and Hakkani-Tur, D. 2019. Robust Zero-Shot Cross-Domain Slot Filling with Example Values. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5484–5490.

Voorhees, E. M. 1999. The TREC-8 Question Answering Track Report. In Proceedings of TREC-8, volume 99, 77–82.

Wen, T.-H.; Gasic, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1711–1721.

Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gasic, M.; Barahona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 438–449.

Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-Based Statistical Spoken Dialog Systems: A Review. In Proceedings of the IEEE, volume 101, 1160–1179.

Zang, X.; Rastogi, A.; and Chen, J. 2020. MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, 109–117.

Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In