A Hybrid Task-Oriented Dialog System with Domain and Task Adaptive Pretraining
Boliang Zhang, Ying Lyu, Ning Ding, Tianhao Shen, Zhaoyang Jia, Kun Han, Kevin Knight
DiDi AI Labs
{boliangzhang, yinglu, yaeldingning, shentianhaoi, jiazhaoyang, kunhan, kevinknight}@didiglobal.com

Abstract
This paper describes our submission for the End-to-end Multi-domain Task Completion Dialog shared task at the 9th Dialog System Technology Challenge (DSTC-9). Participants in the shared task build an end-to-end task completion dialog system which is evaluated by human evaluation and a user simulator based automatic evaluation. Different from traditional pipelined approaches where modules are optimized individually and suffer from cascading failure, we propose an end-to-end dialog system that 1) uses Generative Pretraining 2 (GPT-2) as the backbone to jointly solve Natural Language Understanding, Dialog State Tracking, and Natural Language Generation tasks, 2) adopts Domain and Task Adaptive Pretraining to tailor GPT-2 to the dialog domain before finetuning, 3) utilizes heuristic pre/post-processing rules that greatly simplify the prediction tasks and improve generalizability, and 4) equips a fault tolerance module to correct errors and inappropriate responses. Our proposed method significantly outperforms baselines and ties for first place in the official evaluation. We make our source code publicly available.
Introduction
Task-oriented dialog systems aim to communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather querying, etc. With the rising trend of artificial intelligence, many devices incorporate virtual assistants, such as Alexa, Siri, and Cortana. Task-oriented dialog systems have attracted attention from both academia and industry as a key component of virtual assistants (Chen, Celikyilmaz, and Hakkani-Tur 2018; Gao, Galley, and Li 2018).

Real-world dialog systems usually need to deal with complex tasks containing multiple goals and spanning multiple domains, which poses great challenges to existing task-oriented dialog systems. Traditionally, a task-oriented dialog system uses a pipeline architecture that consists of the following modules: Natural Language Understanding (NLU), a Dialog Manager (DM) that tracks dialog states and predicts actions, and Natural Language Generation (NLG) (Williams et al. 2014; Bocklisch et al. 2017; Gao et al. 2019; Zhang et al. 2019). These modules are usually isolated and optimized individually. Therefore, errors can propagate from module to module and hurt the overall performance (Ham et al. 2020a; Gao, Galley, and Li 2018). Further, such pipeline-based solutions usually deal with simple tasks within a single domain, requiring rich domain knowledge and expert experience. Hence it is prohibitively expensive to build dialog systems at scale for complex tasks with multiple domains.

Fully data-driven dialog systems have been extensively studied recently due to the success of deep learning. They jointly learn to understand the user's language, inquire databases, and compose responses. These end-to-end dialog systems do not rely on the traditional components and have shown great potential. Wen et al. (2017); Yang et al. (2017); Ham et al. (2020b) demonstrate that end-to-end systems outperform the traditional pipeline approaches in task-oriented dialog scenarios.
Zhang, Ou, and Yu (2019); Peng et al. (2020) focus on the benchmarks of the MultiWoz dataset and achieve top performance.

In this paper, we introduce our submission for the Multi-domain Task-oriented Dialog Challenge at Dialog System Technology Challenge 9 (DSTC9, Gunasekara et al. (2020)). Participants in the shared task build end-to-end dialog systems that can assist humans to fulfil single or multiple tasks, such as making a restaurant reservation, booking a hotel, etc. There are two types of evaluations: 1) human evaluation, where the organizer recruits Amazon Mechanical Turkers (MTurkers) to chat with the system and assess whether tasks are accomplished, and 2) automatic evaluation, where the organizer provides a user simulator that scores each submission based on its conversation with the system. The human evaluation result is the only metric used for final ranking. Both MTurkers and the user simulator are provided with a clear, pre-defined user goal prior to the conversation, and they chat with the system by following it. To support this, ConvLab-2 (Zhu et al. 2020) was released to serve as the platform for dialog system development and evaluation, providing a user simulator and evaluator so that participants can effectively run offline experiments and evaluations. The task provides the MultiWoz 2.1 (Eric et al. 2019) dataset for system development. In addition, any external datasets and resources are allowed in the shared task.

In this shared task, we adopt the idea of the end-to-end dialog system and propose several novel ideas to improve its performance in the real-world scenario. There are five key components in our system:

• Domain/Task Adaptive Pretraining based on GPT-2
Following Peng et al. (2020), we build our dialog model initialized with GPT-2 to inherit its capability of producing human-like responses, and leverage external topically related datasets (e.g., Schema-guided Dialog (Rastogi et al. 2019), Taskmaster (Byrne et al. 2019)) for pretraining. However, unlike conventional pretraining, we apply domain/task adaptive pretraining (Gururangan et al. 2020) on the external topically related datasets to tailor GPT-2 from raw web-texts to the dialog domain before finetuning with MultiWoz. To the best of our knowledge, this is the first application of this pretraining paradigm to task-oriented dialog systems. This pretraining brings us non-trivial gains in the automatic evaluation.
• Multi-Task Finetuning
As suggested by Peng et al. (2020), we flatten the dialog history, belief states, database query results, and the response into one string, and then finetune the pretrained model on the MultiWoz dataset, optimizing a combination of three objectives: 1) generate belief states conditioned on the dialog history; 2) generate the response conditioned on the belief states and dialog history; and 3) distinguish gold samples from distractors with fake samples. In this way, we formulate the data generation process of the dialog system (NLU, DST, POL, NLG) as a single neural model where the full sequence can be learned in an auto-regressive manner.
• Domain-Aware Data Pre/Post-processing
Although the backbone of the proposed method is an end-to-end model, it is important to apply proper data pre-processing and post-processing because the input and the output of the model are not free-form natural language. As the MultiWoz dataset is based on human-human conversations, we follow Wen et al. (2017); Zhang, Ou, and Yu (2019) to create rules that clean up and delexicalize agent utterances into templates. For example,

"it 's a hotel . there are 5 guesthouses in the area . do you prefer cheap or moderate for the price range ?"

is transformed to:

"it is a [value type] . there are 5 [value type] in the area . do you prefer [value pricerange] or [value pricerange] for the price range ?".

This simplifies the training data and lets the model predict only the templates. To encourage knowledge sharing across semantically similar slots in different domains, we apply domain-adaptive delexicalization (Zhang, Ou, and Yu 2019) and compute the domain at each dialog turn to replace the placeholders with slots in the right domain.
• Fault Tolerance
We adjust GPT-2's decoder configuration, e.g., increasing the beam size, to generate alternative responses when errors or inappropriate responses occur. We observe significant gains in both automatic and human evaluations during our system development.

• "User Interface"
It is a special rule-based post-processing module that polishes agent utterances and makes the conversation smooth. The modifications are only visible to the user. This substantially improves the system performance in the human evaluation.

Our submission significantly outperforms baseline methods and ties for first place in the official evaluation. We make our source code publicly available.

Related Work
Building end-to-end trainable neural networks is becoming a new research trend for task-oriented dialog systems (Wen et al. 2017; Lei et al. 2018; Mehri, Srinivasan, and Eskenazi 2019). The first efforts (e.g., Mehri, Srinivasan, and Eskenazi (2019)) are fusion methods, which attempt to integrate pretrained dialog modules (i.e., NLU, DM, NLG) into a neural dialog model. Sequicity (Lei et al. 2018) is the first seq2seq architecture that integrates belief tracking into end-to-end task-oriented dialog. Though these methods have achieved promising results, they were usually designed for a specific domain, rendering difficulties in generalizing to multiple domains, e.g., the recently proposed multi-domain dataset MultiWoz (Eric et al. 2019). Subsequently, several models were proposed to handle the multi-domain response generation task (Zhao, Xie, and Eskenazi 2019; Chen et al. 2019; Qin et al. 2020). To prevent dialog acts from growing combinatorially with the number of domains, Chen et al. (2019) built a multi-layer hierarchical graph to represent dialog acts and generate responses using a BERT-based dialog policy. Qin et al. (2020) leveraged domain-shared features across domains and proposed a shared-private network, the Dynamic Fusion Network, to learn shared and specific knowledge, explicitly capturing the correlation between domains. However, these works need a significant number of in-domain training examples to achieve good performance. In our system, we aim to generalize to multiple new domains with a few labelled examples via pretraining.

To the best of our knowledge, the works most related to ours are Ham et al. (2020a) and Peng et al. (2020). Ham et al. (2020a), the DSTC8 Track 1 winner, is the first attempt to finetune GPT-2 on a new task-oriented dialog task. Instead of finetuning GPT-2 directly on the target domain, both Peng et al.
(2020) and our model first finetune GPT-2 on large-scale external task-oriented dialog data to tailor it to the dialog domain, and then finetune it on the target new domain. However, Peng et al. (2020) used the same multi-task objectives during the two stages of external dialog data pretraining and target domain data finetuning. In our system, to endow the model with task-oriented language generation ability, we apply Domain/Task adaptive pretraining, i.e., optimizing the GPT-2 language model objective on the external domain dialog data and the target MultiWoz dataset, respectively.

Method
Figure 1 shows an overview of our system. We first pretrain the GPT-2 model on external dialog related datasets, simply using the language modeling objective. Then we fine-tune the dialog domain tailored GPT-2 on the MultiWoz dataset with three distinct objectives. At last, we apply a fault tolerance mechanism and a special "user interface" to polish the predicted responses.

[Figure 1: An overview of our system. Continued pretraining on external dialog datasets (e.g., Schema-Guided, Taskmaster, CamRest676) tailors GPT-2 to the dialog related domain. Multi-task fine-tuning trains GPT-2 to predict turn domain, belief state, and dialog response under the MultiWoz setting; the final objective combines turn domain/belief prediction, response prediction, and a contrastive objective. Two post-processing modules (a fault tolerance mechanism and the "user interface") revise the system predictions and make them more human-readable.]

Our source code is available at https://github.com/boliangz/dstc9. ConvLab-2 is described at https://convlab.github.io/about.html.
The End-to-end Model
The end-to-end model in our system consists of two parts: 1) Domain/Task adaptive pretraining, which continues pretraining GPT-2 on external dialog related datasets to tailor it to the dialog domain, and 2) finetuning GPT-2 on the MultiWoz dataset using three task specific objectives that are carefully designed to make the model learn to accurately predict the belief state and response.
Domain/Task Adaptive Pretraining
Gururangan et al. (2020) propose the idea of "Don't Stop Pretraining" and show that it is still helpful to tailor a pretrained model to the domain of a target task. For Domain Adaptive Pretraining (DAPT), we collect publicly available dialog related datasets, such as Taskmaster and the Schema-guided Dialog dataset, and pretrain GPT-2 on raw utterances using the original GPT-2 language model objective. For Task Adaptive Pretraining (TAPT), we pretrain GPT-2 only on raw utterances of the MultiWoz dataset. We show data statistics in Table 1.
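As a concrete illustration, assembling a DAPT corpus amounts to collecting raw utterances from domain-relevant external dialogs while dropping all annotations. The sketch below assumes a simplified dialog format; the real Taskmaster and Schema-guided formats differ:

```python
# Sketch of DAPT corpus construction: keep only raw utterances (no belief
# or database annotations) from dialogs whose domains overlap the target
# domains, following the domain-relevance advice of Gururangan et al. (2020).

def build_dapt_corpus(dialogs, target_domains):
    """Collect LM pretraining samples from domain-relevant dialogs."""
    samples = []
    for dialog in dialogs:
        if not set(dialog["domains"]) & target_domains:
            continue  # drop dialogs from irrelevant domains
        samples.extend(dialog["utterances"])  # one sample per utterance
    return samples

dialogs = [
    {"domains": ["Hotels"], "utterances": ["i need a hotel", "sure, which area?"]},
    {"domains": ["Banking"], "utterances": ["check my balance"]},
]
corpus = build_dapt_corpus(dialogs, target_domains={"Hotels", "Restaurants"})
# only the two hotel utterances are kept
```

The resulting utterance list is then used as plain text for continued language-model pretraining of GPT-2.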
Multi-task Fine-tuning
Many recent works attempt to use end-to-end neural models, such as sequence-to-sequence models and GPT-2, to solve task-oriented dialog problems, and achieve remarkable results (Wen et al. 2016b; Budzianowski et al. 2018; Zhang, Ou, and Yu 2019; Peng et al. 2020; Ham et al. 2020a). In our work, we follow the finetuning strategy of Peng et al. (2020): we concatenate the dialog history and annotations, flatten them into a string, and then use a combination of three objectives to fine-tune GPT-2.

As shown in the "Multi-task Fine-tuning" section of Figure 1, we flatten pre-processed dialog data into a string of six components, including:

c1: Dialog History. We concatenate the history utterances (plus the user utterance of the current turn) and add "User:" and "System:" to mark the role of each utterance.

c2: Turn Domain. The domain of the current turn.

Belief Prediction: We use c1 (dialog history) to predict c2 (turn domain) and c3 (belief state), and define the objective as

L_B = log p(c2, c3 | c1) = Σ_{t=1}^{T_c} log p_θ(c_t | c_{<t}),

where T_c is the number of tokens in the concatenated (c2, c3) sequence and c_{<t} denotes all preceding tokens, including the dialog history c1.
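As an illustration of this flattening, a single training sample can be serialized as below. The delimiter markers and field layout here are illustrative stand-ins, not the exact special tokens used in our implementation:

```python
# Minimal sketch: flatten one dialog turn into a single training string of
# the form history -> turn domain -> belief state -> DB result -> response.
# The <domain>/<belief>/<db>/<response> markers are hypothetical delimiters.

def flatten_turn(history, turn_domain, belief, db_result, response):
    """Serialize one training sample for auto-regressive fine-tuning."""
    history_str = " ".join(f"{role}: {utt}" for role, utt in history)
    belief_str = ", ".join(f"{slot} = {value}" for slot, value in belief.items())
    return (
        f"{history_str} "
        f"<domain> {turn_domain} "
        f"<belief> {belief_str} "
        f"<db> {db_result} "
        f"<response> {response}"
    )

sample = flatten_turn(
    history=[("User", "i am looking for a cheap hotel .")],
    turn_domain="hotel",
    belief={"pricerange": "cheap"},
    db_result="10 matches",
    response="there are 10 [value pricerange] hotels . do you have an area in mind ?",
)
```

Because the whole sample is one token sequence, a single auto-regressive language model can learn belief prediction and response generation jointly.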
We utilize heuristic pre/post-processing rules that largely simplify the prediction tasks and improve generalizability.
• Domain-Adaptive Delexicalization
In order to address the problem of the massive number of entities in system responses, we follow the delexicalization pipeline suggested by Wen et al. (2017) to generate delexicalized sentences with placeholders during pre-processing, and then apply the post-processing module to the predicted sentence by replacing the placeholders with the corresponding DB record. However, in the MultiWoz dataset many slots exist in multiple domains. For example, the slots name, type, and address exist in both the restaurant and hotel domains. Delexicalizing the same slot into different placeholders, e.g., [restaurant name] and [hotel name], places a great burden on the system when generating placeholder tokens. To address this problem, we apply domain-adaptive delexicalization (Zhang, Ou, and Yu 2019), which uses a single placeholder [value name] to encourage knowledge sharing across semantically similar slots in different domains. Since our model predicts the domain for each turn, there is no ambiguity in the post-processing stage for the placeholder replacement.

• Turn Domain Computing
The turn domain refers to the domain involved in the current dialog turn. It is critical for the model to predict an exact turn domain to facilitate post-processing, e.g., so that a domain-adaptive placeholder can be replaced with the slot in the right domain. To this end, we need to compute the turn domain from the MultiWoz dataset to feed to the model during training, as there is no turn domain label in the dataset. Although one dialog may involve multiple domains, domains in a dialog usually do not change back and forth. Based on this observation, we compute turn domains by tracking the changes of the labeled belief states across dialog turns. Specifically, for each domain with a non-empty constraint in the belief state of the current dialog turn, if the domain is new, i.e., not yet mentioned in the dialog history, or its corresponding constraint has been updated in the current turn, then the domain is appended to the turn domain. If the computed turn domain is empty, it inherits the turn domain of the previous turn, which is initialized as the domain general. As each dialog turn in MultiWoz usually involves only one domain, the final computed turn domain usually includes one domain.
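The heuristic above can be sketched as follows; this is a simplified version assuming cumulative per-turn belief states (domain-to-constraint dictionaries), not our exact implementation:

```python
# Sketch of the turn-domain heuristic: a domain enters the turn domain if it
# is new or its belief-state constraint changed in the current turn; an empty
# result inherits the previous turn's domain ("general" initially).

def compute_turn_domains(belief_states):
    """belief_states: one dict per turn mapping domain -> constraint dict."""
    turn_domains, prev_belief, prev_turn_domain = [], {}, ["general"]
    for belief in belief_states:
        current = []
        for domain, constraint in belief.items():
            if not constraint:
                continue  # skip domains with empty constraints
            is_new = domain not in prev_belief
            updated = prev_belief.get(domain) != constraint
            if is_new or updated:
                current.append(domain)
        if not current:            # nothing changed: inherit previous domain
            current = prev_turn_domain
        turn_domains.append(current)
        prev_belief, prev_turn_domain = belief, current
    return turn_domains

domains = compute_turn_domains([
    {"hotel": {"pricerange": "cheap"}},                         # new: hotel
    {"hotel": {"pricerange": "cheap"}},                         # unchanged: inherit
    {"hotel": {"pricerange": "cheap"}, "taxi": {"dest": "x"}},  # new: taxi
])
```

On the toy trace above, the computed turn domains are hotel, hotel (inherited), and taxi, mirroring how a MultiWoz dialog drifts between domains one at a time.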
• Data Cleaning and Normalization
We apply the following steps to clean the dataset during pre-processing. First, all dialog utterances are lower-cased for easier manipulation. Additionally, dialogs are processed as a set of dialog turns, each of which consists of one user utterance and one system response; consecutive user utterances between system responses are merged into one. Recall that model training samples are composed of dialog history, belief state, database state, and response. To get more dialog context, we maximize the number of dialog turns in the dialog history while keeping each sample length within 512 tokens. Moreover, different slot names/values with the same semantics are normalized to the same slot/value. For example, departure and pickup location are normalized to departure. We apply this data cleaning and normalization process to MultiWoz and all of the external datasets used for pretraining.
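The delexicalization step described above can be sketched as a value-to-placeholder substitution. The slot inventory below is a toy subset for illustration, not our full rule set:

```python
# Minimal delexicalization sketch: replace slot values found in an agent
# utterance with domain-agnostic placeholders, as in domain-adaptive
# delexicalization. SLOT_VALUES is an illustrative toy inventory.
import re

SLOT_VALUES = {
    "value type": ["hotel", "guesthouse"],
    "value pricerange": ["cheap", "moderate", "expensive"],
}

def delexicalize(utterance):
    for placeholder, values in SLOT_VALUES.items():
        for value in values:
            # word-boundary match so substrings of other words are untouched
            utterance = re.sub(rf"\b{re.escape(value)}\b",
                               f"[{placeholder}]", utterance)
    return utterance

template = delexicalize(
    "it is a hotel . do you prefer cheap or moderate for the price range ?")
```

At post-processing time the placeholders are filled back in with values from the DB record of the turn's predicted domain.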
Fault Tolerance Mechanism
When the system fails or produces an inappropriate response, we increase the beam size of the GPT-2 decoder from its default of 1 to produce a different response that corrects the error. We list a few representative errors and the corrected results after using this mechanism, sampled from our system outputs:

• Some predicted belief states are malformed. The comma in the following predicted belief state breaks the dictionary data structure and throws an exception.
Dialog Context:
User: What 's the entrance fee for abbey pool, and astroturf pitch ?

Before:
Predicted belief: { name: abbey pool, and astroturf pitch, area: north }

After:
Predicted belief: { name: abbey pool and astroturf pitch, area: north }

• The system may predict bad templates where the placeholders cannot be replaced by any of the slots in the belief states. In the following example, the placeholder "[value day]" cannot be replaced, as the slot type "day" is not in the belief states. Therefore, the system uses the second ranked candidate of the decoder instead and makes a correct response.

Dialog Context:
User: Can you book me a table for 4 people? Can you try 13:00 ?
Predicted belief: restaurant { time = 13:00, people = 4 }

Before:
Predicted template: i have made your reservation for [value day] at 13:00 . your reference number is : [value ref] .
After:
Predicted template: what type of food do you like ?

• Sometimes the system repeats its responses, leading to a bad user experience; it may also cause the dialog to fall off track. The fault tolerance mechanism is triggered when the system generates a response similar to its last one (fuzzy match is used to measure the similarity).
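The repetition trigger can be sketched with a standard fuzzy string match; the 0.75 threshold below is illustrative, as the paper does not specify a value:

```python
# Sketch of the repetition trigger: flag a new response as a repetition if
# it is too similar to the previous one, using difflib's fuzzy ratio.
# The threshold is an assumed value for illustration.
from difflib import SequenceMatcher

def is_repetition(prev_response, new_response, threshold=0.75):
    ratio = SequenceMatcher(None, prev_response, new_response).ratio()
    return ratio >= threshold

prev = "okay , what area of town would you like to stay in ?"
near_dup = "okay , i can help you with that . what area of town would you like to stay in ?"
fresh = "i have 33 guest houses in the east . do you have a price range ?"
```

When `is_repetition` fires, the system falls back to an alternative decode (e.g., a larger beam) instead of emitting the near-duplicate response.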
Dialog Context:
User: i want a place to stay.
Sys: okay , what area of town would you like to stay in ?
User: i don't have preference.
Before:
Sys: okay , i can help you with that . what area of town would you like to stay in ?
After:
Sys: i have 33 guest houses in the east . do you have a price range ?

"User Interface"
In the human evaluation, we notice that some errors are due to missing information in responses. The "User Interface" is a special post-processing module that lets us manually revise the predicted responses to include more information and improve the user experience. Note that, to avoid changing the data distribution, the original utterances, rather than the "User Interface" revised responses, are used as the dialog history for the next turn prediction. We show some examples below:
Dialog Context:
User: i ’ll be leaving from peterborough and going tocambridge on tuesday , sometime after 15:15 . could youbook that for me for 2 people ?
Before:
Sys: sure thing . i 've booked two tickets on tr5933 . the total fee is 33 gbp and your reference number is tl4r46ys .
After:
Sys: booking was successful . departure: peterborough, destination: cambridge, leave time: 15:19, people: 2, price: 33 pounds, reference number: tl4r46ys . is there anything else i can help with ?
In this example, the system correctly books two train tickets based on the user's requirement, but it does not provide the full information of the train to the user and cannot convince the user that it has booked the right train. In the human evaluation error analysis, we notice that some users may further ask questions to confirm the booking, which may cause the system to fall off track. This type of error occurs in bookings in the train, hotel, and restaurant domains. Therefore, we correct the system so that it provides more booking information to the user.
Dialog Context:
User: i would like to find someplace to visit .
Before:
Sys: what area of the town do you prefer ?
After:
Sys: what area of the town do you prefer ? for examplenorth, south, or center ?
In the above example, the user is supposed to provide the system the area of the place, which should be one of the pre-defined values: north, south, west, east, and center. However, sometimes the user goal does not include the area information and the users do not know the pre-defined values, so they may reply with requirements that the system cannot handle, such as "is there anything by the river?" or "i prefer anything close to the mountain.". For such cases, we let the system provide a few options for the user to choose from, or at least inform the user what types of choices they may have. We apply this rule to many scenarios, for example, attraction type and area, hotel price range, restaurant food type, etc. The rule is triggered when the system response matches certain patterns, such as "the utterance starts with which/what/where and ends with a question mark".

This user interface is designed to solve problems in the human evaluation as generally as possible. It does not affect the automatic evaluation but is very helpful in the real-world scenario.
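A minimal version of this trigger can be written as a pattern match on the response. Both the regex and the appended option string below are simplified illustrations of the rule, not our exact templates:

```python
# Sketch of the option-suggestion trigger: fire when the response is an
# open question, e.g. starts with which/what/where and ends with "?".
# The appended AREA_OPTIONS text is an illustrative example for the
# attraction-area scenario only.
import re

OPEN_QUESTION = re.compile(r"^\s*(which|what|where)\b.*\?\s*$", re.IGNORECASE)

AREA_OPTIONS = "for example north, south, or center ?"

def polish(response):
    if OPEN_QUESTION.match(response):
        return f"{response} {AREA_OPTIONS}"
    return response

polished = polish("what area of the town do you prefer ?")
```

A full implementation would select the option list per slot (attraction type, hotel price range, restaurant food type, etc.) rather than always suggesting areas.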
Experiments and Results
Datasets
Dataset for Finetuning
The dataset for model finetuning is MultiWoz 2.1 (Eric et al. 2019), a large-scale human-human multi-domain task-oriented dialog dataset. It contains 8421/1000/1000 dialogs for training/validation/testing, respectively. As this challenge does not use the testing split for evaluation, we append it to the training data and use the combined data for model finetuning.
Dataset for Pretraining
Unlike the model finetuning samples that are composed of dialog history, belief state, database state, and response, the datasets for pretraining include user or system utterances only, excluding belief state and database state information. The rationale for this is that 1) the optimization objective for pretraining is the language model objective, and 2) external dialog datasets have good quality dialog utterances but poor quality labels for belief states or database information. In our experiments, the Task-Adaptive Pretraining (TAPT) dataset is based on MultiWoz. For Domain-Adaptive Pretraining (DAPT), we use the external dialog data, i.e., Schema (Rastogi et al. 2019), CamRest (Wen et al. 2016a), Taskmaster, and MSR-E2E, as shown in Table 1. As suggested in Gururangan et al. (2020), a DAPT corpus from a domain similar to the target domain can improve performance, whereas irrelevant ones may even worsen it. Therefore, we extract from the external dialog data the domains similar to those in MultiWoz. For example, Schema contains 17 domains in total, but we use only four: Restaurants, Hotels, Trains, and Travel (similar to the Attraction domain in MultiWoz). Each sample of the DAPT/TAPT dataset is a dialog utterance within 512 tokens.

Name | Dialogs | Utterances | Domains
MultiWoz 2.1 | 10,421 | 142,840 | Restaurant, Police, Hotel, Train, Taxi, Attraction, Hospital
Schema | 6,969 | 50,192 | Restaurants, Hotels, Trains, Travel
CamRest | 676 | 5,488 | Restaurant
Taskmaster2020 | 5,873 | 98,662 | Restaurant, Food, Hungry, Dessert, Lunch, Dinner, Hotel
Taskmaster2019 | 4,349 | 89,076 | Uber, Restaurant
MSR-E2E | 6,969 | 50,192 | Restaurant, Taxi

Table 1: Statistics of dialog corpora used in our system.

Hyper-parameter | Value
Max sequence length* | 512
Max response length† | 128

* Sequences longer than 512 tokens are truncated from the head. Based on our calculation, 96% of the dialogs in MultiWoz can fit in 512 tokens.
† User or system utterances longer than 128 tokens are truncated from the tail.
‡ Training is terminated if the loss on the development set does not decrease for three evaluations. The model converges in about three epochs.

Table 2: Hyper-parameters.

Training Details

We use the Huggingface (Wolf et al. 2019) GPT-2-small in our system. It has 124M parameters, consists of 12 transformer decoder blocks, and is pre-trained on a large web-crawled dataset of various domains (Radford et al. 2019). The sub-word tokenizer we use is also from the Huggingface transformers module. We use the same configurations for the Domain/Task Adaptive pretraining and the finetuning on the MultiWoz dataset. Training details are listed in Table 2.

Dataset and code links: Taskmaster (https://github.com/google-research-datasets/Taskmaster), MSR-E2E (https://github.com/xiul-msr/e2e_dialog_challenge), Huggingface transformers (https://github.com/huggingface/transformers).

Human Evaluation
Amazon Mechanical Turkers (MTurkers) are recruited to chat with the system to accomplish one or multiple tasks by following a pre-defined user goal. After the conversation, MTurkers annotate whether the tasks are successfully accomplished (Success or Fail), and rate the Language Understanding and Response Appropriateness of the system with scores from 1 to 5. The number of Turns of the conversation is also measured.

As MTurkers have no access to the database, they cannot verify the correctness of entity related information, such as a train ticket ID/price or a restaurant name. Thus, MTurkers are asked to note down all entity related information at the end of the conversation, and the shared task organizers then ground this information to the database to verify its correctness. In the official evaluation, three success rate related metrics are reported:

• Success Rate w/o DB Grounding: the annotation provided by MTurkers (Success or Fail).
• Success Rate w/ DB Grounding: the dialog is a success only if 1) MTurkers mark it as "Success", and 2) the provided entity related information can be found in the database.
• Average Success: the average of the above two scores.

The ranking of the participants is based on Average Success. Table 3 shows the official ranking. We tie for first place with Team 1.
Automatic Evaluation
To ease system development and set up an objective way to evaluate the system, ConvLab-2 provides a user simulator based evaluation where the simulator talks to the system and measures the performance using several metrics. In the official evaluation, user simulator based automatic evaluation scores are reported for reference.

The user simulator is a pipelined dialog system that consists of: 1) BERTNLU, which parses agent utterances into structured information such as dialog acts and slot names/values, 2) a Rule-based Policy that generates user acts using pre-defined rules, and 3) a Template NLG that generates natural language utterances using 900+ templates.

Evaluation metrics include:

• Success Rate: a dialog is considered a Success only if all informable and requestable slots are correctly filled.
• Book Rate: the average rate at which booked entities satisfy the goal constraints across all domains.
• Inform Rate: precision, recall and F-1 scores of how the user requestable slots are filled.
• Turns:
the number of turns of successful dialogs and all dialogs.

Rank | Team | Success Rate Avg.† | w/ DB | w/o DB | Language Understanding | Response Appropriateness | Turns
10 | Team 10 | 19.5 | 6.0 | 33.0 | 3.23 | 2.93 | 18.8
9 | Team 8 | 35.0 | 26.0 | 44.0 | 3.27 | 3.15 | 18.5
8 | Team 9 | 55.2 | 43.2 | 67.2 | 4.15 | 3.98 | 19.2
7 | Team 5 | 58.4 | 50.4 | 66.4 | 4.15 | 4.06 | 19.7
6 | Team 4 | 60.3 | 51.4 | 69.2 | 4.49 | 4.22 | 17.7
5 | Team 3 | 67.8 | 60.0 | 75.6 | 4.56 | 4.42 | 21.0
4 | Team 6 | 70.6 | 60.8 | 80.4 | 4.41 | 4.41 | 20.1
3 | Team 7 | 72.3 | 62.0 | 82.6 | 4.53 | 4.41 | 17.1
1 | Team 1 | 74.8 | 70.2 | 79.4 | 4.54 | 4.47 | 18.5

† Ranking of the teams is based on the average success rate.

Table 3: Official results of the human evaluation. We tie for first place with Team 1. The rank is based on the average success rate. Please refer to the Human Evaluation section for details of the evaluation metrics.
Rank | Team | Success Rate† | Book Rate | Inform P | Inform R | Inform F1 | Turns (succ.) | Turns (all)
10 | Team 10 | 21.4 | 0.0 | 55.4 | 60.0 | 54.1 | 11.0 | 25.9
9 | Team 9 | 44.4 | 26.5 | 57.9 | 64.5 | 58.9 | 12.2 | 14.6
8 | Team 8 | 52.6 | 66.7 | 57.5 | 80.7 | 64.8 | 13.2 | 22.5
7 | Team 7 | 57.8 | 85.0 | 68.7 | 81.6 | 72.6 | 13.7 | 16.4
6 | Team 6 | 67.7 | 90.8 | 70.4 | 85.6 | 75.2 | 12.8 | 14.2
5 | Team 5 | 83.3 | 89.1 | 81.1 | 90.3 | 83.5 | 13.5 | 13.8
4 | Team 4 | 89.8 | 96.3 | 72.4 | 96.0 | 80.1 | 15.1 | 15.8
3 | Team 3 | 90.8 | 96.7 | 81.0 | 95.4 | 85.9 | 13.4 | 13.6
1 | Team 1 | 93.0 | 94.6 | 84.1 | 96.2 | 88.1 | 12.5 | 12.7

† Ranking of the teams is based on the success rate.

Table 4: Official results of the user simulator based automatic evaluation. The rank is based on the success rate. Please refer to the Automatic Evaluation section for details of the evaluation metrics.
Method | Auto. Eval. Success Rate (%)
Vanilla GPT-2 | 88
w/ DAPT | 91
w/ TAPT | 86
w/ DAPT + TAPT | 91

Table 5: Ablation study of Domain Adaptive Pretraining (DAPT) and Task Adaptive Pretraining (TAPT). We use DAPT and/or TAPT to tailor GPT-2 to the dialog domain and then finetune it on MultiWoz. The table shows that DAPT improves the automatic evaluation success rate by 3%.

Table 4 shows the official automatic evaluation results. We rank in second place. Table 5 shows an ablation study of the impact of Domain Adaptive Pretraining (DAPT) and Task Adaptive Pretraining (TAPT). We use DAPT and/or TAPT to pretrain GPT-2 on dialog related data, and then finetune the dialog domain tailored GPT-2 on the MultiWoz dataset. DAPT improves the system by 3% accuracy over the baseline, while TAPT hurts the system performance. The combination of DAPT and TAPT achieves similar performance to DAPT alone. As we use a different random seed from the official evaluation, the scores in Table 5 are not official and not comparable with the official evaluation results.
Conclusions and Future Work
In this paper, we introduce our submission for the End-to-end Multi-domain Task Completion Dialog shared task of DSTC9. Our method uses Domain Adaptive and Task Adaptive pretraining to tailor the GPT-2 model to the dialog domain. We then finetune it on the MultiWoz dataset using three task specific objectives. Finally, we use a fault tolerance mechanism and a special user interface to polish system predictions and make them more human readable. Our system ties for first place in the competition. Future work can focus on optimizing database grounding, which can make the information in the response more consistent with the database.
Acknowledgements
We would like to thank Arkady Arkhangodsky, Han Zhao, and Jianwei Liu for their helpful comments and human evaluation.
References
Bocklisch, T.; Faulkner, J.; Pawlowski, N.; and Nichol, A. 2017. Rasa: Open source language understanding and dialogue management. arXiv preprint arXiv:1712.05181.

Budzianowski, P.; Wen, T.-H.; Tseng, B.-H.; Casanueva, I.; Ultes, S.; Ramadan, O.; and Gašić, M. 2018. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Byrne, B.; Krishnamoorthi, K.; Sankar, C.; Neelakantan, A.; Duckworth, D.; Yavuz, S.; Goodrich, B.; Dubey, A.; Cedilnik, A.; and Kim, K.-Y. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.

Chen, W.; Chen, J.; Qin, P.; Yan, X.; and Wang, W. Y. 2019. Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Chen, Y.-N.; Celikyilmaz, A.; and Hakkani-Tur, D. 2018. Deep learning for dialogue systems. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts.

Eric, M.; Goel, R.; Paul, S.; Kumar, A.; Sethi, A.; Ku, P.; Goyal, A. K.; Agarwal, S.; Gao, S.; and Hakkani-Tur, D. 2019. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines.

Gao, J.; Galley, M.; and Li, L. 2018. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.

Gao, S.; Sethi, A.; Agarwal, S.; Chung, T.; and Hakkani-Tur, D. 2019. Dialog state tracking: A neural reading comprehension approach. arXiv preprint arXiv:1908.01946.

Gunasekara, C.; Kim, S.; D'Haro, L. F.; Rastogi, A.; Chen, Y.-N.; Eric, M.; Hedayatnia, B.; Gopalakrishnan, K.; Liu, Y.; Huang, C.-W.; Hakkani-Tür, D.; Li, J.; Zhu, Q.; Luo, L.; Liden, L.; Huang, K.; Shayandeh, S.; Liang, R.; Peng, B.; Zhang, Z.; Shukla, S.; Huang, M.; Gao, J.; Mehri, S.; Feng, Y.; Gordon, C.; Alavi, S. H.; Traum, D.; Eskenazi, M.; Beirami, A.; Eunjoon; Cho; Crook, P. A.; De, A.; Geramifard, A.; Kottur, S.; Moon, S.; Poddar, S.; and Subba, R. 2020. Overview of the Ninth Dialog System Technology Challenge: DSTC9.

Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; and Smith, N. A. 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv preprint arXiv:2004.10964.

Ham, D.; Lee, J.-G.; Jang, Y.; and Kim, K.-E. 2020a. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Ham, D.; Lee, J.-G.; Jang, Y.; and Kim, K.-E. 2020b. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Lei, W.; Jin, X.; Kan, M.-Y.; Ren, Z.; He, X.; and Yin, D. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In
Proceed-ings of the 56th Annual Meeting of the Association for Com-putational Linguistics .Mehri, S.; Srinivasan, T.; and Eskenazi, M. 2019. Struc-tured fusion networks for dialog. arXiv preprintarXiv:1907.10016 .Peng, B.; Li, C.; Li, J.; Shayandeh, S.; Liden, L.; and Gao,J. 2020. SOLOIST: Few-shot Task-Oriented Dialog with ASingle Pre-trained Auto-regressive Model. arXiv preprintarXiv:2005.05298 .Qin, L.; Xu, X.; Che, W.; Zhang, Y.; and Liu, T. 2020. Dy-namic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog. In
Proceedings of the 58th Annual Meetingof the Association for Computational Linguistics .Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; andSutskever, I. 2019. Language models are unsupervised mul-titask learners.
OpenAI blog .Rastogi, A.; Zang, X.; Sunkara, S.; Gupta, R.; and Khai-tan, P. 2019. Towards scalable multi-domain conversationalagents: The schema-guided dialogue dataset. arXiv preprintarXiv:1909.05855 .Wen, T.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L. M.;Su, P.-H.; Ultes, S.; Vandyke, D.; and Young, S. 2016a. Con-ditional generation and snapshot learning in neural dialoguesystems. arXiv preprint arXiv:1606.03352 .Wen, T.-H.; Vandyke, D.; Mrkˇsi´c, N.; Gasic, M.; Bara-hona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017.A Network-based End-to-End Trainable Task-oriented Di-alogue System. In
Proceedings of the 15th Conference ofthe European Chapter of the Association for ComputationalLinguistics .Wen, T.-H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2016b.A network-based end-to-end trainable task-oriented dia-logue system. arXiv preprint arXiv:1604.04562 .Williams, J. D.; Henderson, M.; Raux, A.; Thomson, B.;Black, A.; and Ramachandran, D. 2014. The dialog statetracking challenge series.
AI Magazine .Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.;Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al.2019. HuggingFace’s Transformers: State-of-the-art NaturalLanguage Processing.
ArXiv .Yang, X.; Chen, Y.-N.; Hakkani-T¨ur, D.; Crook, P.; Li, X.;Gao, J.; and Deng, L. 2017. End-to-end joint learning ofnatural language understanding and dialogue manager. In .Zhang, J.-G.; Hashimoto, K.; Wu, C.-S.; Wan, Y.; Yu, P. S.;Socher, R.; and Xiong, C. 2019. Find or classify? dual strat-egy for slot-value predictions on multi-domain dialog statetracking. arXiv preprint arXiv:1910.03544 .Zhang, Y.; Ou, Z.; and Yu, Z. 2019. Task-Oriented DialogSystems that Consider Multiple Appropriate Responses un-der the Same Context. arXiv preprint arXiv:1911.10484 .hao, T.; Xie, K.; and Eskenazi, M. 2019. Rethinkingaction spaces for reinforcement learning in end-to-end di-alog agents with latent variable models. arXiv preprintarXiv:1902.08858 .Zhu, Q.; Zhang, Z.; Fang, Y.; Li, X.; Takanobu, R.; Li, J.;Peng, B.; Gao, J.; Zhu, X.; and Huang, M. 2020. Convlab-2:An open-source toolkit for building, evaluating, and diag-nosing dialogue systems. arXiv preprint arXiv:2002.04793arXiv preprint arXiv:2002.04793