AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation

Jonáš Kulhánek, Vojtěch Hudeček, Tomáš Nekvinda, Ondřej Dušek
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics
[email protected], {hudecek,nekvinda,odusek}@ufal.mff.cuni.cz
Abstract
Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiple sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model achieves state-of-the-art performance on the MultiWOZ data and shows competitive performance in human evaluation.
1 Introduction

Unlike traditional task-oriented systems based on modularized pipelines (Young et al., 2013; Gao et al., 2019), end-to-end dialogue systems integrate nearly all functionality required to hold a dialogue into a single neural network (Wen et al., 2017; Eric et al., 2017; Lei et al., 2018), reducing error propagation and data annotation requirements. While these systems are not yet ready for production use, they made considerable progress in recent years, especially with the advent of pre-trained neural language models (LMs) (Devlin et al., 2019; Radford et al., 2019; Zhang et al., 2020c). Systems such as GPT-2 finetuned by Budzianowski and Vulić (2019) show that with an LM pre-trained on a large number of general-domain dialogues without annotation, only small amounts of data are required to perform in a given task-oriented domain.

The pre-trained LMs still run enormous risks. First, solely training for response generation may result in a lack of grounding for the responses, where the LM hallucinates words without any relation to the database. This has been addressed by multi-task training and auxiliary training objectives (Peng et al., 2020) to an extent. Second, finetuning on small datasets may reduce response diversity and fluency due to neural networks' known propensity for catastrophic forgetting (Greco et al., 2019) – the model overfits the finetuning set too tightly, "forgetting" the language modeling capabilities learned during pre-training.

This paper presents an end-to-end model for multi-domain task-oriented response generation on the MultiWOZ data (Budzianowski et al., 2018), where we address the above problems with pre-trained LMs. AuGPT is based on the DialoGPT pre-trained LM. To improve response consistency, we build upon Peng et al. (2020)'s auxiliary training objectives and introduce improvements.
For higher output diversity and fluency, we apply massive training data augmentation through additional task-oriented datasets and back-translation (Sennrich et al., 2016). In addition to standard automatic evaluation, we evaluate our system in a shared task human evaluation campaign (https://convlab.github.io/), where it reaches high performance. We also include a detailed error analysis on a sample of dialogues collected in-house by experts.

Our contributions can be summarized as follows:

• We show that augmenting data using back-translation markedly improves performance in task-oriented settings.
• We introduce an improved dialogue consistency classification task used as an auxiliary training objective, based on subtle changes to the dialogue state (instead of replacing the state with a random one). This brings additional performance improvement.
• To further improve output diversity, we experiment with unlikelihood loss (Welleck et al., 2020; Li et al., 2020).
• We provide results for our model and comparisons to multiple baselines on MultiWOZ versions 2.0 and 2.1. To our knowledge, our system obtains state-of-the-art results for an end-to-end model on this dataset.

To advocate research transparency and reproducibility, we publish our augmented training data, source code, and pre-trained models on GitHub.

2 Related Work

While the first attempts to build generative end-to-end task-oriented systems mimicked the traditional dialogue system components (Wen et al., 2017), the task was soon recast as a sequence prediction problem in a two-stage setup. A sequence-to-sequence (seq2seq) model first generates the belief state based on dialogue context, then generates the system response based on the context and the belief state (Sequicity; Lei et al., 2018).
The basic recurrent seq2seq architecture (Sutskever et al., 2014) was enhanced with various attention mechanisms to support copying tokens from the context into the belief state or from the belief state onto the output (Wu et al., 2019; Shu et al., 2019). Database operations typically stay outside the model, but some approaches even include the database in the model via attention over possible entries (Eric et al., 2017; Wen et al., 2018; Madotto et al., 2018).

Recently, large-scale task-oriented datasets were proposed (Budzianowski et al., 2018; Byrne et al., 2019; Rastogi et al., 2020) to encourage research in the field of multi-domain dialogue modeling. Even though the aforementioned approaches achieved promising results on smaller domains such as CamRest676 (Wen et al., 2016), their ability to generalize over multiple domains remains limited (Zhang et al., 2020b). To address this, Zhang et al. (2020a) introduce the LABES-S2S model that – in addition to a two-stage seq2seq approach – models belief states as discrete latent variables. Zhang et al. (2020b) present DAMD, a three-stage seq2seq architecture which explicitly decodes the system action. They optimize for multiple good actions given a single belief state.

Our code and pre-trained models are available at https://github.com/ufal/augpt/.

The line of research that is closest to our work makes use of large pre-trained LMs based on the transformer architecture (Vaswani et al., 2017) such as GPT-2 (Radford et al., 2019) or BERT (Devlin et al., 2019). Finetuned variants of these models achieved state-of-the-art results on many NLP tasks, e.g., question answering or natural language inference (Liu et al., 2019), and they seem to be useful even in the context of dialogue systems. For example, Wu et al. (2020) propose finetuning BERT (Devlin et al., 2019) for task-oriented dialogue on multiple datasets; Zhang et al.
(2020c) extended the GPT-2 LM to model open-domain chit-chat. We follow research initiated by Budzianowski and Vulić (2019), who use GPT-2 to model multi-domain task-oriented dialogues. Recently, three similar modifications to their model were proposed, namely SOLOIST (Peng et al., 2020), SimpleTOD (Hosseini-Asl et al., 2020), and the approach by Ham et al. (2020). Our work extends these models and proposes a novel training approach and data augmentation strategies based on back-translation (Edunov et al., 2018; Federmann et al., 2019).
3 Method

The task-oriented setting requires the dialogue system to respond adequately to the user's input and fulfill its goal. The goal could be, e.g., booking a train or requesting restaurant details. To achieve that, the system has to process the user's input, keep track of the belief state with respect to user preferences regarding individual in-domain attributes (slots), and generate a relevant response in natural language. The system also must be able to interact with an external database to incorporate the necessary information into the generated response (see Figure 1 for an example).

Due to its excellent language modeling and language generation capabilities, we have chosen the pre-trained GPT-2 LM as our system's backbone architecture. Similarly to Budzianowski and Vulić (2019), we use the LM to model both the belief state and the response.
The training instances for an LM-based task-oriented dialogue system can be considered as tuples (c, b, d, r), where r is the system's response, c is the context (i.e., a concatenation of all previous utterances in the dialogue – both the system's and the user's), b is the system's belief state, which is also used for querying the database, and d are the database results.

Figure 1: The architecture of AuGPT. The pipeline runs in two stages. First, a finetuned GPT-2 LM is used to predict a belief state. Then the database results are obtained and everything is passed to the GPT-2 again to predict a final delexicalized response, along with possible auxiliary tasks (belief consistency, intent classification, system action classification). Unlikelihood loss is used for response prediction training.

In our case, the dialogue system handles multiple domains and the belief state is a set of pairs (domain name, domain belief), where the domain belief is an assignment of values into slots, i.e., a set of pairs (slot name, value) (see Example 1). Similarly, the database results d are a set of pairs (domain name, domain database results), where the domain database results are an ordered list of entities returned by the database. We further define the database result counts d_c denoting the number of results in d for each domain.

Ideally, we would like our system to model the probability distribution over possible responses conditioned on the context p(r | c).
To simplify computation and model the interaction with an external database, this distribution can be factorized as follows:

  p(r | c) = Σ_d p(r | d, c) p(d | c)
           = Σ_d Σ_b p(r | d, b, c) p(d | b) p(b | c)
           = Σ_b p(r | Query(b), b, c) p(b | c),    (1)

where p(d | b) is a deterministic distribution over the database results, and Query is a function returning database results.

By using this formulation and by modeling p(r | d, b, c) and p(b | c), our model would be able to process the context, query the database, and generate the response based on the database results. However, we would face a problem with data sparsity when estimating parameters of p(r | d, b, c). The reason for the data sparsity is the relatively small size of datasets for task-oriented dialogues and the responses containing underrepresented, sometimes unique words, such as reference numbers, hotel names, etc. To maximally reuse the training samples, we choose to train our model on delexicalized responses (Wen et al., 2015), denoted r̄, where slot values are replaced with placeholders. During inference, the responses are lexicalized back deterministically using both the belief state and the database results. We assume perfect lexicalization, i.e., always being able to lexicalize the response r̄ back based on d and b.

Both the database lookup and the lexicalization are deterministic, and the delexicalized response r̄ does not depend on the database results d, but only on their counts d_c. Therefore, the distribution p(r | d, b, c) is equal to the distribution p(r̄ | d_c, b, c), and by maximizing its likelihood we are achieving the goal of maximizing the likelihood of p(r | c).

We use the same language model p̂ to model the belief state and to generate the delexicalized prediction.
That is,

  p(r̄ | d_c, b, c) ≈ p̂(r̄ | d_c, b, c, θ),    (2)
  p(b | c)         ≈ p̂(b | ∅, ∅, c, θ),       (3)

where we denote the model's parameters as θ. We found in our experiments on the MultiWOZ data (see Section 4) that the perfect-lexicalization assumption was almost always fulfilled.

In the MultiWOZ dataset (Budzianowski et al., 2018; Eric et al., 2020, see Section 4), responses are delexicalized by replacing concrete values with placeholder tokens of the form domain_slot. For better generalization across domains, we chose to use only slot instead. We had noticed it was never the case that a response would involve more than one domain. Therefore, we decided to train our model to detect the active domain and used the predicted active domain during the final lexicalization. The model predicts the active domain by outputting it as the first domain in the belief state. The other domains then follow in lexicographical order. The disadvantage of this approach is that we cannot determine the active domain if the belief state is empty. However, in such a case the lexicalization would fail anyway, so the system's performance is not affected by this decision.

Example 1: String format for AuGPT's belief state and database result count.
  Belief state: train {leave at = 15:30, arrive by = 17:15}, hotel {price range = cheap}
  DB: train 23 matches, hotel no match
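To make the delexicalization and the deterministic lexicalization back concrete, here is a minimal sketch; the function names and the flat slot–value layout are our own simplification, not AuGPT's actual implementation:

```python
# Simplified delexicalization with domain-agnostic [slot] placeholders, and the
# deterministic inverse used at inference time. Illustrative sketch only.

def delexicalize(response: str, values: dict) -> str:
    """Replace concrete slot values in the response with [slot] placeholders."""
    for slot, value in values.items():
        response = response.replace(value, f"[{slot}]")
    return response

def lexicalize(delexicalized: str, values: dict) -> str:
    """Substitute placeholders back with values from the belief state / database."""
    for slot, value in values.items():
        delexicalized = delexicalized.replace(f"[{slot}]", value)
    return delexicalized

values = {"name": "sitar tandoori", "time": "13:30"}
response = "I booked sitar tandoori for 13:30."
delex = delexicalize(response, values)   # "I booked [name] for [time]."
assert lexicalize(delex, values) == response
```

Training on such delexicalized strings lets rare values (reference numbers, venue names) share a single placeholder token, which is exactly the data-sparsity argument made above.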
To generate the belief state and to input the database result counts to our model, we need a string representation. To fully exploit pre-training on natural language texts, we have chosen a compact representation containing as few special tokens as possible (see Example 1).
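As an illustration, the compact string format of Example 1 can be produced and parsed roughly as follows; this is a sketch under our own assumptions about the exact token layout, and `serialize_belief`/`parse_belief` are hypothetical helpers, not AuGPT's code:

```python
import re

# Sketch of the compact belief-state string format from Example 1:
# 'train {leave at = 15:30, arrive by = 17:15}, hotel {price range = cheap}'

def serialize_belief(belief: dict) -> str:
    """{"train": {"leave at": "15:30"}} -> 'train {leave at = 15:30}'"""
    parts = []
    for domain, slots in belief.items():
        inner = ", ".join(f"{slot} = {value}" for slot, value in slots.items())
        parts.append(f"{domain} {{{inner}}}")
    return ", ".join(parts)

def parse_belief(text: str) -> dict:
    """Inverse of serialize_belief for well-formed inputs."""
    belief = {}
    for domain, inner in re.findall(r"(\w[\w ]*?)\s*\{(.*?)\}", text):
        slots = {}
        for pair in filter(None, (p.strip() for p in inner.split(","))):
            slot, _, value = pair.partition("=")
            slots[slot.strip()] = value.strip()
        belief[domain.strip()] = slots
    return belief
```

Keeping the format close to natural language (plain words, commas, and braces rather than special tokens) is what lets the pre-trained LM reuse its existing vocabulary.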
Although the parameters are shared for the belief state predictor and the delexicalized response predictor, the training objectives slightly differ. We use the cross-entropy loss for both predictions. For the response prediction, the unlikelihood loss (Welleck et al., 2020; Li et al., 2020) is used as an additional objective. The unlikelihood loss gives a penalty for each repeated token, which helps the model avoid repetitions and makes frequent words less likely, increasing the answers' diversity.

To help the model learn a better internal representation from the data, we employ additional auxiliary tasks. Similarly to Devlin et al. (2019) and Peng et al. (2020), we train a binary classifier to detect dialogue inconsistencies. In each training batch, we corrupt half of the samples by randomly applying one or more of the following changes with the same probability:

1. We replace the belief state b with another belief state, sampled uniformly at random from the training data.
2. We replace the delexicalized response r̄ with a different randomly chosen one. If this change is applied in combination with the first one, the delexicalized response and the belief state are taken from the same random sample.
3. A different valid value is uniformly sampled for each slot in the belief state. In this case, the domain names and domain order are unchanged (i.e., the active domain is the same).

The first two changes are the same as those applied by Peng et al. (2020), whereas the third one is new; we find it very useful in the context of multiple domains, where it is much more challenging to detect that the belief state was changed when the domain names are kept the same. The consistency detection binary classifier is trained to recognize negative samples from the positive ones based on logits of the last response token.
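The three corruption operations can be sketched as follows; this is an illustrative simplification in which the sample layout, the ontology lookup, and the way operations are combined are our assumptions, not the paper's exact procedure:

```python
import random

# Sketch of building a negative sample for the consistency classifier by applying
# one or more of the three corruption operations described above.

def corrupt(sample, training_data, ontology):
    """Return a corrupted copy of `sample` together with consistency label 0."""
    corrupted = {"belief": {d: dict(s) for d, s in sample["belief"].items()},
                 "response": sample["response"]}
    ops = random.sample([1, 2, 3], k=random.randint(1, 3))
    other = random.choice(training_data)
    if 1 in ops:
        # (1) Replace the belief state with one from a random training sample.
        corrupted["belief"] = {d: dict(s) for d, s in other["belief"].items()}
    if 2 in ops:
        # (2) Replace the delexicalized response; when combined with (1), the
        #     response and belief come from the same random sample.
        corrupted["response"] = other["response"]
    if 3 in ops:
        # (3) Resample a valid value for every slot, keeping domain names and
        #     their order intact (the active domain stays the same).
        for domain, slots in corrupted["belief"].items():
            for slot in slots:
                slots[slot] = random.choice(ontology[domain][slot])
    return corrupted, 0
```

Operation (3) is the subtle one: only slot values move, so the classifier cannot rely on a domain mismatch and must actually compare the belief state against the dialogue.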
It is represented by an affine classifier trained using binary cross-entropy (BCE).

We also experiment with two additional classifiers predicting the user intent and the system action. These are implemented as two fully-connected layers attached to the feature representations of the last context token and the last database result token, respectively. However, based on our experimental results, we decided not to use these tasks in the final model.

We train the whole pipeline by optimizing the non-weighted sum of the individual component losses, i.e., cross-entropy for the belief state and the response prediction, unlikelihood loss for the response, and BCE for the consistency detection are summed in our final system.

For each user input, the system transitions through several stages before the final response is generated. First, only the previous dialogue context is passed to the LM, which greedily generates the string representation of the belief state. The belief state is then parsed and passed to the database handler. The database handler then constructs a query and returns a set of results for each domain. We take the number of results for each domain and generate the string representation of database result counts (see Example 1). All strings are concatenated and again passed to the language model. This time, we utilize nucleus sampling (Holtzman et al., 2020) to generate the delexicalized response. We found nucleus sampling useful for generating the response since it increases diversity, but we prefer greedy decoding for the belief state with a fixed structure. Finally, the tokens in the delexicalized response are substituted by values from the database results and the belief state. The process is illustrated in Figure 1.
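The inference loop above can be summarized in a short sketch; all components are hypothetical stand-ins passed as arguments (in AuGPT, a single finetuned GPT-2 handles both decoding stages):

```python
# Two-stage decoding sketch: greedy belief state, then nucleus-sampled response.
# `lm_greedy`, `lm_nucleus`, `query_database`, and `lexicalize` are illustrative
# stand-ins, not the actual AuGPT components.

def respond(context, lm_greedy, lm_nucleus, query_database, lexicalize):
    # Stage 1: greedily decode the belief state from the dialogue context only.
    belief = lm_greedy(context)
    # Query the database; only the per-domain result *counts* are fed to the LM.
    results = query_database(belief)
    counts = {domain: len(entities) for domain, entities in results.items()}
    # Stage 2: nucleus sampling generates the delexicalized response.
    delexicalized = lm_nucleus(context, belief, counts)
    # Deterministic lexicalization using the belief state and database results.
    return lexicalize(delexicalized, belief, results)
```

The split between greedy decoding (for the rigidly structured belief string) and nucleus sampling (for the free-form response) mirrors the diversity argument made above.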
Following its successful usage in other NLP tasks (Konstas et al., 2017; Elder et al., 2020), we experiment with data augmentation using paraphrases, i.e., variants of training utterances with different surface forms. In our setup, we generate multiple paraphrases for each training utterance and use them to augment the training data. This way, we effectively increase the variability of the data.

Generating paraphrases is not a trivial process. Various data-driven approaches were proposed, the majority of them corpora-based (Madnani and Dorr, 2010). Recently, machine translation systems demonstrated strong performance in generating paraphrases using the back-translation procedure (Sennrich et al., 2016; Edunov et al., 2018; Federmann et al., 2019). We take advantage of these findings and use a trained multilingual machine translation model (Macháček et al., 2020; Edunov et al., 2018) to paraphrase our data. We employ ten intermediate languages and thus obtain a set of different paraphrases for each input utterance. When training, we choose the input user utterance uniformly at random from the set of all variants of the utterance, including the original one.
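A rough sketch of the augmentation procedure, assuming a generic `translate(text, src, tgt)` MT interface; the pivot-language list below is illustrative, not necessarily the set used in the paper:

```python
import random

# Back-translation paraphrasing sketch: translate each utterance into a pivot
# language and back, once per pivot. `translate` is a hypothetical stand-in for
# the multilingual MT model.

PIVOTS = ["cs", "de", "es", "fi", "fr", "it", "nl", "pl", "pt", "ru"]  # ten pivots

def paraphrase_set(utterance, translate):
    """Return the original utterance plus one back-translation per pivot language."""
    variants = {utterance}
    for lang in PIVOTS:
        pivoted = translate(utterance, src="en", tgt=lang)
        variants.add(translate(pivoted, src=lang, tgt="en"))
    return sorted(variants)

def sample_training_utterance(variants):
    # During training, pick uniformly at random among all variants (incl. original).
    return random.choice(variants)
```

Deduplicating via a set matters in practice: several pivots often yield the same English paraphrase, and the original may come back unchanged.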
4 Experiments

We consider a series of experiments to compare our model to current state-of-the-art methods, and include human evaluation and detailed error analysis. We also carefully evaluate all proposed contributions through a series of ablation experiments.
4.1 Datasets

We have used several datasets for training our system and for the final evaluation and comparison. We use MultiWOZ 2.1, an enhanced version of MultiWOZ 2.0 (Budzianowski et al., 2018) that reduces the amount of noise in the data; we also use the 2.0 version in additional experiments so that we can compare to previous works. The dataset contains 7 distinct domains (all related to tourist information, e.g., hotels, restaurants) and 10,438 dialogues, 7,032 of which are multi-domain.

We experiment with pre-training our model on additional datasets. For the pre-training phase, we use Taskmaster-1 (Byrne et al., 2019) and Schema-Guided Dialogue (Rastogi et al., 2020). Both Taskmaster-1 and Schema-Guided Dialogue are multi-domain, task-oriented, large dialogue corpora consisting of 12,215 and 22,825 dialogues, respectively. Taskmaster-1 was obtained using the Wizard-of-Oz and self-dialogue methods, while the collection of Schema-Guided Dialogue is somewhat artificial – humans are only employed to paraphrase machine-generated utterances.
4.2 Data Preprocessing

Although the MultiWOZ 2.1 dataset was collected by humans, it contains a lot of inconsistencies. We hypothesize that when using only clean samples which are consistent with the database, the benefit of higher-quality training data outweighs the decrease in the number of training samples. This claim is further supported by experiments (see the Ablation section). To filter the training data, we choose only those dialogues where the annotated dialogue goal corresponds with the turn-level annotated data. When using the clean samples, we omit about 30% of the training data.

To effectively combine all our datasets, we unified the domain-slot pairs in the belief states and the delexicalization. However, the datasets use different naming conventions (e.g., leaveAt vs. leave_at) and different domain and slot names even though the corresponding domain-slot pairs describe the same concepts (e.g., restaurant-food vs. restaurant-type). Therefore, we created a new unified ontology and manually designed a mapping between slot names. Notably, we decided to rename some slots so they use natural language tokens, as we base our model on the GPT-2 LM which is pre-trained on natural language texts (e.g., "leaveAt" → "leave at"). Our final ontology that unifies all three datasets contains 22 domains and 135 slots.

We use our own implementation of delexicalization, which directly produces our belief state string representation (see Section 3.1 and Example 1).

[Table 1: Comparison with previous works on the MultiWOZ dataset (see Section 4.4 for a description of the metrics). MD-Sequicity is a variant of Lei et al. (2018)'s model, extended for a multi-domain setting. Recovered values – Human: MultiWOZ 2.0 inform 91.0, success 82.7; MultiWOZ 2.1 inform 86.3, success 79.1; the remaining rows, including AuGPT's, were not recovered.]

[Table 2: ConvLab evaluation comparison with other works (see Section 4.5 for a description of the metrics). Columns: complete, success, book, inform P / R / F1, turn (succ / all); the table rows were not recovered.]
4.3 Training Details

We implement our model in the PyTorch framework (Paszke et al., 2019). The model extends the small variant of the GPT-2 model. It consists of 12 transformer blocks with a model layer size of 768, having 124 million parameters in total. For all auxiliary tasks, we use a dropout of 0.1 with label smoothing 0.1. We use the AdamW optimizer (Loshchilov and Hutter, 2019). For greater training effectiveness, we employ mixed-precision training (Micikevicius et al., 2018) through PyTorch AMP. The finetuning runs for 8 epochs on the MultiWOZ 2.1 data when all the training examples are used, and for the corresponding number of minibatches when using only clean samples (i.e., with fewer training examples). The training takes less than one day when using 4 GPUs.
4.4 Automatic Metrics

To compare with previous results on MultiWOZ, we evaluate the model performance with a set of corpus-based intrinsic metrics on both versions of the data. In the case of MultiWOZ 2.0, we use the original delexicalization used also by the other compared methods (Peng et al., 2020; Hosseini-Asl et al., 2020; Zhang et al., 2020b). For MultiWOZ 2.1, we use our own delexicalization. We employ the original evaluation scheme by Budzianowski et al. (2018), which provides two metrics – the inform rate and the success rate. The inform rate is the percentage of dialogues in which the system provided an appropriate entity, whereas the success rate is the percentage of dialogues in which the system outputted all the requested information. Additionally, we compute the BLEU score (Papineni et al., 2002) between the generated system utterances and the ground truth to get an approximation of the output fluency. Note that both the inform rate and the success rate are unaffected by using a different delexicalization, so these metrics can be directly compared to other methods. A different delexicalization could, however, yield a slightly different BLEU score, but based on preliminary results, we believe this change has almost no effect.
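As a toy illustration of the success-rate idea (the official MultiWOZ evaluation script is considerably more involved; this sketch only captures the "all requested information provided" condition, not the entity matching of the inform rate):

```python
# Minimal sketch of a success-rate computation: the percentage of dialogues in
# which every user-requested slot was eventually provided by the system.

def success_rate(dialogues):
    """dialogues: list of (requested_slots, provided_slots) pairs of sets."""
    successful = sum(1 for requested, provided in dialogues if requested <= provided)
    return 100.0 * successful / len(dialogues)
```

For example, if one dialogue provides all requested slots and a second one misses a slot, the sketch yields a success rate of 50.0.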
4.5 ConvLab Evaluation

We use the ConvLab 2 platform (Zhu et al., 2020) for automatic evaluation. The platform includes an agent-based evaluation component; therefore, we obtain results gathered from interaction between our system and the simulated user agent. We run the evaluation component 1,000 times, i.e. on 1,000 simulated conversations. The agent mimics user behavior, interacts with the system under evaluation, and computes multiple metrics, among which the most relevant are the complete, success, and book rates.

[Table 3: Human evaluation results obtained during the shared task using Amazon Mechanical Turk. Note that only 4 submissions outperformed the Baseline according to the average success metric. Recovered row – Baseline: average success 69.6, success w/ DB 56.8, success w/o DB 82.4, NLU score 4.34, response appropriateness 4.18, turns 18.5; the Winner row was not recovered.]
The complete rate reflects the ratio of dialogues that are completed, i.e. all the user requests have been met. The success rate computes the percentage of dialogues which are successful, meaning the system captures the correct informed entities and provides a valid booking if requested. Finally, the book rate is the proportion of dialogues where the system was able to book the correct entity (hotel, restaurant, train) if it was asked to. We also compute precision, recall and F1 score for the informed entities, and the average number of turns in the dialogue.
4.6 Human Evaluation

Thanks to our participation in a task-oriented dialogue shared task, the best one of our submissions was evaluated by human judges on the Amazon Mechanical Turk platform. The judges communicated with the agent in natural language and rated the system afterward with respect to the success/failure of the dialogue, language understanding score, and response appropriateness. Information provided by the system was additionally checked for consistency with the database, and the average of the success rates given by the judges and by database grounding is used as the main metric.

In addition to the crowdsourced evaluation, we perform a detailed in-house error analysis of the model behavior based on human interactions with our final system. Expert annotators followed randomly chosen dialogue goals accompanying the MultiWOZ test set and recorded any incorrect system behavior.
5 Results

In this section, we first describe and discuss the quantitative results for both corpus-based and crowdsourced human evaluation. In the second part, we also perform a qualitative analysis of the model behavior.
5.1 Automatic Evaluation

Table 1 shows a comparison between our methods and current state-of-the-art systems, which are described in the Related Work section. Since MultiWOZ 2.1 has been released quite recently and some of the compared methods do not provide results for this version, we report results on both MultiWOZ 2.0 and MultiWOZ 2.1. As we can see, AuGPT outperforms all other approaches in terms of the inform and success metrics. However, DAMD and LABES-S2S produce higher BLEU scores. This would indicate better fluency of these models; however, one would need human evaluation to claim that confidently. One possible reason for this behavior would be our removal of some training samples (see Data Preprocessing), which may have decreased the BLEU score. Importantly, thanks to the higher success metric, we can say that our model is better at providing all the necessary information in the responses.

Table 2 shows a comparison with two other models in the ConvLab evaluation scheme with a simulated user. The compared systems were chosen because they both implement fully trainable end-to-end methods. Our system outperforms both compared systems by a wide margin. Our model is able to perform well not just in a single-turn response generation scenario, but over the course of the whole dialogue. As the example of DAMD shows, this is not always guaranteed.
5.2 Human Evaluation

We show the human evaluation results in Table 3. Our model outperformed the baseline on the main average success metric (along with only half of the other submissions). We did not match the winning submission in terms of average success, but we achieved the best overall success rate based on human judgments (without checking consistency with the database). This suggests that the biggest drawback of our model was the integration with database grounding. This can be caused by our modified delexicalization, which makes it harder in some cases to match the database properly. In addition, our model needed the smallest number of turns overall to complete the evaluation dialogues.

We note that out of 5 submitted model variants in the competition (cf. Table 6), only the model variant scoring best in the automatic evaluation was considered for human evaluation. In our case, the chosen model variant did not include additional data pre-training, which could possibly improve performance in the human evaluation scenario.

Table 4: Interactive analysis performed by human evaluators using 130 prepared dialogue goals. 17 of these dialogues contained an error that caused the dialogue to fail. We show summary statistics regarding the number of respective error sources (BS = belief state, DB = database). Note that some of the dialogues contain more than one error.

                          Erroneous   BS      DB      Policy   Other
                          dialogues   error   error   error    error
  All dialogues               50        30       4       21       6
  Unsuccessful dialogues      17        10       3        2       2

Table 5: Distribution of the most common error types encountered during the human evaluation of 130 dialogues. Absolute counts of errors in the 50 erroneous dialogues are shown. The total error count is 61 as some dialogues contained multiple errors. The most likely source of the error (cf. Table 4) and a short description are given for each type.

  Type                  Count   Source        Description
  Hallucinated values    21     BS/Policy     Used a slot value in the reply that is not grounded in the DB nor in the context
  Wrong lexicalization    6     Policy        Repeats the same value in a list of choices during lexicalization
  Missing information     5     Policy        Makes a booking while not all information is specified
  Ignored input           5     BS            Keeps asking for information that was provided
  Bad domain              4     BS            Fails to focus on the correct domain
  False response          4     Policy        States a different value of a slot than the value stored in the DB
  Repeated output         3     Policy        Repeats the same slot twice on the output
  Failed booking          3     DB/Policy     Booking was unsuccessful due to DB mismatch
  Other                  10     BS/DB/P/Oth   Various rare errors that could not be categorized
5.3 Error Analysis

Our expert annotators evaluated 130 dialogues in total, 50 of which contained at least one error. However, in most cases, the system was able to recover from the errors, resulting in an overall success rate of 86.9% (i.e., 17 unsuccessful dialogues).

The purpose of this analysis was to find out what kinds of errors occur during real interaction. The annotators were familiar with the model architecture and were instructed to categorize the errors according to the cause of the problem. Specifically, they identified which component caused the respective error based on the internal system state. In addition, the annotators categorized the errors into more specific types.

The overall results are given in Table 4. We observe that the most common reason for a failed dialogue is an error related to the belief state. Also, although policy errors happen relatively often, they rarely cause the whole dialogue to fail. We observe that we have a slightly higher number of successful dialogues compared to the 82.6% success rate (without checking database consistency) found in human evaluation (cf. Table 3). The most likely cause of this behavior is that our expert annotators were more motivated to recover from erroneous system behavior and finish the dialogue.

Fine-grained error types identified by the annotators are given in Table 5. We now provide examples for some of the frequent types which deserve more detail. By far the most common error was hallucinated values – lack of grounding for some of the information provided (see the end of Example 3). One of the causes of this error, as in the case of this example, were hallucinations coming from the decoder. This probably stems from an imperfect delexicalization of some of the training samples. Another error type is missing information, i.e., not asking for information that is required (Example 2). In this case, the system uses information from a different domain without the user explicitly confirming this. The most probable cause is that most instances of the training data carry over the information. A third common error type is bad domain (see the beginning of Example 3). Here, the system does not react to the user's request for a different search (hotels instead of attractions in the example). This might be caused by a less frequent wording of the request in the given context, and usually gets resolved by rephrasing the request.

Example 2: A dialogue where the system did not ask for necessary information. The user previously booked a train trip for one person (cf. the belief state in the first turn shown). When they went on to book a restaurant, the system took over the number of persons from the train ride and did not ask again.

  [...]
  user:   I want to make a booking at sitar tandoori
  belief: restaurant {name = sitar tandoori}
          train {arrive by = 18:00, day = sunday, departure = cambridge, destination = london liverpool street, people = 1}
  system: What day would you like the booking for?
  user:   the same day, 13:30
  belief: restaurant {day = sunday, name = sitar tandoori, people = 1, time = 13:30}
          train {arrive by = 18:00, day = sunday, departure = cambridge, destination = london liverpool street, people = 1}
  system: Booking was successful. The table will be reserved for 15 minutes. Reference number is: 00000200.

Example 3: Dialogue sample with a bad domain focus and a hallucination. First, the system ignores that the user switched from searching for a theatre to searching for a hotel. After accepting the new domain, the system replies with hotels "in the north" even though the user did not specify an area.

  user:   I'm looking for a theatre in Cambridge
  belief: attraction {type = theatre}
  system: There are 5 theatres in Cambridge. Is there an area of town you prefer?
  user:   no I don't care
  belief: attraction {type = theatre}
  system: I recommend Adc Theatre in the centre. Would you like more information?
  user:   no, I'm also looking for some place to stay
  belief: attraction {type = theatre}
  system: I have 5 theatres in the centre. Do you have a price range?
  user:   I'm looking for a place to stay
  belief: attraction {type = theatre}
  system: There are 5 theatres in the centre. Do you have a preference for area?
  user:   I'm looking for a hotel
  belief: hotel {type = hotel}
          attraction {type = theatre}
  system: There are 9 hotels in the north. Do you have a price range?

Apart from the errors, many examples of non-trivial positive behavior were found during the analysis.

user:
I am looking for a cheap chinse restaurant. belief: restaurant {food = chinese, price range = cheap} system:
I have 4 cheap chinese restaurants. What area oftown would you like to dine in? user: downtown belief: restaurant {area = centre, food = chinese, pricerange = cheap} system:
I have 3 cheap chinese restaurants in the centre.Would you like to book at Charlie Chan?...
Example 4: A snippet from challenging yet successfuldialogue. The system deals with input noise (chinse) aswell as with paraphrases (downtown). ysis. As illustrated in Example 4, the model is ableto deal with paraphrases and is robust to a certainamount of noise in the data. Specifically, it han-dles typos, various time formats, etc. Interactionbetween domains is also successful in most cases –the system is able to resolve references to anotherdomain’s belief state (e.g., make a booking for thesame group of people as done before in a differentvenue).
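The hallucinated-value errors discussed above are attributed to imperfect delexicalization, i.e., the replacement of concrete database values in training responses with slot placeholders before training. A minimal sketch of such a step is shown below; the function name, placeholder format, and matching strategy are illustrative, not the paper's actual implementation:

```python
import re

def delexicalize(response, slot_values):
    """Replace concrete slot values in a system response with placeholders.

    Matching is case-insensitive and proceeds longest-value-first, so that
    e.g. "london liverpool street" is replaced before "london" could be.
    """
    for slot, value in sorted(slot_values.items(), key=lambda kv: -len(kv[1])):
        pattern = re.compile(re.escape(value), re.IGNORECASE)
        response = pattern.sub(f"[{slot}]", response)
    return response
```

For instance, `delexicalize("Booking was successful. Reference number is: 00000200.", {"reference": "00000200"})` yields `"Booking was successful. Reference number is: [reference]."`; a value that slips through such string matching (a typo, a reformatted time) stays lexicalized in the training data, which is exactly the kind of imperfection that can teach the decoder to produce ungrounded values.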
We tested many variants of our method with different combinations of our proposed system's components to evaluate their contributions. The results are presented in Table 6. Namely, we are interested in the following components: (1) the unlikelihood loss, (2) the auxiliary tasks, (3) the data augmentation, (4) the modified consistency task, and (5) unclean data filtering.

We can see that all proposed contributions which are part of our final system have a positive effect on system performance with respect to the primary metrics. Removing either the pre-training or the back-translations decreases the BLEU score and, more importantly, the success rates. Furthermore, we notice the positive effect of using our improved consistency detection task over the one used in SOLOIST (Peng et al., 2020), which in turn scores better than no consistency detection.

Removing either the unlikelihood loss or training on all data as opposed to only "clean" samples clearly reduces performance. However, we did not notice any increase in performance when the user intent prediction and system action prediction auxiliary tasks were used (cf. Section 3.2). The reason for this behavior could be that the model learns to represent the actions well enough implicitly, without the need for these additional objectives. However, these tasks are not a part of our final model.

                              MultiWOZ 2.1          ConvLab 2
                                                                     inform
method                       inf   suc  BLEU    comp   suc  book    P     R    F1
AuGPT w. all auxiliary      91.1  71.4  16.8    88.7  59.2  86.0  64.6  81.1  69.9

Table 6: Ablation study (inf = inform, suc = success, book = book rate; see Section 4.4 for a description of the metrics). The model version with the best ConvLab 2 success rate is chosen as our best model. Variants are denoted with their respective modifications compared to the default: "w/o. unlikelihood" = the unlikelihood loss was not used for training; "w/o. clean" uses all training samples as opposed to only the ones consistent with the database; "w/o. pre-training" = the additional Taskmaster-1 and Schema-Guided datasets were not used for training; "all auxiliary" = using two additional auxiliary tasks (see the Method section for details); "w/o. consistency" = the dialogue consistency task is not used; "old consistency" refers to the consistency task as defined by Peng et al. (2020) (see Section 3.2 for details).
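The unlikelihood loss ablated above (Welleck et al., 2020; Li et al., 2020) augments the standard language-modeling loss with a term that pushes down the probability of selected "negative" tokens. A minimal pure-Python sketch of the per-token objective follows; how the negative candidates are chosen is left abstract here (our actual selection follows Section 3.2):

```python
import math

def unlikelihood_loss(probs, target_ids, negative_ids):
    """Token-level likelihood + unlikelihood objective.

    probs:        per-step next-token distributions, shape (T, vocab_size)
    target_ids:   gold token index at each step  -> maximize log p(token)
    negative_ids: per-step lists of token indices to penalize
                  -> maximize log(1 - p(token))
    """
    total = 0.0
    for t, (dist, gold) in enumerate(zip(probs, target_ids)):
        total += -math.log(dist[gold])                    # standard LM (likelihood) term
        for tok in negative_ids[t]:                       # unlikelihood term: penalize
            total += -math.log(1.0 - dist[tok] + 1e-12)   # probability mass on negatives
    return total / len(probs)
```

With empty negative-candidate lists the function reduces to the ordinary mean negative log-likelihood, which is why the ablation "w/o. unlikelihood" simply corresponds to dropping the second term.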
We present a dialogue modeling pipeline based on the pre-trained GPT-2 language model. AuGPT uses modified training objectives and employs data augmentation to increase the diversity of generated utterances. Our experiments show that the proposed approach performs better than state-of-the-art baselines in a multi-domain scenario on the MultiWOZ dataset. We also run a series of ablation experiments to assess the individual contributions of the modifications. According to our detailed ablation study, training data augmentation using back-translation via multiple languages and a modified auxiliary training objective for dialogue consistency detection are the features that contribute most to our system's performance. Additionally, we perform a qualitative analysis of the outputs to give a better insight into our model's behavior.

In the future, we plan to construct a latent representation of the belief state and optimize it jointly with the language model. We will replace the deterministic lexicalization with a trainable alternative, and possibly even integrate the database module into the model. To improve the transfer to new domains, we will learn a domain embedding and optimize it jointly with the model, unifying all datasets.
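The back-translation augmentation credited above paraphrases training utterances by round-trip machine translation through pivot languages (cf. Edunov et al., 2018; Federmann et al., 2019). The sketch below shows the overall shape of such a pipeline; it assumes a generic `translate(text, src, tgt)` interface standing in for a pair of MT models per pivot, and the pivot set shown is a placeholder rather than the paper's actual language list:

```python
def back_translate(utterance, translate, pivot="fr"):
    """Paraphrase one utterance by translating it to a pivot language and back.

    `translate` is an assumed MT interface: translate(text, src, tgt) -> text.
    """
    return translate(translate(utterance, src="en", tgt=pivot), src=pivot, tgt="en")

def augment_dialogues(dialogues, translate, pivots=("fr", "de", "cs")):
    """Produce one paraphrased copy of every dialogue per pivot language.

    `dialogues` is a list of dialogues, each a list of utterance strings.
    """
    augmented = []
    for turns in dialogues:
        for pivot in pivots:
            augmented.append([back_translate(u, translate, pivot) for u in turns])
    return augmented
```

Each round trip yields a paraphrase that (ideally) preserves the annotated belief state, so the augmented copies can reuse the original annotations; this is what makes back-translation attractive for increasing surface diversity without extra labeling.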
Acknowledgments
This work was supported by the Charles University GAUK grant No. 302120, the SVV project No. 260575, and the Charles University project PRIMUS/19/SCI/10. Jonáš Kulhánek was supported by the European Regional Development Fund under the project Robotics for Industry 4.0 (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000470). Additional computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
References
Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 – how can I help you? Towards the use of pre-trained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT), pages 15–22, Hong Kong.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5016–5026, Brussels, Belgium.

Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4516–4525, Hong Kong.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186, Minneapolis, MN, USA.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 EMNLP, pages 489–500, Brussels, Belgium.

Henry Elder, Robert Burke, Alexander O'Connor, and Jennifer Foster. 2020. Shape of Synth to Come: Why We Should Use Synthetic Data for English Surface Realization. In Proceedings of the 58th ACL, pages 7465–7471, Online.

Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pages 422–428, Marseille, France.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken, Germany.

Christian Federmann, Oussama Elachqar, and Chris Quirk. 2019. Multilingual whispers: Generating paraphrases with translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 17–26, Hong Kong.

Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural Approaches to Conversational AI: Question Answering, Task-oriented Dialogues and Social Chatbots. Volume 13 of Foundations and Trends in Information Retrieval. Now Publishers.

Claudio Greco, Barbara Plank, Raquel Fernández, and Raffaella Bernardi. 2019. Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3601–3605, Florence, Italy.

Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, and Kee-Eung Kim. 2020. End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 583–592, Online.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations (ICLR), Online.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.

Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. In Proceedings of the 55th ACL, pages 146–157, Vancouver, Canada.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Melbourne, Australia.

Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don't Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4715–4728, Online.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Dominik Macháček, Jonáš Kratochvíl, Sangeet Sagar, Matúš Žilinec, Ondřej Bojar, Thai-Son Nguyen, Felix Schneider, Philip Williams, and Yuekun Yao. 2020. ELITR non-native speech translation at IWSLT 2020. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT), pages 200–208, Online.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1468–1478, Melbourne, Australia.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2018. Mixed precision training. In Proceedings of the ICLR, Vancouver, Canada.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, USA.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 8024–8035, Vancouver, Canada.

Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020. SOLOIST: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8689–8696, New York, NY, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 86–96, Berlin, Germany.

Lei Shu, Piero Molino, Mahdi Namazifar, Hu Xu, Bing Liu, Huaixiu Zheng, and Gokhan Tur. 2019. Flexibly-structured model for task-oriented dialogues. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 178–187, Stockholm, Sweden.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27 (NeurIPS), pages 3104–3112, Montréal, Canada.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, Long Beach, CA, USA.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In Proceedings of the International Conference on Learning Representations (ICLR), Online.

Haoyang Wen, Yijia Liu, Wanxiang Che, Libo Qin, and Ting Liu. 2018. Sequence-to-sequence learning for task-oriented dialogue with dialogue state representation. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3781–3792, Santa Fe, NM, USA.

Tsung-Hsien Wen, Milica Gašić, Dongho Kim, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGdial), pages 275–284, Prague, Czechia.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2016. Conditional generation and snapshot learning in neural dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2153–2162, Austin, TX, USA.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 438–449, Valencia, Spain.

Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogues. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 917–929, Online.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

S. Young, M. Gašić, B. Thomson, and J. D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Yichi Zhang, Zhijian Ou, Huixin Wang, and Junlan Feng. 2020a. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. In Proceedings of the 2020 EMNLP, pages 9207–9219, Online.

Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020b. Task-oriented dialog systems that consider multiple appropriate responses under the same context. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9604–9611, New York, NY, USA.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020c. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 270–278, Online.

Qi Zhu, Zheng Zhang, Yan Fang, Xiang Li, Ryuichi Takanobu, Jinchao Li, Baolin Peng, Jianfeng Gao, Xiaoyan Zhu, and Minlie Huang. 2020. ConvLab-2: An open-source toolkit for building, evaluating, and diagnosing dialogue systems. In