How to Build User Simulators to Train RL-based Dialog Systems
Weiyan Shi*, Kun Qian*, Xuewei Wang and Zhou Yu
University of California, Davis; Carnegie Mellon University
{wyshi, kunqian, joyu}@ucdavis.edu, [email protected]

Abstract
User simulators are essential for training reinforcement learning (RL) based dialog models. The performance of the simulator directly impacts the RL policy. However, building a good user simulator that models real user behaviors is challenging. We propose a method of standardizing user simulator building that can be used by the community to fairly compare dialog system quality using the same set of user simulators. We present implementations of six user simulators trained with different dialog planning and generation methods. We then calculate a set of automatic metrics to evaluate the quality of these simulators both directly and indirectly. We also ask human users to assess the simulators directly and indirectly by rating the simulated dialogs and interacting with the trained systems. This paper presents a comprehensive evaluation framework for user simulator study and provides a better understanding of the pros and cons of different user simulators, as well as their impacts on the trained systems.

Introduction

Reinforcement learning has gained more and more attention in dialog system training because it treats dialog planning as a sequential decision problem and focuses on long-term rewards (Su et al., 2017). However, RL requires interaction with the environment, and obtaining real human users to interact with the system is both time-consuming and labor-intensive. Therefore, building user simulators to interact with the system before deployment to real users becomes an economical choice (Williams et al., 2017; Li et al., 2016). But the performance of the user simulator has a direct impact on the trained RL policy.

* Equal contribution. The code and data are released at https://github.com/wyshi/user-simulator.

Such an intertwined relation between user simulator and dialog system makes the whole process a "chicken and egg" problem. This naturally leads to the question of how different user simulators impact system performance, and how to build appropriate user simulators for different tasks.

In previous RL-based dialog system literature, researchers reported their systems' performance, such as success rate, on their specific user simulators (Liu and Lane, 2017; Shi and Yu, 2018), but the details of the user simulators are not sufficient to reproduce the results. User simulator quality can vary in multiple aspects, which could lead to unfair comparison between different trained systems. For instance, RL systems built with more complicated user simulators will have lower scores on the automatic metrics, compared to those built using simpler user simulators. However, the good performance may not necessarily transfer when the system is tested by real users. In fact, models that have a low score but are trained on better simulators may actually perform better in real situations because they have experienced more complex scenarios. In order to obtain a fairer comparison between systems, we propose a set of standardized user simulators. We pick the popular restaurant search task from MultiWOZ (Budzianowski et al., 2018) and analyze the pros and cons of different user simulator building methods.

The potential gap between automatic metrics and real human evaluation also makes user simulators hard to build. The ideal evaluator of a dialog system should be its end users. But as stated before, obtaining real user evaluation is time-consuming. Therefore, many automatic metrics have been studied to evaluate a user simulator (Pietquin and Hastie, 2013; Kobsa, 1994) from different perspectives.
However, we do not know how these automatic metrics correlate with human satisfaction. In this paper, we ask human users to both rate the dialogs generated by the user simulators and interact with the dialog systems trained with them, in order to quantify the gap between the automatic metrics and human evaluation.

This paper presents three contributions: first, we annotate the user dialog acts in the restaurant domain in MultiWOZ 2.0; second, we build multiple user simulators in the standard restaurant search domain and publish the code to facilitate further development of RL-based dialog system training algorithms; third, we perform comprehensive evaluations on the user simulators and trained RL systems, including automatic evaluation, human rating of simulated dialogs, human interaction with trained systems, and a cross study between simulators and systems, to measure the gap between automatic dialog completion metrics and real human satisfaction, and provide meaningful insights on how to develop better user simulators.

Related Work

One line of prior user simulator research focuses on the agenda-based user simulator (ABUS) (Schatzmann et al., 2006, 2007; Schatzmann and Young, 2009; Li et al., 2016), which is most commonly used in task-oriented dialog systems. An agenda-based user simulator is built on hand-crafted rules according to an agenda provided at the beginning of a dialog. This mechanism makes it easy to explicitly integrate context and agenda into dialog planning. Schatzmann and Young (2009) presented a statistical hidden agenda user simulator, tested it against real users, and showed that a superior result in automatic metrics does not guarantee a better result in the real situation. Li et al. (2016) proposed an agenda-based user simulator in the movie domain and published a generic user simulator building framework. In this work, we build a similar agenda-based user simulator in the restaurant domain, and focus more on analyzing the effects of using different user simulators.

However, it is not feasible to build agenda-based user simulators for more complex tasks without an explicit agenda. Therefore, people have also studied how to build user simulators in a data-driven fashion. He et al. (2018) fit a supervised-learning-based user simulator to perform RL training on a negotiation task. Asri et al. (2016) developed a seq2seq model for user simulation in the restaurant search domain, which took the dialog context into consideration without the help of external data structures. Kreyssig et al. (2018) introduced the Neural User Simulator (NUS), which learns user behaviour from a corpus and generates natural language directly instead of semantic output such as dialog acts. However, unlike in ABUS, how to infuse the agenda into dialog planning and ensure consistency in data-driven user simulators has been an enduring challenge. In this paper, we present a supervised-learning-based user simulator and integrate the agenda into the policy learning. Furthermore, we compare such a data-driven method with its agenda-based counterpart.

Another line of user simulator work treats the user simulator itself as a dialog system, and trains the simulator together with the RL system iteratively (Liu and Lane, 2017; Shah et al., 2018). Shah et al.
(2018) proposed the Machines Talking To Machines (M2M) framework to bootstrap both user and system agents with dialog self-play. Liu and Lane (2017) presented a method for iterative dialog policy training and addressed the problem of building reliable simulators by optimizing the system and the user jointly. But such an iterative approach requires extra effort in setting up RL and designing a reward for the user simulator, which may result in the two agents exploiting the task, and leads to numerical instability.

Another challenging research question is how user simulator performance can be evaluated (Schatzmann et al., 2005; Ai and Litman, 2011a,b; Engelbrecht et al., 2009; Hashimoto et al., 2019). Pietquin and Hastie (2013) conducted a comprehensive survey of metrics that have been used to assess user simulators, such as perplexity and BLEU score (Papineni et al., 2002). However, some of the metrics are designed specifically for language generation evaluation, and as Liu et al. (2016) pointed out, these automatic metrics barely correlate with human evaluation. Therefore, Ai and Litman (2011a) involved human judges to directly rate simulated dialogs. Schatzmann and Young (2009) asked humans to interact with the trained systems to perform indirect human evaluation. Schatzmann et al. (2005) proposed cross-model evaluation to compare user simulators, since human involvement is expensive. We combine the existing evaluation methods and conduct comprehensive assessments to measure the gap between automatic metrics and human satisfaction.
Dataset
We choose the restaurant domain in MultiWOZ 2.0 (Budzianowski et al., 2018) as our dataset, because it is the most classic domain in task-oriented dialog systems. The system's task is to help users find restaurants, provide restaurant information and make reservations. There are a total of 1,310 dialogs annotated with informable slots (e.g. food, area) that narrow down the restaurant choice, and requestable slots (e.g. address, phone) that track users' detailed requests about the restaurant. But because the original task in MultiWOZ was to model the system response, it only contains dialog act annotation on the system side, not on the user side. To build user simulators, we need to model user behavior, and therefore we annotate the user-side dialog acts in the restaurant domain of MultiWOZ. Two human expert annotators analyze the data and agree on a set of seven user dialog acts
(UserActs): "inform restaurant type", "inform restaurant type change", "anything else", "request restaurant info", "make reservation", "make reservation change time", and "goodbye". Because the data is relatively clean and constrained in domain, the annotation is performed by designing regular expressions first, with the results cleaned by human annotators afterwards. We manually checked 10% of the data (around 500 utterances) and the accuracy of the automatic annotations is 94%. These annotated user dialog acts serve as the foundation of the user simulator action space UserActs. The annotated data is released to facilitate user simulator study.
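To make the regex-then-manual-cleanup procedure concrete, here is a minimal sketch of what such rule-based pre-annotation might look like. The patterns and the fallback act are illustrative assumptions, not the released annotation rules.

```python
import re

# Hypothetical regex patterns for a few of the seven UserActs;
# the released annotation rules are more detailed.
PATTERNS = {
    "make reservation": re.compile(r"\b(book|reserve|reservation|table for)\b", re.I),
    "request restaurant info": re.compile(r"\b(address|phone|postcode)\b", re.I),
    "goodbye": re.compile(r"\b(bye|goodbye|thank)\b", re.I),
}

def pre_annotate(utterance: str) -> str:
    """First-pass dialog act labeling; human annotators clean up afterwards."""
    for act, pattern in PATTERNS.items():
        if pattern.search(utterance):
            return act
    # Assumed default act; in the paper, ~10% of labels were manually verified.
    return "inform restaurant type"

print(pre_annotate("Can I book a table for 5 people at 12:15?"))
# -> make reservation
```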
According to Li et al. (2016), user simulator building eventually boils down to two important tasks: building 1) a dialog manager (DM) (Henderson et al., 2014; Cuayáhuitl et al., 2015; Young et al., 2013) that governs the simulator's next move; and 2) a natural language generation (NLG) module (Tran and Nguyen, 2017; Dušek and Jurčíček, 2016) that translates the semantic output of the dialog manager into natural language. The user simulator can adopt either an agenda-based approach or a model-based approach for the dialog manager. For NLG, the user simulator can use the dialog act to select pre-defined templates, retrieve user utterances from previously collected dialogs, or generate the surface-form utterance directly with a pre-trained language model (Jung et al., 2009).

The dialog manager module ensures the intrinsic logical consistency of the user simulator, while the NLG module controls the extrinsic language fluency. DM and NLG play equally important roles in user simulator design and must go hand-in-hand to imitate user behaviours. Therefore, we propose to test different combinations of DM and NLG methods to answer the question of how to build the best user simulator.

In task-oriented dialog systems, the user simulator's task is to complete a pre-defined goal by interacting with the system. MultiWOZ provides detailed goals for each dialog, which serve as the goal database. These goals consist of sub-tasks, such as requesting information or making a reservation. An example goal is:
"You're looking for an Italian restaurant in the moderate price range in the east. Once you find the restaurant, you want to book a table for 5 people at 12:15 on Monday. Make sure you get the reference number." During initial RL experiments, we find that, similar to supervised learning, data imbalance in the goals impacts reinforcement learning in the simulated task-oriented dialog setting. We find that 2/3 of the goals contain the sub-task "ask info" and the remaining 1/3 are about "make reservation". Because the user simulators are all goal-driven, the RL policy is only able to experience the "reservation" scenario 1/3 of the time on average, which results in the model favoring the "ask info" scenario, especially in the early training stage. This further misleads the policy (Su et al., 2017). Therefore, we augment the goal set with more "make reservation" sub-tasks from MultiWOZ to make the distribution of "make reservation" and "ask info" sub-tasks even. This augmented goal set with a more even distribution serves as our goal database. We randomly sample a goal from the goal database during training. A user goal defines the agenda the user simulator needs to follow, so we use "goal" and "agenda" interchangeably in this paper.
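A minimal sketch of the goal balancing and sampling step described above, assuming a simplified goal representation. Note that the paper draws additional reservation goals from MultiWOZ, whereas the sketch simply oversamples for brevity.

```python
import random

# Hypothetical goal representation: each goal lists its sub-tasks.
goals = [
    {"subtasks": ["ask info"]},
    {"subtasks": ["ask info"]},
    {"subtasks": ["make reservation"]},
]

def balance_goals(goals):
    """Oversample 'make reservation' goals until the two sub-tasks are even.
    (The paper instead adds real reservation goals from MultiWOZ.)"""
    info = [g for g in goals if "ask info" in g["subtasks"]]
    resv = [g for g in goals if "make reservation" in g["subtasks"]]
    while len(resv) < len(info):
        resv.append(random.choice(resv))  # duplicate a reservation goal
    return info + resv

goal_db = balance_goals(goals)
goal = random.choice(goal_db)  # sampled at the start of each RL episode
```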
Agenda-based

We employ the traditional agenda-based, stack-like user simulator (Schatzmann and Young, 2009; Li et al., 2016), where the dialog manager chooses a dialog act among the user dialog act set
UserActs described in the Dataset section. The dialog act transitions are governed by hand-crafted rules and probabilities based on the initial goal. For example, after the system makes a recommendation, the user can go on to the next sub-task, or ask if there is another option. Fig. 1 shows a typical agenda. Because the restaurant task is user-initiated, the agenda-based simulator's first dialog act is always "inform restaurant type".
Figure 1: An example user agenda for the restaurant task.

The dialog history is managed with the user dialog state by pushing and popping important slots to ensure consistency. Although the restaurant search task is simple, designing an agenda-based simulator for the task is non-trivial, because there are many corner cases to be handled. However, the advantage of building an agenda-based simulator is that it does not require thousands of annotated dialogs.
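A minimal sketch of the stack-like agenda mechanism, assuming a simplified goal format; the hand-crafted rules and transition probabilities in the released code are considerably more involved.

```python
import random

class AgendaSimulator:
    """Stack-like agenda DM: pop the next user act, push follow-ups as rules fire."""

    def __init__(self, goal):
        # Build the agenda bottom-up; the restaurant task is user-initiated,
        # so "inform restaurant type" always sits on top of the stack.
        self.stack = ["goodbye"]
        if "make reservation" in goal["subtasks"]:
            self.stack.append("make reservation")
        if "ask info" in goal["subtasks"]:
            self.stack.append("request restaurant info")
        self.stack.append("inform restaurant type")

    def next_act(self, system_act):
        # Example hand-crafted rule with an assumed probability: after a
        # recommendation, sometimes ask for another option instead of
        # moving on to the next sub-task.
        if system_act == "present restaurant search result" and random.random() < 0.3:
            return "anything else"
        return self.stack.pop() if self.stack else "goodbye"

sim = AgendaSimulator({"subtasks": ["ask info", "make reservation"]})
print(sim.next_act(None))  # -> inform restaurant type
```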
Model-based
It requires specific human expertise to design rules for agenda-based user simulators (compared to more easily accessible annotation), and the process is both labor-intensive and error-prone. Moreover, for complicated tasks such as negotiation, it is not practical to design rules in the policy (He et al., 2018). Therefore, we explore the possibility of building the dialog manager with supervised learning methods. Compared to agenda-based simulators, which require special expert knowledge, supervised learning methods require less expert involvement. We utilize Sequicity (Lei et al., 2018) to construct the model-based user simulator. Sequicity is a simple seq2seq dialog system model with copy and attention mechanisms. It also uses a belief span to track the dialog states. For example, inform: {Name: "Caffee Uno"; Phone: "01223448620"} records the information that the system offers, and this would be kept in the belief span throughout the dialog, while request: {"food", "price range"} means the system is asking for more information from the user to locate a restaurant, which would be removed from the belief span once the request is fulfilled. There are 13 types of system dialog acts. To focus on the valuable information for fitting the model, we combine these dialog acts into 5 categories: {"inform", "request", "book inform", "select", "recommend"}. Similarly, we define three types of user goals, "inform", "request" and "book", and record them in a belief span, denoted as G. At time t, we first update the belief span with a seq2seq model, based on the current system response R_t, the previous belief state B_{t-1} and the previous user utterance U_{t-1}:

B_t = seq2seq(B_{t-1}, U_{t-1}, R_t)

Then we incorporate the user goal and the context above to generate the current user utterance:

U_t = seq2seq(B_{t-1}, U_{t-1}, R_t | B_t, G)

As illustrated in Fig. 2, we build a GRU-based encoder for B_{t-1}, U_{t-1}, R_t and the goal G. We then decode the current belief span and the user utterance separately. Both decoders are one-layer GRUs with copy and attention mechanisms. To evaluate the dialog manager alone, we also modify Sequicity's second decoder to generate the dialog act A_t instead of the utterance:

A_t = seq2seq(B_{t-1}, U_{t-1}, R_t | B_t, G)

Figure 2: The end-to-end simulator and the dialog act predictor share most of their model, except the decoders: one set of decoder parameters is used for the dialog act predictor and another for the sentence decoder.
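A compact sketch of the two-stage decoding interface described above, assuming PyTorch-style modules. It mirrors the equations (encode the context, decode the belief span B_t, then decode U_t conditioned on B_t and the goal G) rather than reproducing the released Sequicity code; copy and attention mechanisms and autoregressive decoding are omitted for brevity.

```python
import torch
import torch.nn as nn

class UserSimSeq2Seq(nn.Module):
    """Sketch of the model-based simulator: a shared GRU encoder,
    one decoder for the belief span and one for the user utterance."""

    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.belief_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.utter_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids, goal_ids, belief_ids):
        # Encode [B_{t-1}; U_{t-1}; R_t] plus the goal G as one sequence.
        enc_in = torch.cat([context_ids, goal_ids], dim=1)
        _, h = self.encoder(self.embed(enc_in))
        # Decode the new belief span B_t (teacher forcing for brevity;
        # the real model adds copy and attention mechanisms).
        b_out, h_b = self.belief_decoder(self.embed(belief_ids), h)
        # Decode U_t conditioned on the belief decoder's final state,
        # which carries B_t and G information forward.
        u_out, _ = self.utter_decoder(self.embed(belief_ids), h_b)
        return self.out(b_out), self.out(u_out)

model = UserSimSeq2Seq(vocab_size=1000)
ctx = torch.randint(0, 1000, (1, 20))   # B_{t-1}, U_{t-1}, R_t tokens
goal = torch.randint(0, 1000, (1, 8))   # goal G tokens
bel = torch.randint(0, 1000, (1, 5))    # belief span target tokens
belief_logits, utter_logits = model(ctx, goal, bel)
```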
NLG

Dialog-act-based NLG is formalized as U_t = M(A_t), where A_t is the dialog act selected by the dialog manager and U_t is the generated user utterance. We describe three different dialog-act-based NLG methods.

Template

The template method requires human experts to write a variety of delexicalized templates for each dialog act. By searching the templates, it translates A_t into human-readable utterances. The quality of the templates has a direct impact on the NLG quality.
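A minimal sketch of delexicalized template NLG; the template strings and slot names below are assumed for illustration, and the released templates are more varied.

```python
import random

# Hypothetical delexicalized templates per user dialog act.
TEMPLATES = {
    "inform restaurant type": [
        "I am looking for a {food} restaurant in the {area}.",
        "Can you find me a {food} place in the {area}?",
    ],
    "make reservation": [
        "Please book a table for {people} people at {time} on {day}.",
    ],
}

def template_nlg(act, slots):
    """Pick a template for the act and fill in the slot values."""
    return random.choice(TEMPLATES[act]).format(**slots)

print(template_nlg("inform restaurant type", {"food": "italian", "area": "east"}))
# -> e.g. "I am looking for a italian restaurant in the east."
```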
Retrieval

The template method suffers from limited vocabulary size and language diversity. An alternative is retrieval-based NLG (Wu et al., 2016; Hu et al., 2014). The model retrieves user utterances that have A_t as their dialog act in the training dataset. Following He et al. (2018), we represent the context by a TF-IDF weighted bag-of-words vector and compute the similarity score between each candidate's context vector and the current context vector to retrieve U_t.
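A minimal sketch of the TF-IDF retrieval step with scikit-learn, assuming a toy candidate pool already filtered to utterances annotated with the target dialog act A_t.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical candidates: (dialog context, user utterance) pairs whose
# utterance carries the target dialog act A_t.
candidates = [
    ("what food would you like", "i want cheap italian food in the east"),
    ("i found caffe uno", "can i have the address and phone number"),
]

def retrieve(current_context, candidates):
    """Return the candidate utterance whose context is most similar
    to the current context under TF-IDF cosine similarity."""
    contexts = [c for c, _ in candidates]
    vec = TfidfVectorizer().fit(contexts + [current_context])
    cand_vecs = vec.transform(contexts)
    cur_vec = vec.transform([current_context])
    scores = cosine_similarity(cur_vec, cand_vecs)[0]
    return candidates[scores.argmax()][1]

print(retrieve("what kind of food do you want", candidates))
# -> "i want cheap italian food in the east"
```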
Generation

The generation method (Wen et al., 2015a,b) does not need expert involvement to write templates, but requires dialog act annotation similar to the retrieval method. We build a vanilla seq2seq model (Sutskever et al., 2014) using the annotated data, adding A_t to the input.

RL Dialog Systems

Traditionally, hand-crafted dialog acts plus slot values are used as the discrete action space in RL training (Raux et al., 2005). The dialog action space can also be on the word level. However, previous work shows degenerate behavior when using a word-level action space (Zhao et al., 2019), as it is difficult to design a reward. We choose the first approach and use a discrete action space with six system dialog acts: "ask restaurant type", "present restaurant search result", "provide restaurant info", "ask reservation info", "inform reservation result", and "goodbye". Simple action masks are applied to avoid impossible actions, such as making a reservation before presenting a restaurant.

We use a 2-layer bidirectional GRU with 200 hidden units to train an NLU module. For simplicity, we use the template-based method in the system's NLG module. We use the policy gradient method to train the dialog systems (Williams, 1992); a minimal sketch of the training loop is shown at the end of this section. During RL training, a discount factor of 0.9 is applied to all the experiences, with the maximum number of turns set to 10. We also apply the ε-greedy exploration strategy (Tokic, 2010). All the RL systems use the same RL state representation, which consists of the traditional dialog state and a word count vector of the current utterance. The same reward function is used: +1 for task success, a penalty for task failure, and a small negative reward for each additional turn to encourage the RL policy to finish the task faster rather than slower. We freeze the RL model every 1,000 episodes and test for 100 dialogs to calculate the average success rate, shown in Fig. 3.

Besides the RL systems, we also build a rule-based system, Rule-System, which serves as a third-party system to interact with each user simulator and generate simulated dialogs for human evaluation. The only difference between
Rule-System and the RL-based systems is their policy selection module: Rule-System uses hand-crafted rules, while the RL-based systems use an RL policy.
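A minimal sketch of the policy gradient (REINFORCE) loop with an action mask and ε-greedy exploration, as referenced above. The environment interface (reset/step), the dummy environment, its reward values, and all hyperparameters except those stated in the text (discount 0.9, at most 10 turns) are assumptions for illustration.

```python
import random
import torch
import torch.nn as nn

ACTS = ["ask restaurant type", "present restaurant search result",
        "provide restaurant info", "ask reservation info",
        "inform reservation result", "goodbye"]

policy = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, len(ACTS)))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
GAMMA, EPS, MAX_TURNS = 0.9, 0.1, 10  # EPS is an assumed exploration rate

def select_action(state, mask):
    """Mask impossible acts, then sample ε-greedily from the policy."""
    logits = policy(state).masked_fill(~mask, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if random.random() < EPS:  # ε-greedy exploration over legal acts
        a = random.choice(mask.nonzero().flatten().tolist())
    else:
        a = torch.multinomial(probs, 1).item()
    return a, torch.log(probs[a])

def run_episode(env):
    """One simulated dialog; collect log-probs and per-turn rewards."""
    state, mask = env.reset()
    log_probs, rewards = [], []
    for _ in range(MAX_TURNS):
        a, logp = select_action(state, mask)
        state, mask, reward, done = env.step(ACTS[a])
        log_probs.append(logp)
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    """REINFORCE: discounted returns weighted by action log-probabilities."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optim.zero_grad()
    loss.backward()
    optim.step()

class DummyEnv:
    """Stand-in for a user simulator: illustrative rewards, all acts legal."""
    def reset(self):
        return torch.randn(32), torch.ones(len(ACTS), dtype=torch.bool)
    def step(self, act):
        done = act == "goodbye"
        reward = 1.0 if done else -0.1  # illustrative per-turn penalty
        return torch.randn(32), torch.ones(len(ACTS), dtype=torch.bool), reward, done

env = DummyEnv()
log_probs, rewards = run_episode(env)
reinforce_update(log_probs, rewards)
```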
Evaluation

Evaluating the quality of a user simulator is an enduring challenge. Traditionally, we report direct automatic metrics of the user simulator, such as perplexity (Ai and Litman, 2011b; Pietquin and Hastie, 2013). In addition, the performance of the RL system trained with a specific simulator gives us an indirect assessment of the user simulator's ability to imitate user behaviours.

The ultimate goal of the user simulator is to build a task-oriented RL system that serves real users, so the ideal evaluation should be conducted by humans. We first asked humans to read the simulated dialogs and rate each user simulator's performance directly. We then hired Amazon Mechanical Turk (AMT) workers to interact with the RL systems trained with different simulators and rate their performance. We also performed a cross study between user simulators and systems trained with different simulators, to see whether a system's performance transfers to a different simulated setting. Finally, we measure the gap between the automatic metrics and human evaluation scores, and share insights on how to evaluate user simulators effectively.
Perplexity (Direct)

Perplexity measures the language generation quality of the user simulator. The results are shown in Table 1. For each simulator model, we generate 200 dialogs with the third-party
Rule-System and train a trigram language model with the data. We then test the model and compute the perplexity with 5,000 user utterances sampled from MultiWOZ. Although the perplexity of the retrieval models is the highest among both agenda-based and SL-based simulators, these models also possess the biggest vocabulary and the longest average utterance length. Another common automatic metric used to assess a language model is BLEU, but since this is a user simulator study and there are no ground-truth references, the BLEU score is not available.

Simulator                  NLU    DM      NLG         PPL    Vocab  Utt    Hu.Fl  Hu.Co  Hu.Go  Hu.Div  Hu.All
Agenda-Template (AgenT)    SL     Agenda  Template    10.32  180    9.65   4.07   4.56   4.88   2.4     4.50
Agenda-Retrieval (AgenR)   SL     Agenda  Retrieval   33.90  383    11.61  –      –      –      –       –
Agenda-Generation (AgenG)  SL     Agenda  Generation  –      159    8.07   3.32   3.92   4.64   2.5     3.36
SL-Template (SLT)          SL     SL      Template    9.32   192    9.83   –      –      –      –       –
SL-Retrieval (SLR)         SL     SL      Retrieval   29.36  346    11.06  4.40   3.99   4.88   –       –
SL-End2End (SLE)           Joint  Joint   Joint       –      –      –      –      –      –      –       –

Table 1: Automatic metrics and human evaluation scores of different user simulators. Automatic metrics include perplexity per word (PPL), vocabulary size (Vocab) and average utterance length (Utt). Human evaluation metrics include sentence fluency (Hu.Fl), coherence (Hu.Co), goal adherence (Hu.Go), language diversity (Hu.Div) and overall score (Hu.All). Dashes mark values that are not recoverable.
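A minimal sketch of the trigram perplexity computation described above, using NLTK. Lidstone smoothing with gamma = 0.1 is an assumed choice, since the paper does not specify a smoothing method; the toy utterances stand in for the 200 simulated dialogs and 5,000 held-out MultiWOZ utterances.

```python
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Tokenized utterances generated by one simulator against Rule-System.
train_utts = [["i", "want", "cheap", "italian", "food"],
              ["book", "a", "table", "for", "five", "people"]]
# Held-out user utterances sampled from MultiWOZ.
test_utts = [["i", "want", "italian", "food"]]

# Fit a smoothed trigram LM on the simulator's output.
train, vocab = padded_everygram_pipeline(3, train_utts)
lm = Lidstone(0.1, 3)  # gamma=0.1 is an assumed smoothing value
lm.fit(train, vocab)

# Per-word perplexity of the LM on the held-out utterances.
test_trigrams = [list(ngrams(pad_both_ends(u, n=3), 3)) for u in test_utts]
ppls = [lm.perplexity(tri) for tri in test_trigrams]
print(sum(ppls) / len(ppls))
```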
Vocabulary Size (Direct)

Vocabulary size is a simple and straightforward metric that measures language diversity. As expected, the retrieval-based models have the biggest vocabularies. However, Agenda-Generation has the smallest vocabulary. A possible reason is that we adopt a vanilla greedy seq2seq model, which tends to generate the most frequent words. SL-End2End in Table 1 trains the NLU, DM and NLG jointly, and therefore its vocabulary size is slightly larger than that of the template-based methods.
Average Utterance Length (Direct)

Average utterance length is another simple metric for assessing the language model and language diversity. As expected, the retrieval-based methods do best here, but SL-End2End also does a good job of generating long sentences.
Success Rate (Indirect)

The success rate is the most commonly used metric for reporting RL dialog system performance, and it also reflects certain behaviours of the user simulator. The success rates obtained with the various user simulators are shown in Fig. 3. SL-based user simulators converge faster than agenda-based simulators. This can be explained by the observation that SL tries to capture the major paths in the original data and counts those as success, instead of exploring all the possible paths as the agenda-based simulators do. In general, retrieval-based simulators converge more slowly than the other NLG methods because the retrieval-based approach has a bigger vocabulary.
Figure 3: Average success rate during RL training.
Direct Human Evaluation

The direct evaluation of the user simulators is conducted by asking 10 volunteers to read the simulated dialogs between the different simulators and the third-party Rule-System. Each of the 10 volunteers rates five randomly selected dialogs generated by each model, and the average of the 50 ratings is reported as the final human evaluation score. The Rule-System is built solely on hand-crafted rules with no knowledge of any of the simulators, and is therefore fair to all of them. We design four metrics to assess the user simulators' behaviour from multiple aspects. The results are shown in Table 1.
Fluency focuses on language quality, such as grammar, within each utterance. Agenda-Template (AgenT) and SL-Template (SLT) receive the two highest fluency scores because the templates are all written by humans.
Coherence focuses on the quality of the relations between different turns within one dialog. The SL-Template (SLT) simulator performs best in coherence, while agenda-based simulators are in general slightly more coherent than SL-based ones.
Goal Adherence focuses on the relation between the goal and the simulator-generated utterances. Both agenda-based and SL-based simulators generally stick to the goal, with the exception of
System     Solved Ratio  Satisfaction  Efficiency  Naturalness  Rule-likeness  Dialog Length  Auto Success
Sys-AgenT  0.814         –             –           –            –              –              –
Sys-AgenR  –             –             –           –            –              –              –
Sys-AgenG  0.904         –             –           –            –              –              –
Sys-SLT    0.781         –             –           –            –              –              –
Sys-SLR    0.823         –             –           –            –              –              –
Sys-SLE    0.607         –             –           –            –              –              –

Table 2: Human evaluation of RL systems trained with different simulators on AMT, with 95% confidence intervals. Each row represents one RL system, e.g. Sys-AgenT means the RL system trained with the AgenT simulator. Dashes mark values (and confidence intervals) that are not recoverable.
SL-End2End (SLE). This may be because SLE trains all the modules together and thus has more difficulty infusing the goal.
Diversity focuses on the language diversity between simulated dialogs of the same simulator; each simulator is given one diversity score. Retrieval-based methods surpass the other methods in diversity but are not as good in fluency, while template-based methods excel in fluency but, as expected, suffer in diversity. Generative methods suffer from producing generic sentences, as mentioned before.
Overall. We ask the humans to rate the overall simulator quality. Except for SL-End2End, SL-based methods are favoured by humans over agenda-based methods. Agenda-Template is comparable to the SL-based simulators because of its fluent responses and carefully designed policy.
Indirect Human Evaluation

The ultimate goal of user simulator building is to train better system policies. Automatic metrics such as success rate give us a sense of a system's performance, but the ultimate evaluation should be conducted with humans, so that we know the real performance of the system policy when deployed.

Motivated by this, we tested the RL systems trained with the various user simulators on Amazon Mechanical Turk (AMT) (Miller et al., 2017), asking Turkers to interact with each system and give their opinions. Each system is tested with 100 Turkers. The results are shown in Table 2. The AMT interface is shown in the Appendix.

We also list two common automatic metrics in Table 2 for comparison. The "Dialog Length" column shows the average length of the Turker-machine dialogs, which reflects the system's efficiency to some extent. The "Auto Success" column represents the automatic success rate: the convergent success rate from Fig. 3, measured by freezing the policy and testing against the user simulator for 100 episodes. Previous approaches have used these two automatic metrics to evaluate a system's efficiency and success (Williams et al., 2017; Shi et al., 2019), but we find that, due to individual user differences, such automatic metrics have relatively large variances and don't always correlate with the efficiency perceived by humans. For example, some users tend to provide all slots in one turn, while others provide slots only when necessary; some users even go off-script and ask about restaurants not mentioned in the goal. Therefore, we caution against relying solely on automatic metrics to represent user opinion; the best way is to ask the users directly for their thoughts on the system's performance, from the multiple aspects that follow.
Solved Ratio. Each Turker is given a goal at the beginning, the same as in the simulated setting. At the end of the dialog, we ask the Turker whether the system has solved his or her problem. There are three possible answers: "Yes" is coded as 1, "Partially solved" as 0.5, and "No" as 0. Sys-AgenR, the system trained with the Agenda-Retrieval (AgenR) simulator, received the highest score, better than Sys-AgenT trained with the AgenT simulator. This is reasonable because, through retrieval, the AgenR simulator presents more language diversity to the system during training, and when interacting with a real user, systems that can handle more language variation do better. The SL-based simulators received relatively low scores; further investigation of the cause is presented in the discussion section.

"Auto Success" has been used to reflect the solved ratio previously. However, it is not necessarily correlated with the user-rated solved ratio. For example, Sys-AgenG's Auto Success rate is much higher than Sys-AgenR's, but users think these two systems perform the same in terms of
Solved Ratio.

Simulator \ System  Sys-AgenT  Sys-AgenR  Sys-AgenG  Sys-SLT  Sys-SLR  Sys-SLE
AgenT               0.975      0.960      0.790      0.305    0.300    0.200
AgenR               0.540      0.900      0.785      0.230    0.230    0.235
AgenG               0.725      0.975      0.950      0.355    0.300    0.200
SLT                 0.985      0.985      0.985      0.990    0.965    0.730
SLR                 0.925      0.975      0.965      0.975    0.935    0.630
SLE                 0.770      0.820      0.815      0.840    0.705    0.770
Average             0.820      0.936      0.882      0.616    0.573    0.461

Table 3: Cross study results. Each row represents one user simulator, and each column represents one RL system trained with a specific simulator. Each entry shows the average success rate obtained by having the user simulator interact with the RL system for 200 episodes.
Satisfaction. Solving the user's problem doesn't always lead to user satisfaction; satisfaction also depends on the system's efficiency and latency. Therefore, besides the Solved Ratio, we also directly ask Turkers how satisfied they are with the system. Among all systems, Sys-AgenR received the highest score. The positive correlation between "Solved Ratio" and "Satisfaction" in Table 2 also indicates that the automatic task completion rate is a good estimator of user satisfaction.
Efficiency. We directly ask Turkers how efficient the system is in solving their problems, since dialog length doesn't always correlate with system efficiency. For example, although the dialog lengths of Sys-AgenG and Sys-SLE are similar, users rated Sys-AgenG the most efficient system and Sys-SLE the least efficient one. Again, we suspect this is caused by different user communication patterns, where some users prefer providing slots across multiple turns while others prefer providing all slots in one turn.
Naturalness. We ask the Turkers to rate the naturalness of the system responses. All the systems share the same template-based NLG module designed by human experts, so there shouldn't be a significant difference in the naturalness scores. However, according to Table 2, the naturalness score seems to correlate with the overall system performance. A possible reason is that end users rate the system's naturalness by its overall performance rather than by the system responses alone. When the dialog policy is bad, even if the NLG module generates natural system responses, users still think the system is unnatural. This suggests that when designing dialog systems, the NLG and policy selection modules should go hand-in-hand in evaluation.
Rule-likeness. We also ask the users to what extent they think the system is designed with hand-crafted rules, on a scale from 1 to 5, where 5 means heavily handcrafted. Among all the models, Sys-SLT, trained with the SL-Template simulator, receives the lowest score, meaning it is the least rigid system. This is because SL-Template's dialog manager is learned with supervised learning and is less rigid than the agenda-based dialog policy, which in turn leads to less rigid behaviour in the trained dialog system.
Cross Study

From the last column in Table 2, we find that although the automatic success rates claimed by the user simulators used to train the systems are all relatively high, the high automatic success rate doesn't transfer to real human satisfaction. In our setting, each simulator can be viewed as a new user with different communication habits; therefore, we are curious to see whether performance transfers to a different simulator when we test the RL system trained with simulator A against simulator B. Table 3 shows a cross study between the six user simulators and the six systems trained with different simulators, where we fix the systems, have each simulator interact with each system for 200 episodes, and calculate the average success rate. The diagonal should reflect the "Auto Success" column in Table 2, but since the 200 episodes are random and "Auto Success" is the convergent success rate, the exact numbers won't be the same.

The last row in Table 3 shows the average success rate of each system across user simulators. There are some interesting findings. 1) Sys-AgenR, trained with the Agenda-Retrieval simulator, has the best average success rate, which agrees with the human evaluation on MTurk. 2) A common practice for comparing RL systems S_1, ..., S_n is to fix one user simulator U and then compare the success rates of S_1, ..., S_n on U. However, looking at the fifth row for the SL-Retrieval simulator, it prefers Sys-SLT (0.975) over Sys-AgenG (0.965), even though the average performance of Sys-AgenG (0.882) is actually better than that of Sys-SLT (0.616) in the last row. This suggests that when we want to compare two systems but don't have the resources for human evaluation, instead of solely comparing their success rates on one simulator, we should build different types of user simulators and test the systems against multiple simulators to get a more holistic view. 3) The diagonal of the table is usually the highest, meaning that the RL policy does a good job of optimizing for its own simulator but may not generalize to other user simulators. For example, the upper-right corner performs the worst because the systems trained with SL-based simulators are worse in general, for reasons we discuss below.

Figure 4: Correlation between sentence fluency and perplexity, and between sentence diversity and perplexity. Green circles represent the human-rated scores, while the blue squares are the average scores over different raters.

To test whether the automatic metrics reflect human evaluation, we compute the correlation between perplexity (PPL) and human-evaluated fluency (Hu.Fl), and the correlation between perplexity and the human-evaluated diversity score (Hu.Div); the former is negative but not statistically significant, while the latter is positive and significant. We also visualize these metrics in Fig. 4. This shows that, as an automatic metric, perplexity is a good estimator of language diversity but not of language fluency.
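A brief sketch of the correlation computation with SciPy. The score vectors below are placeholders; in the paper, the per-simulator values from Table 1 are used.

```python
from scipy.stats import pearsonr

# Placeholder per-simulator metric vectors standing in for Table 1 columns
# (PPL vs. human fluency Hu.Fl and human diversity Hu.Div).
ppl = [10.32, 33.90, 9.32, 29.36]
hu_fl = [4.2, 3.4, 4.4, 3.9]    # illustrative fluency scores
hu_div = [2.4, 3.1, 2.2, 3.0]   # illustrative diversity scores

r_fl, p_fl = pearsonr(ppl, hu_fl)
r_div, p_div = pearsonr(ppl, hu_div)
print(f"PPL vs fluency:   r={r_fl:.2f}, p={p_fl:.3f}")
print(f"PPL vs diversity: r={r_div:.2f}, p={p_div:.3f}")
```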
Discussion

SL-based simulators perform relatively worse than agenda-based simulators when interacting with real users. We investigate the data and find this is caused by the SL-based simulators not exploring all possible paths. We plot the dialog act distributions on simulated conversations in Fig. 5.

Figure 5: Dialog act distribution comparison. Act1 to Act7 correspond to the seven user dialog acts: "inform restaurant type", "inform restaurant type change", "ask info", "make reservation", "make reservation change time", "anything else", and "goodbye".

For example, in the agenda-based simulators, we explicitly have a rule for the dialog act "anything else" (Act6 in Fig. 5), but no such rule exists in the SL-based simulators. Therefore, the RL model experiences the "anything else" scenario more with agenda-based simulators than with SL-based simulators. When real users ask about "anything else", RL systems trained with agenda-based simulators have more experience handling such cases, compared to systems trained with SL-based simulators.

In this paper, we perform in-depth studies on the restaurant domain, as it is the most well-studied domain in task-oriented dialog systems, yet there is still no standard user simulator available. In the future we plan to include more domains using various domain-adaptive methods (Qian and Yu, 2019; Tran and Nguyen, 2018; Gašić et al., 2017) to support multi-domain dialog system research, and to incorporate our work into more standardized dialog system platforms (Lee et al., 2019).

Conclusion

User simulators are essential components in training RL-based dialog systems. However, building user simulators is not a trivial task. In this paper, we surveyed different ways to build user simulators at the levels of the dialog manager and NLG, and analyzed the pros and cons of each method. Further, we evaluated each simulator with automatic metrics and human evaluations, both directly and indirectly, and shared insights on better user simulator building based on comprehensive analysis.

References
Hua Ai and Diane Litman. 2011a. Assessing user simulation for dialog systems using human judges and automatic evaluation measures. Natural Language Engineering, 17(4):511–540.

Hua Ai and Diane Litman. 2011b. Comparing user simulations for dialogue strategy learning. ACM Transactions on Speech and Language Processing (TSLP), 7(3):9.

Layla El Asri, Jing He, and Kaheer Suleman. 2016. A sequence-to-sequence model for user simulation in spoken dialogue systems. arXiv preprint arXiv:1607.00070.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Heriberto Cuayáhuitl, Simon Keizer, and Oliver Lemon. 2015. Strategic dialogue management via deep reinforcement learning. arXiv preprint arXiv:1511.08099.

Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. arXiv preprint arXiv:1606.05491.

Klaus-Peter Engelbrecht, Michael Quade, and Sebastian Möller. 2009. Analysis of a new simulation approach to dialog system evaluation. Speech Communication, 51(12):1234–1252.

Milica Gašić, Nikola Mrkšić, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2017. Dialogue manager domain adaptation using gaussian process reinforcement learning. Computer Speech & Language, 45:552–569.

Tatsunori B Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.

He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. 2018. Decoupling strategy and generation in negotiation dialogues. arXiv preprint arXiv:1808.09637.

Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The third dialog state tracking challenge. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 324–329. IEEE.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Sangkeun Jung, Cheongjae Lee, Kyungduk Kim, Minwoo Jeong, and Gary Geunbae Lee. 2009. Data-driven user simulation for automated evaluation of spoken dialog systems. Computer Speech & Language, 23(4):479–509.

Alfred Kobsa. 1994. User modeling and user-adapted interaction. In CHI Conference Companion, pages 415–416.

Florian Kreyssig, Inigo Casanueva, Pawel Budzianowski, and Milica Gasic. 2018. Neural user simulation for corpus-based policy optimisation for spoken dialogue systems. arXiv preprint arXiv:1805.06966.

Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, et al. 2019. ConvLab: Multi-domain end-to-end dialog system platform. arXiv preprint arXiv:1904.08637.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447.

Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.

Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE.

Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Olivier Pietquin and Helen Hastie. 2013. A survey on metrics for the evaluation of user simulations. The Knowledge Engineering Review, 28(1):59–73.

Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520.

Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Ninth European Conference on Speech Communication and Technology.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics.

Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21(2):97–126.

Jost Schatzmann and Steve Young. 2009. The hidden agenda user simulation model. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):733–747.

Jost Schatzmann, Matthew N Stuttle, Karl Weilhammer, and Steve Young. 2005. Effects of the user model on simulation-based learning of dialogue strategies. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pages 220–225. IEEE.

Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.

Weiyan Shi and Zhou Yu. 2018. Sentiment adaptive end-to-end dialog systems. arXiv preprint arXiv:1804.10731.

Weiyan Shi, Tiancheng Zhao, and Zhou Yu. 2019. Unsupervised dialog structure learning. arXiv preprint arXiv:1904.03736.

Pei-Hao Su, Pawel Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Michel Tokic. 2010. Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pages 203–210. Springer.

Van-Khanh Tran and Le-Minh Nguyen. 2017. Semantic refinement GRU-based neural language generation for spoken dialogue systems. In International Conference of the Pacific Association for Computational Linguistics, pages 63–75. Springer.

Van-Khanh Tran and Le-Minh Nguyen. 2018. Adversarial domain adaptation for variational neural language generation in dialogue systems. arXiv preprint arXiv:1808.02586.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. arXiv preprint arXiv:1508.01755.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2016. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858.

Appendices
A.1 Generated Dialog Example
Figure 6: Generated dialog example