A Probabilistic End-To-End Task-Oriented Dialog Model with Latent Belief States towards Semi-Supervised Learning
Yichi Zhang, Zhijian Ou∗, Min Hu, Junlan Feng
Speech Processing and Machine Intelligence Lab, Tsinghua University, Beijing, China
China Mobile Research Institute, Beijing, China
[email protected], [email protected]

∗ Corresponding author. Code available at https://github.com/thu-spmi/LABES

Abstract
Structured belief states are crucial for user goal tracking and database query in task-oriented dialog systems. However, training belief trackers often requires expensive turn-level annotations of every user utterance. In this paper we aim at alleviating the reliance on belief state labels in building end-to-end dialog systems, by leveraging unlabeled dialog data towards semi-supervised learning. We propose a probabilistic dialog model, called the LAtent BElief State (LABES) model, where belief states are represented as discrete latent variables and jointly modeled with system responses given user inputs. Such latent variable modeling enables us to develop semi-supervised learning under the principled variational learning framework. Furthermore, we introduce LABES-S2S, which is a copy-augmented Seq2Seq model instantiation of LABES. In supervised experiments, LABES-S2S obtains strong results on three benchmark datasets of different scales. In utilizing unlabeled dialog data, semi-supervised LABES-S2S significantly outperforms both supervised-only and semi-supervised baselines. Remarkably, we can reduce the annotation demands to 50% without performance loss on MultiWOZ.

Introduction

Belief tracking (also known as dialog state tracking) is an important component in task-oriented dialog systems. The system tracks user goals through multiple dialog turns, i.e. infers structured belief states expressed in terms of slots and values (e.g. in Figure 1), to query an external database (Henderson et al., 2014).
Figure 1: The cues for inferring belief states from user inputs and system responses. The system response reveals the belief state either directly in the form of word repetition (red), or indirectly in the form of the database query result (green) determined by the belief state.
Different belief tracking models have been proposed in recent years, either trained independently (Mrkšić et al., 2017; Ren et al., 2018; Wu et al., 2019) or within end-to-end (E2E) trainable dialog systems (Wen et al., 2017a,b; Liu and Lane, 2017; Lei et al., 2018; Shu et al., 2019; Liang et al., 2020; Zhang et al., 2020).

Existing belief trackers mainly depend on supervised learning with human annotations of belief states for every user utterance. However, collecting these turn-level annotations is labor-intensive and time-consuming, and often requires domain knowledge to identify slots correctly. Building E2E trainable dialog systems, called E2E dialog systems for short, further magnifies the demand for increased amounts of labeled data (Gao et al., 2020; Zhang et al., 2020).

Notably, there are often easily-available unlabeled dialog data, such as conversations between customers and trained human agents accumulated in real-world customer services. In this paper, we are interested in reducing the reliance on belief state annotations in building E2E task-oriented dialog systems, by leveraging unlabeled dialog data towards semi-supervised learning. Intuitively, the dialog data, even unlabeled, can be used to enhance the performance of belief tracking and thus benefit the whole dialog system, because there are cues from user inputs and system responses which reveal the belief states, as shown in Figure 1.

Technically, we propose a latent variable model for task-oriented dialogs, called the LAtent BElief State (LABES) dialog model. The model generally consists of multiple (e.g. $T$) turns of user inputs $u_{1:T}$ and system responses $r_{1:T}$, which are observations, and belief states $b_{1:T}$, which are latent variables. Basically, LABES is a conditional generative model of belief states and system responses given user inputs, i.e. $p_\theta(b_{1:T}, r_{1:T} \mid u_{1:T})$. Once built, the model can be used to infer belief states and generate responses. More importantly, such latent variable modeling enables us to develop semi-supervised learning on a mix of labeled and unlabeled data under the principled variational learning framework (Kingma and Welling, 2014; Sohn et al., 2015). In this manner, we hope that the LABES model can exploit the cues for belief tracking from user inputs and system responses. Furthermore, we develop LABES-S2S, which is a specific model instantiation of LABES, employing copy-augmented Seq2Seq (Gu et al., 2016) based conditional distributions in implementing $p_\theta(b_{1:T}, r_{1:T} \mid u_{1:T})$.

We show the advantage of our model compared to other E2E task-oriented dialog models, and demonstrate the effectiveness of our semi-supervised learning scheme, on three benchmark task-oriented datasets of various scales and domains: CamRest676 (Wen et al., 2017b), In-Car (Eric et al., 2017) and MultiWOZ (Budzianowski et al., 2018). In supervised experiments, LABES-S2S obtains state-of-the-art results on CamRest676 and In-Car, and outperforms all existing models that do not leverage large pretrained language models on MultiWOZ. In utilizing unlabeled dialog data, semi-supervised LABES-S2S significantly outperforms both supervised-only and prior semi-supervised baselines. Remarkably, we can reduce the annotation requirements to 50% without performance loss on MultiWOZ, which is equivalent to saving around 30,000 annotations.

Related Work

On use of unlabeled data for belief tracking.
Classic methods such as self-training (Rosenberg et al., 2005), also known as pseudo-labeling (Lee, 2013), have been applied to belief tracking (Tseng et al., 2019). Recently, the pretraining-and-fine-tuning approach has received increasing interest (Heck et al., 2020; Peng et al., 2020; Hosseini-Asl et al., 2020). The generative model based semi-supervised learning approach, which blends unsupervised and supervised learning, has also been studied (Wen et al., 2017a; Jin et al., 2018). Notably, the two approaches are orthogonal and could be jointly used. Our work belongs to the second approach, aiming to leverage unlabeled dialog data beyond using general text corpora. A related work close to ours is SEDST (Jin et al., 2018), which also performs semi-supervised learning for belief tracking. Remarkably, our model is optimized under the principled variational learning framework, while SEDST is trained with an ad-hoc combination of posterior regularization and auto-encoding. Experiments in §6.2 show the superiority of our model over SEDST. See Appendix A for differences in model structure between SEDST and LABES-S2S.
End-to-end task-oriented dialog systems.
Our model belongs to the family of E2E task-oriented dialog models (Wen et al., 2017a,b; Li et al., 2017; Lei et al., 2018; Mehri et al., 2019; Wu et al., 2019; Peng et al., 2020; Hosseini-Asl et al., 2020). We borrow some elements from the Sequicity (Lei et al., 2018) model, such as representing the belief state as a natural language sequence (a text span), and using copy-augmented Seq2Seq learning (Gu et al., 2016). But compared to Sequicity and all its follow-up works (Jin et al., 2018; Shu et al., 2019; Zhang et al., 2020; Liang et al., 2020), a distinguishing feature of our LABES-S2S model is that the transition between belief states across turns and the dependency between system responses and belief states are statistically well modeled. This new design results in a completely different graphical model structure, which enables rigorous probabilistic variational learning. See Appendix A for details.
Latent variable models for dialog.
Latent variables have been used in dialog models. For non-task-oriented dialogs, latent variables are introduced to improve diversity (Serban et al., 2017; Zhao et al., 2017; Gao et al., 2019), control language styles (Gao et al., 2019) or incorporate knowledge (Kim et al., 2020) in dialog generation. For task-oriented dialogs, there are prior studies which use latent internal states via hidden Markov models (Zhai and Williams, 2014) or variational autoencoders (Shi et al., 2019) to discover the underlying dialog structures. In Wen et al. (2017a) and Zhao et al. (2019), dialog acts are treated as latent variables, together with variational learning and reinforcement learning, aiming to improve response generation. To the best of our knowledge, we are the first to model belief states as discrete latent variables and to learn these structured representations via the variational principle.
The LABES Model

We first introduce LABES as a general dialog modeling framework in this section. For dialog turn $t$, let $u_t$ be the user utterance, $b_t$ the current belief state after observing $u_t$, and $r_t$ the corresponding system response. In addition, denote by $c_t$ the dialog context or model input at turn $t$; in this work $c_t \triangleq \{r_{t-1}, u_t\}$. Note that $c_t$ can include longer dialog history, depending on the specific implementation. Let $d_t$ be the database query result, which can be obtained through a database-lookup operation given the belief state $b_t$.

Our goal is to model the joint distribution of belief states and system responses given the user inputs, $p_\theta(b_{1:T}, r_{1:T} \mid u_{1:T})$, where $T$ is the total number of turns and $\theta$ denotes the model parameters. In LABES, we assume the joint distribution follows the directed probabilistic graphical model illustrated in Figure 2, which can be formulated as:

$$p_\theta(b_{1:T}, r_{1:T} \mid u_{1:T}) = p_\theta(b_{1:T} \mid u_{1:T})\, p_\theta(r_{1:T} \mid b_{1:T}, u_{1:T}) = \prod_{t=1}^{T} p_\theta(b_t \mid b_{t-1}, c_t)\, p_\theta(r_t \mid c_t, b_t, d_t)$$

where $b_0$ is an empty state. Intuitively, we refer to the conditional distribution $p_\theta(b_t \mid b_{t-1}, c_t)$ as the belief state decoder, and to $p_\theta(r_t \mid c_t, b_t, d_t)$ as the response decoder in the above decomposition. Note that the probability $p(d_t \mid b_t)$ is omitted, as the database result $d_t$ is deterministically obtained given $b_t$. Thus the system response can be generated as a three-step process: first predict the belief state $b_t$, then use $b_t$ to query the database and obtain $d_t$, and finally generate the system response $r_t$ based on all the conditions.
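For concreteness, here is a minimal runnable Python sketch of this three-step process. The decoders below are toy stand-ins for the learned conditional distributions $p_\theta(b_t \mid b_{t-1}, c_t)$ and $p_\theta(r_t \mid c_t, b_t, d_t)$, not the actual LABES implementation.

```python
def belief_state_decoder(b_prev, c_t):
    # Toy stand-in for p_theta(b_t | b_{t-1}, c_t): carry over the previous
    # state and extract "slot=value" tokens from the user utterance.
    b_t = dict(b_prev)
    for token in c_t[1].split():
        if "=" in token:
            slot, value = token.split("=", 1)
            b_t[slot] = value
    return b_t

def db_lookup(b_t, database):
    # d_t is deterministic given b_t, which is why p(d_t | b_t) is omitted.
    return [e for e in database if all(e.get(s) == v for s, v in b_t.items())]

def response_decoder(c_t, b_t, d_t):
    # Toy stand-in for p_theta(r_t | c_t, b_t, d_t).
    return f"I found {len(d_t)} matching options."

def generate_turn(b_prev, r_prev, u_t, database):
    c_t = (r_prev, u_t)                        # context c_t = {r_{t-1}, u_t}
    b_t = belief_state_decoder(b_prev, c_t)    # step 1: predict belief state
    d_t = db_lookup(b_t, database)             # step 2: query the database
    r_t = response_decoder(c_t, b_t, d_t)      # step 3: generate the response
    return b_t, r_t

db = [{"food": "thai", "area": "south"}, {"food": "thai", "area": "north"}]
print(generate_turn({}, "", "food=thai area=south", db))
# ({'food': 'thai', 'area': 'south'}, 'I found 1 matching options.')
```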
Unsupervised Learning

We introduce an inference model $q_\phi(b_t \mid b_{t-1}, c_t, r_t)$ (described by the dashed arrows in Figure 2) to approximate the true posterior $p_\theta(b_t \mid b_{t-1}, c_t, r_t)$.

Figure 2: The probabilistic graphical model of LABES. Solid arrows describe the conditional generative model $p_\theta$, and dashed arrows describe the approximate posterior model $q_\phi$. Note that we set $c_t \triangleq \{r_{t-1}, u_t\}$ in our model, and omit $u_t$ from the graph for simplicity.

Then we can derive the variational evidence lower bound (ELBO) for unsupervised learning as follows:

$$\mathcal{J}_{un} = \mathbb{E}_{q_\phi(b_{1:T})}\left[\log \frac{p_\theta(b_{1:T}, r_{1:T} \mid u_{1:T})}{q_\phi(b_{1:T} \mid u_{1:T}, r_{1:T})}\right] = \sum_{t=1}^{T}\left(\mathbb{E}_{q_\phi(b_{1:t})}\left[\log p_\theta(r_t \mid c_t, b_t, d_t)\right] - \alpha\,\mathrm{KL}\left[q_\phi(b_t \mid b_{t-1}, c_t, r_t)\,\|\,p_\theta(b_t \mid b_{t-1}, c_t)\right]\right)$$

where $q_\phi(b_{1:T}) \triangleq \prod_{t=1}^{T} q_\phi(b_t \mid b_{t-1}, c_t, r_t)$, and $\alpha$ is a hyperparameter to control the weight of the KL term, as introduced by Higgins et al. (2017).

Optimizing $\mathcal{J}_{un}$ requires drawing posterior belief state samples $b_{1:T} \sim q_\phi(b_{1:T} \mid u_{1:T}, r_{1:T})$ to estimate the expectations. Here we use a sequential sampling strategy similar to Kim et al. (2020), where each $b_t$ sampled from $q_\phi(b_t \mid b_{t-1}, c_t, r_t)$ at turn $t$ is used as the condition to generate the next turn's belief state $b_{t+1}$. Calculating gradients with discrete latent variables is non-trivial; proposed methods include the score function estimator (Williams, 1992) and the categorical reparameterization trick (Jang et al., 2017). In this paper, we employ the simple Straight-Through estimator (Bengio et al., 2013), where the sampled discrete token indexes are used for the forward computation, and the continuous softmax probability of each token is used for the backward gradient calculation. Although the Straight-Through estimator is biased, we find it works quite well in our experiments, and therefore leave the exploration of other optimization methods as future work.
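As an illustration, below is a minimal PyTorch sketch of the Straight-Through estimator for one categorical latent token; it is a generic rendition of the trick, not the exact LABES code.

```python
import torch
import torch.nn.functional as F

def straight_through_sample(logits):
    """Forward: a hard one-hot sample. Backward: gradients flow through
    the softmax probabilities, as if the output were continuous."""
    probs = F.softmax(logits, dim=-1)
    index = torch.multinomial(probs, num_samples=1)            # sampled token id
    one_hot = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    # Forward value equals one_hot; the gradient path goes through probs.
    return one_hot + probs - probs.detach()

logits = torch.randn(2, 5, requires_grad=True)  # batch of 2, vocabulary of 5
sample = straight_through_sample(logits)
sample.sum().backward()                          # gradients reach `logits`
print(logits.grad.shape)                         # torch.Size([2, 5])
```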
Semi-Supervised Learning
When $b_t$ labels are available, we can easily train the generative model $p_\theta$ and the inference model $q_\phi$ via supervised maximum likelihood:

$$\mathcal{J}_{sup} = \sum_{t=1}^{T}\big[\log p_\theta(b_t \mid b_{t-1}, c_t) + \log p_\theta(r_t \mid c_t, b_t, d_t) + \log q_\phi(b_t \mid b_{t-1}, c_t, r_t)\big]$$

When a mix of labeled and unlabeled data is available, we perform semi-supervised learning using a combination of the supervised objective $\mathcal{J}_{sup}$ and the unsupervised objective $\mathcal{J}_{un}$. Specifically, we first pretrain $p_\theta$ and $q_\phi$ on the small-sized labeled data until convergence. Then we draw supervised and unsupervised minibatches from the labeled and unlabeled data and perform stochastic gradient ascent over $\mathcal{J}_{sup}$ and $\mathcal{J}_{un}$, respectively. We use supervised pretraining first because training $q_\phi(b_t \mid b_{t-1}, c_t, r_t)$ to correctly generate slot values and special outputs such as "dontcare" and end-of-sentence tokens as much as possible is important for improving sample efficiency in the subsequent semi-supervised learning.
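The following sketch shows this training scheme in PyTorch-style pseudocode; `supervised_loss` (returning $-\mathcal{J}_{sup}$ on a labeled batch) and `elbo_loss` (returning $-\mathcal{J}_{un}$ on an unlabeled batch) are hypothetical method names, and the model is assumed to be already pretrained on the labeled data.

```python
import itertools

def semi_supervised_train(model, labeled_loader, unlabeled_loader,
                          optimizer, num_steps):
    labeled = itertools.cycle(labeled_loader)
    unlabeled = itertools.cycle(unlabeled_loader)
    for _ in range(num_steps):
        # Supervised step: gradient ascent on J_sup over a labeled minibatch.
        loss = model.supervised_loss(next(labeled))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Unsupervised step: gradient ascent on the ELBO J_un over an
        # unlabeled minibatch, with belief states sampled sequentially
        # from q_phi and gradients passed via Straight-Through.
        loss = model.elbo_loss(next(unlabeled))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```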
LABES-S2S

In the above probabilistic dialog model LABES, the belief state decoder $p_\theta(b_t \mid b_{t-1}, c_t)$ and the response decoder $p_\theta(r_t \mid c_t, b_t, d_t)$ can be flexibly implemented. In this section we introduce LABES-S2S, an instantiation of the general LABES model based on copy-augmented Seq2Seq conditional distributions (Gu et al., 2016), which is shown in Figure 3(a) and described in the following. The responses are generated through two Seq2Seq processes: 1) decode the belief state given the dialog context and the last turn's belief state, and 2) decode the system response given the dialog context, the decoded belief state and the database query result.

Figure 3: (a) Overview of LABES-S2S. (b) The belief state decoder.

Belief State Decoder

The belief state decoder is implemented via a Seq2Seq process, as shown in Figure 3(b). Inspired by Shu et al. (2019), we use a single GRU decoder to decode the value for each informable slot separately, feeding the embedding of each slot name as the initial input. In the multi-domain setting, the domain name embedding is concatenated with the slot name embedding to distinguish slots with identical names in different domains (Wu et al., 2019).

We use two bi-directional GRUs (Cho et al., 2014) to encode the dialog context $c_t$ and the previous belief state $b_{t-1}$ into sequences of hidden vectors $h^{enc}_{c_t}$ and $h^{enc}_{b_{t-1}}$ respectively, which are the inputs to the belief state decoder. As there are multiple slots, and their values can also consist of multiple tokens, we denote the $i$-th token of slot $s$ by $b^{s,i}_t$. To decode each token $b^{s,i}_t$, we first compute an attention vector over the encoder vectors. Then the attention vector and the embedding of the last decoded token $e(b^{s,i-1}_t)$ are concatenated and fed into the decoder GRU to get the decoder hidden state $h^{dec}_{b^{s,i}_t}$, denoted as $h^{dec}_{s,i}$ for simplicity:

$$a^{s,i}_t = \mathrm{Attn}(h^{enc}_{c_t} \circ h^{enc}_{b_{t-1}},\, h^{dec}_{s,i-1})$$
$$h^{dec}_{s,i} = \mathrm{GRU}(a^{s,i}_t \circ e(b^{s,i-1}_t),\, h^{dec}_{s,i-1})$$
$$\hat{h}^{dec}_{s,i} = \mathrm{dropout}\big(h^{dec}_{s,i} \circ e(b^{s,i-1}_t)\big)$$

where $\circ$ denotes vector concatenation. We use the last hidden state of the dialog context encoder as $h^{dec}_{s,0}$, and the slot name embedding as $e(b^{s,0}_t)$. We reuse $e(b^{s,i-1}_t)$ in forming $\hat{h}^{dec}_{s,i}$ to give more emphasis to the slot name embedding, and add a dropout layer to reduce overfitting. $\hat{h}^{dec}_{s,i}$ is then used to compute a generative score $\psi_{gen}$ for each token $w$ in the vocabulary $V$, and a copy score $\psi_{cp}$ for words appearing in $c_t$ and $b_{t-1}$. Finally, these two scores are combined and normalized to form the final decoding probability:

$$\psi_{gen}(b^{s,i}_t = w) = v_w^T W_{gen} \hat{h}^{dec}_{s,i}, \quad w \in V$$
$$\psi_{cp}(b^{s,i}_t = x_j) = h^{enc\,T}_{x_j} W_{cp} \hat{h}^{dec}_{s,i}, \quad x_j \in c_t \cup b_{t-1}$$
$$p(b^{s,i}_t = w) = \frac{1}{Z}\Big(e^{\psi_{gen}(w)} + \sum_{j:\, x_j = w} e^{\psi_{cp}(x_j)}\Big)$$

where $W_{gen}$ and $W_{cp}$ are trainable parameters, $v_w$ is the one-hot representation of $w$, $x_j$ is the $j$-th token in $c_t \cup b_{t-1}$, and $Z$ is the normalization term. With the copy mechanism, it is easier for the model to extract words mentioned by the user and to keep the unchanged values from the previous belief state. Meanwhile, the decoder can also generate tokens that do not appear in the input sequences, e.g. the special token "dontcare" or end-of-sentence symbols. Since the decoding of each slot is independent of the others, all the slots can be decoded in parallel to speed up inference.

The posterior network $q_\phi(b_t \mid b_{t-1}, c_t, r_t)$ is constructed through a similar process, where the only difference is that the system response $r_t$ is also encoded and used as an additional input to the decoder. Note that the posterior network is separately parameterized with $\phi$.
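To illustrate the combined generation/copy distribution, here is a small self-contained PyTorch sketch with randomly initialized parameters; the shapes and names are illustrative only, not the actual LABES code.

```python
import torch

def copy_augmented_distribution(h_dec, h_enc, src_token_ids, W_gen, W_cp):
    """p(w) combining psi_gen over the vocabulary with psi_cp over the
    source tokens (c_t and b_{t-1}), under one shared normalizer Z.

    h_dec:         decoder state hat{h}^dec, shape (hidden,)
    h_enc:         encoder vectors of the source tokens, shape (src_len, hidden)
    src_token_ids: vocabulary ids of the source tokens, shape (src_len,)
    W_gen:         (vocab, hidden);  W_cp: (hidden, hidden)
    """
    psi_gen = W_gen @ h_dec                 # generation score for every w in V
    psi_cp = h_enc @ (W_cp @ h_dec)         # copy score for every source token
    scores = torch.exp(psi_gen)
    # Add the copy mass of every source position j with x_j = w to word w.
    scores = scores.scatter_add(0, src_token_ids, torch.exp(psi_cp))
    return scores / scores.sum()            # normalize by Z

V, H, L = 10, 4, 3
p = copy_augmented_distribution(torch.randn(H), torch.randn(L, H),
                                torch.tensor([2, 5, 2]),
                                torch.randn(V, H), torch.randn(H, H))
print(float(p.sum()))  # 1.0
```

Note how a word appearing several times in the source (token id 2 above) accumulates copy mass from every occurrence before the single shared normalization.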
Response Decoder

The response decoder is implemented via another Seq2Seq process. After obtaining the belief state $b_t$, we use it to query a database to find entities that meet the user's need, e.g. Thai restaurants in the south area. The query result $d_t$ is represented as a 5-dimensional one-hot vector indicating 0, 1, 2, 3 or more than 3 matched entities. We encode the current belief state $b_t$ into hidden vectors $h^{enc}_{b_t}$, in the same way as $b_{t-1}$ is encoded for the belief state decoder. Then for each token $r^i_t$ in the response, the decoder state $h^{dec}_{r_t,i}$ is computed as follows:

$$a^i_t = \mathrm{Attn}(h^{enc}_{c_t} \circ h^{enc}_{b_t},\, h^{dec}_{r_t,i-1})$$
$$h^{dec}_{r_t,i} = \mathrm{GRU}(a^i_t \circ e(r^{i-1}_t) \circ d_t,\, h^{dec}_{r_t,i-1})$$
$$\hat{h}^{dec}_{r_t,i} = h^{dec}_{r_t,i} \circ a^i_t \circ d_t$$

Note that dropout is not used for $\hat{h}^{dec}_{r_t,i}$, since in practice response generation is much less prone to overfitting than belief tracking. We omit the probability formulas because they are almost the same as in the belief state decoder, except that the copy source changes from $c_t \cup b_{t-1}$ to $c_t \cup b_t$.
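For example, the database feature $d_t$ can be built as below, a trivial sketch matching the 5-way bucketing described above:

```python
import torch

def db_query_vector(num_matches):
    # d_t: 5-dim one-hot indicating 0, 1, 2, 3, or more than 3 matches.
    bucket = min(num_matches, 4)   # index 4 stands for "> 3"
    d_t = torch.zeros(5)
    d_t[bucket] = 1.0
    return d_t

print(db_query_vector(0))  # tensor([1., 0., 0., 0., 0.])
print(db_query_vector(7))  # tensor([0., 0., 0., 0., 1.])
```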
Experimental Settings

Datasets

We evaluate the proposed model on three benchmark task-oriented dialog datasets: Cambridge Restaurant (CamRest676) (Wen et al., 2017b), Stanford In-Car Assistant (In-Car) (Eric et al., 2017) and MultiWOZ (Budzianowski et al., 2018), with 676/3031/10438 dialogs respectively. In particular, MultiWOZ is one of the most challenging datasets to date, given its multi-domain setting, complex ontology and diverse language styles. As there are some belief state annotation errors in MultiWOZ, we use the corrected version MultiWOZ 2.1 (Eric et al., 2019) in our experiments. See Appendix B for more detailed introductions and statistics.

Evaluation Metrics

We evaluate model performance under the end-to-end setting, i.e. the model needs to first predict belief states and then generate responses based on its own belief predictions. For evaluating belief tracking performance, we use the commonly used joint goal accuracy, which is the proportion of dialog turns where all slot values are correctly predicted. For evaluating response generation, we use BLEU (Papineni et al., 2002) to measure general language quality. The response quality towards task completion is measured by dataset-specific metrics to facilitate comparison with prior works. For CamRest676 and In-Car, we use Match and SuccF1 following Lei et al. (2018). For MultiWOZ, we use Inform and Success as in Budzianowski et al. (2018), and also a combined score computed as (Inform + Success) × 0.5 + BLEU as the overall response quality, as suggested by Mehri et al. (2019).
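As a sanity check of the formula, the combined score can be computed directly from the three metrics:

```python
def combined_score(inform, success, bleu):
    # MultiWOZ combined score: (Inform + Success) * 0.5 + BLEU
    return (inform + success) * 0.5 + bleu

# e.g. with our full-supervision MultiWOZ numbers from Table 2:
print(combined_score(76.89, 63.30, 17.92))  # 88.015, reported as 88.01
```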
Baselines

In our experiments, we compare our model to various dialog state tracking (DST) and end-to-end (E2E) baseline models. Recently, large-scale pretrained language models (LMs) such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) have been used to improve the performance of dialog models, though at the cost of tens-fold larger model sizes and computation. We distinguish them from lightweight models trained from scratch in our comparison.
Independent DST Models:
For CamRest676, we compare to StateNet (Ren et al., 2018) and TripPy (Heck et al., 2020), which are the SOTA models without and with BERT, respectively. For MultiWOZ, we compare to the BERT-free models TRADE (Wu et al., 2019), NADST (Le et al., 2020b) and CSFN-DST (Zhu et al., 2020), and to BERT-based models including TripPy, the BERT version of CSFN-DST, and DST-Picklist (Zhang et al., 2019).
E2E Models:
E2E models can be divided into three sub-categories. TSCP (Lei et al., 2018), SEDST (Jin et al., 2018), FSDM (Shu et al., 2019), MOSS (Liang et al., 2020) and DAMD (Zhang et al., 2020) are based on the copy-augmented Seq2Seq learning framework proposed by Lei et al. (2018). LIDM (Wen et al., 2017a), SFN (Mehri et al., 2019) and UniConv (Le et al., 2020a) are modularly designed, connected through neural states and trained end-to-end. SimpleTOD (Hosseini-Asl et al., 2020) and SOLOIST (Peng et al., 2020) are two recent models, which both use a single auto-regressive language model, initialized from GPT-2, to build the entire system.
Semi-Supervised Methods:
First, we compare with SEDST (Jin et al., 2018) for semi-supervised belief tracking performance. SEDST is also an E2E dialog model based on copy-augmented Seq2Seq learning (see Appendix A for more details). Over unlabeled dialog data, SEDST is trained through posterior regularization (PR), where a posterior network is used to model the posterior belief distribution given system responses, which then guides the learning of the prior belief tracker through minimizing the KL divergence between them. Second, based on the LABES-S2S model, we compare our variational learning (VL) method to a classic semi-supervised learning baseline, self-training (ST), which performs as its name suggests. Specifically, after supervised pretraining over the small-sized labeled dialogs, we run the system to generate pseudo belief states $b_t$ over unlabeled dialogs, and then train the response decoder $p_\theta(r_t \mid b_t, c_t, d_t)$ in a supervised manner. The gradients propagate through the discrete belief states via the Straight-Through gradient estimator (Bengio et al., 2013) over the computational graph, thus also adjusting the belief state decoder $p_\theta(b_t \mid b_{t-1}, c_t)$.
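A sketch of one Semi-ST update is given below; `decode_belief_states`, `db_lookup` and `response_loss` are hypothetical method names standing in for the corresponding parts of the model.

```python
def self_training_step(model, unlabeled_batch, optimizer):
    # Generate pseudo belief states from p_theta(b_t | b_{t-1}, c_t),
    # kept differentiable via the Straight-Through estimator.
    pseudo_b = model.decode_belief_states(unlabeled_batch,
                                          straight_through=True)
    d = model.db_lookup(pseudo_b)
    # Train the response decoder as if pseudo_b were gold labels:
    # loss = -log p_theta(r_t | c_t, b_t, d_t); its gradient also reaches
    # the belief state decoder through the discrete pseudo states.
    loss = model.response_loss(unlabeled_batch, pseudo_b, d)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```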
Experiments

In our experiments, we report both the best result and statistical results obtained from multiple independent runs with different random seeds; details are described in the caption of each table. The implementation details of our model are available in Appendix C. Results are organized to show the advantage of our proposed LABES-S2S model over existing models (§6.1) and the effectiveness of our semi-supervised learning method (§6.2).

Supervised Benchmark Results

We first train our LABES-S2S model under full supervision and compare with other baseline models on the benchmarks. The results are given in Table 1 and Table 2.

As shown in Table 1, LABES-S2S obtains a new SOTA joint goal accuracy on CamRest676 and the highest match scores on both the CamRest676 and In-Car datasets. Its BLEU scores are also above or close to those of the previous SOTA models. The relatively low SuccF1 is because, unlike other E2E models, LABES-S2S does not apply additional dialog act modeling and reinforcement fine-tuning to encourage slot token generation.

Table 1: Results on CamRest676 and In-Car. The model with the highest joint goal accuracy on the development set of CamRest676 is shown as the best result, as similarly reported in prior work. Statistical results are reported as the mean and standard deviation of 5 runs. ∗ denotes results obtained by our run of the open-source code.

Table 2 shows the MultiWOZ results.

Type  Model                                   Size    Pretrained LM   Joint Goal   Inform   Success   BLEU    Combined
DST   TRADE (Wu et al., 2019)                 10.2M   no              45.60        -        -         -       -
DST   NADST (Le et al., 2020b)                12.9M   no              49.04        -        -         -       -
DST   CSFN-DST (Zhu et al., 2020)             63M     no              50.81        -        -         -       -
E2E   TSCP (Lei et al., 2018)                 1.4M    no              37.53        66.41    45.32     15.54   71.41
E2E   SFN + RL (Mehri et al., 2019)           1.4M    no              21.17∗       ∗        ∗         ∗       ∗
E2E   LABES-S2S (statistical)                 3.8M    no              50.05        76.89    63.30     17.92   88.01
DST   CSFN-DST + BERT (Zhu et al., 2020)      115M    BERT            52.88        -        -         -       -
DST   DST-Picklist (Zhang et al., 2019)       220M    BERT            53.30        -        -         -       -
DST   TripPy (Heck et al., 2020)              110M    BERT            55.29        -        -         -       -
E2E   SimpleTOD (Hosseini-Asl et al., 2020)   81M     DistilGPT-2     56.45        85.00    70.05     15.23   92.98
E2E   SOLOIST (Peng et al., 2020)             117M    GPT-2           -            85.50    72.90     16.54   95.74
Table 2: Results on MultiWOZ 2.1. The model with the highest validation joint goal accuracy is shown as the best result, as similarly reported in prior work. The standard deviations for the statistical results are given in Table 5 in the appendix. ∗ denotes results obtained by our run of the open-source code.

Among all the models that do not use large pretrained LMs, LABES-S2S performs the best in belief tracking joint goal accuracy and in 3 out of the 4 response generation metrics. Although its response generation performance is not as good as that of the recent GPT-2 based SimpleTOD and SOLOIST, our model is much smaller and thus computationally cheaper.

Semi-Supervised Learning Results

In our semi-supervised experiments, we first split the data according to a fixed proportion, then train the models using only the labeled data (SupOnly), or using both labeled and unlabeled data (Semi) with the proposed variational learning method (Semi-VL), self-training (Semi-ST) or posterior regularization (Semi-PR) introduced in §5.3. We conduct experiments with 50% and 25% labeled data on CamRest676 and In-Car following Jin et al. (2018), and vary the labeled data proportion from 10% to 100% on MultiWOZ. The results are shown in Table 3 and Figure 4.

In Table 3, we can see that the semi-supervised learning methods outperform the supervised-only baseline consistently in all experiments on the two datasets. In particular, the improvement of Semi-VL over SupOnly on our model is significantly larger than that of Semi-PR over SupOnly on SEDST in most metrics, and Semi-VL outperforms Semi-ST in joint goal accuracy by 1.3%∼3.9%.

Labeled Data  Model & Method        CamRest676 (Joint Goal / Match / SuccF1 / BLEU)   In-Car (Joint Goal / Match / SuccF1 / BLEU)
50%   LABES-S2S + SupOnly   83.3 / 91.8 / 80.5 / 23.8    77.9 / 81.0 / 74.5 / 20.4
50%   LABES-S2S + Semi-ST   86.3 / 93.1 / 83.1 / 25.3    79.8 / 83.4 / 74.8 / 22.1
50%   LABES-S2S + Semi-VL   89.7 / 94.4 / 83.1 / 25.3    81.1 / 84.1 / 77.5 / 22.6
50%   SEDST + SupOnly       78.5 / 89.1 / 65.0 / 18.6    74.4 / 74.1 / 69.2 / 16.9
50%   SEDST + Semi-PR       79.5 / 91.1 / 71.2 / 21.4    77.2 / 77.8 / 75.0 / 19.4
25%   LABES-S2S + SupOnly   68.8 / 85.9 / 75.3 / 21.7    74.3 / 73.7 / 62.8 / 15.8
25%   LABES-S2S + Semi-ST   74.1 / 91.1 / 82.5 / 25.4    74.9 / 74.4 / 76.9 / 22.5
25%   LABES-S2S + Semi-VL   77.5 / 93.6 / 81.4 / 25.5    78.8 / 79.3 / 76.6 / 22.4
25%   SEDST + SupOnly       64.2 / 80.3 / 66.8 / 16.9    57.8 / 51.0 / 50.4 / 14.1
25%   SEDST + Semi-PR       65.1 / 83.0 / 71.7 / 22.1    63.6 / 59.9 / 70.4 / 19.3

Table 3:
SupOnly denotes training with only labeled data, and
Semi denotes training with both labeled and unlabeled data in each dataset. ST, VL and PR denote self-training, variational learning and posterior regularization (Jin et al., 2018), respectively. Results of SEDST are obtained by our run of the open-source code. All the scores in this table are the mean over 5 runs.
Figure 4: Performance of different methods w.r.t. labeling proportion on MultiWOZ 2.1: (a) joint goal accuracy; (b) combined score. The dashed line corresponds to the baseline trained with 100% labeled data.
Note that Semi-VL and Semi-ST are fed with the same set of system responses; thus they obtain similar SuccF1 and BLEU scores in Table 3, which mainly measure the response quality.

The results on MultiWOZ shown in Figure 4 also support the above conclusions. From the plot of metric scores w.r.t. labeling proportions, we can see clearly how many labels can be reduced. Our LABES-S2S model trained with Semi-VL obtains a joint goal accuracy of 49.47% and a combined score of 89.21 with only 50% of labeled data, which is very close to the 50.05% and 88.01 obtained under 100% supervision. This indicates that we can remove 50% of the labels without losing performance, which amounts to saving around 30,000 belief state annotations given the size of MultiWOZ. Moreover, it can be seen from Figure 4 that our Semi-VL improves the belief tracking and response generation performance even when only 10% of the dialogs are labeled, and the smaller the amount of labels, the larger the gain obtained by Semi-VL.
Dialog 1
u1: I am looking for an expensive restaurant that serves Russian food.
b1: {food: Russian, pricerange: expensive}
r1: There is no expensive restaurant that serves Russian food. Can I help you with anything else?
u2: Yes, do you have British type food?
b2: {food: British, pricerange: expensive}
r2: Yes, there are 6 options. Does the part of town matter?
u3: I don't care.
SupOnly:  b3: {food: British, pricerange: expensive}
          r3: [v.name] is located at [v.addr]. How else I can help you with?
Semi-VL:  b3: {food: British, pricerange: expensive, area: dontcare}
          r3: [v.name] is a nice British restaurant in the [v.area] of town. Would you like their address?

Dialog 2
u1: I need a train going to Bishops Stortford.
b1: {destination: Bishops Stortford}
r1: OK. Are you departing from Cambridge?
u2: Yes, that's why I'm requesting help from Cambridge town info.
SupOnly:  b2: {destination: Bishops Stortford}
          r2: Where will you be departing from?
Semi-VL:  b2: {destination: Bishops Stortford, departure: Cambridge}
          r2: I have [v.choice] trains that meet your criteria. What day would you like to travel?

Table 4: Comparison of two example turns generated by our model with supervised learning only (SupOnly) and semi-supervised variational learning (Semi-VL).
We give two examples (Table 4) where the model trained with Semi-VL improves over the supervised-training-only baseline. In both examples, the user indicates his/her goal implicitly with a short reply. These rarely occurring corner cases are missed by the baseline model, but successfully captured after semi-supervised learning. Moreover, we can see that Semi-VL helps our model learn the cue word "British", which contributes to a more informative response in the first dialog, and, in the second dialog, avoids the incoherent error caused by error propagation, thus improving the response generation quality.
Conclusion

In this paper we are interested in reducing the belief state annotation cost for building E2E task-oriented dialog systems. We propose a conditional generative model of dialogs, LABES, where belief states are modeled as latent variables, and unlabeled dialog data can be effectively leveraged to improve belief tracking through semi-supervised variational learning. Furthermore, we develop LABES-S2S, which is a copy-augmented Seq2Seq model instantiation of LABES. We show the strong benchmark performance of LABES-S2S and the effectiveness of our semi-supervised learning method on three benchmark datasets. In our experiments on MultiWOZ, we can save around 50% of the labels, i.e. around 30,000 belief state annotations, without performance loss.

There are some interesting directions for future work. First, the LABES model is general and can be enhanced by, e.g., incorporating large-scale pretrained language models, allowing other options for the belief state decoder and the response decoder, such as Transformer-based ones. Second, we can analogously introduce dialog acts $a_{1:T}$ as latent variables to define the joint distribution $p_\theta(b_{1:T}, a_{1:T}, r_{1:T} \mid u_{1:T})$, which can be trained with semi-supervised learning and reinforcement learning as well.

Acknowledgments
This work is supported by NSFC 61976122 and the Ministry of Education and China Mobile joint funding MCM20170301.
References
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49.

Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Guodong Zhou, and Shuming Shi. 2019. A discrete CVAE for response generation on short-text conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1898–1908.

Silin Gao, Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Paraphrase augmented task-oriented dialog generation. arXiv preprint arXiv:2004.07462.

Xiang Gao, Yizhe Zhang, Sungjin Lee, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2019. Structuring latent spaces for stylized response generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1814–1823.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.

Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gašić. 2020. TripPy: A triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR). OpenReview.net.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR). OpenReview.net.

Xisen Jin, Wenqiang Lei, Zhaochun Ren, Hongshen Chen, Shangsong Liang, Yihong Zhao, and Dawei Yin. 2018. Explicit state tracking with semi-supervision for neural dialogue generation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1403–1412.

Byeongchang Kim, Jaewoo Ahn, and Gunhee Kim. 2020. Sequential latent knowledge selection for knowledge-grounded dialogue. In International Conference on Learning Representations (ICLR). OpenReview.net.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR).

Hung Le, Doyen Sahoo, Chenghao Liu, Nancy F. Chen, and Steven C.H. Hoi. 2020a. UniConv: A unified conversational neural architecture for multi-domain task-oriented dialogues. arXiv preprint arXiv:2004.14307.

Hung Le, Richard Socher, and Steven C.H. Hoi. 2020b. Non-autoregressive dialog state tracking. In International Conference on Learning Representations (ICLR). OpenReview.net.

Dong-Hyun Lee. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In ACL 2018: 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1437–1447.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 733–743.

Weixin Liang, Youzhi Tian, Chengcai Cheng, and Zhou Yu. 2020. MOSS: End-to-end dialog system framework with modular supervision. In AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence.

Bing Liu and Ian Lane. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. Proc. Interspeech 2017, pages 2506–2510.

Shikib Mehri, Tejas Srinivasan, and Maxine Eskenazi. 2019. Structured fusion networks for dialog. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020. SOLOIST: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Liliang Ren, Kaige Xie, Lu Chen, and Kai Yu. 2018. Towards universal dialogue state tracking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2780–2786.

Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. 2005. Semi-supervised self-training of object detection models. WACV/MOTION, 2.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.

Weiyan Shi, Tiancheng Zhao, and Zhou Yu. 2019. Unsupervised dialog structure learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1797–1807.

Lei Shu, Piero Molino, Mahdi Namazifar, Hu Xu, Bing Liu, Huaixiu Zheng, and Gökhan Tür. 2019. Flexibly-structured model for task-oriented dialogues. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

Bo-Hsiang Tseng, Marek Rei, Paweł Budzianowski, Richard Turner, Bill Byrne, and Anna Korhonen. 2019. Semi-supervised bootstrapping of dialogue state trackers for task-oriented modelling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1273–1278.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721.

Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J. Young. 2017a. Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3732–3741. PMLR.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017b. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In ACL 2019: The 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819.

Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu. 2019. Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756.

Ke Zhai and Jason D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36–46.

Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, and Caiming Xiong. 2019. Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.

Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Task-oriented dialog systems that consider multiple appropriate responses under the same context. In AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 1208–1218.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.

Su Zhu, Jieyu Li, Lu Chen, and Kai Yu. 2020. Efficient context and schema fusion networks for multi-domain dialogue state tracking. arXiv preprint arXiv:2004.03386.

A Model Comparisons with Prior Work
In this section, we comment on the differences between our LABES-S2S model and Sequicity (Lei et al., 2018), in both model structure and learning method. Note that SEDST (Jin et al., 2018) employs the same model structure as Sequicity.

First, Figure 5 shows the difference in computational graphs between Sequicity/SEDST and LABES-S2S. For Sequicity/SEDST, $b_t$ and $r_t$ are decoded directly from the belief state decoder's hidden states $h^{dec}_{b_t}$; thus the conditional probability of $r_t$ given $b_t$ and the state transition probability between $b_{t-1}$ and $b_t$ are not considered. (Strictly speaking, the transition between belief states across turns and the dependency between system responses and belief states are modeled very weakly in Sequicity/SEDST, owing only to the copy mechanism. For simplicity, we omit such relations in both Figures 5 and 6.)

Figure 5: Comparison of computational graphs.
In contrast, the LABES-S2S model introduces an additional $b_t$ encoder and uses the encoder hidden states $h^{enc}_{b_t}$ to generate the system response and the next turn's belief state; thus the conditional probability $p_\theta(r_t \mid b_t, c_t)$ and the state transition probability $p_\theta(b_t \mid b_{t-1}, c_t)$ are well defined by two complete Seq2Seq processes.

Second, the difference between the models can also be clearly seen from the probabilistic graphical model structures shown in Figure 6. LABES-S2S is a conditional generative model where the belief states are latent variables. In contrast, Sequicity/SEDST do not treat the belief states as latent variables.

Third, the above differences in models lead to differences in learning methods. Sequicity can only be trained on labeled data via multi-task supervised learning. SEDST resorts to an ad-hoc combination of posterior regularization and auto-encoding for semi-supervised learning. Remarkably, LABES-S2S is optimized under the principled variational learning framework.
Figure 6: Comparison of probabilistic graphical modelstructures.
B Datasets
In our experiments, we evaluate different models on three benchmark task-oriented datasets with different scales and ontology complexities (Table 6). The Cambridge Restaurant (CamRest676) dataset (Wen et al., 2017b) contains single-domain dialogs where the system assists users in finding a restaurant. The Stanford In-Car Assistant (In-Car) dataset (Eric et al., 2017) consists of dialogs between a user and an in-car assistant system covering three tasks: calendar scheduling, weather information retrieval and point-of-interest navigation. The MultiWOZ (Budzianowski et al., 2018) dataset is a large-scale human-human multi-domain dataset containing dialogs in seven domains: attraction, hotel, hospital, police, restaurant, train, and taxi. It is more challenging due to its multi-domain setting, complex ontology and diverse language styles. As there are some belief state annotation errors in MultiWOZ, we use the corrected version MultiWOZ 2.1 (Eric et al., 2019) in our experiments. We follow the data preprocessing setting of Zhang et al. (2020), whose data cleaning is developed based on Wu et al. (2019).
C Implementation Details
In our implementation of LABES-S2S, we use 1-layer bi-directional GRUs as encoders and standard GRUs as decoders. The hidden sizes are 100/100/200 and the vocabulary sizes are 800/1400/3000 for CamRest676/In-Car/MultiWOZ respectively; the learning rates of the Adam optimizer are likewise set per dataset. In all experiments, the embedding size is 50 and we use GloVe (Pennington et al., 2014) to initialize the embedding matrix. The dropout rate is 0.35 and the KL term weight α for variational learning is 0.5, both selected via grid search.

Table 5: Statistical results of our LABES-S2S model with standard deviations on MultiWOZ 2.1.
Table 6: Dataset statistics for CamRest676, In-Car and MultiWOZ.