Adversarial Learning on the Latent Space for Diverse Dialog Generation
Kashif Khan*, Gaurav Sahu*, Vikash Balasubramanian*, Lili Mou, Olga Vechtomova
University of Waterloo; Dept. of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii)
{kashif.khan, gaurav.sahu, v9balasu, ovechtom}@uwaterloo.ca, [email protected]

Abstract
Generating relevant responses in a dialog is challenging, and requires not only proper modeling of context in the conversation, but also being able to generate fluent sentences during inference. In this paper, we propose a two-step framework based on generative adversarial nets for generating conditioned responses. Our model first learns a meaningful representation of sentences by autoencoding, and then learns to map an input query to the response representation, which is in turn decoded as a response sentence. Both quantitative and qualitative evaluations show that our model generates more fluent, relevant, and diverse responses than existing state-of-the-art methods.

1 Introduction

Dialog generation is a challenging problem because it not only requires us to model the context in a conversation but also to exploit it to generate a relevant and fluent response. A dialog generation system can be divided into two parts: 1) encoding the context of the conversation, and 2) generating a response conditioned on the given context. A generated response is considered to be "good" if it is meaningful, fluent, and, most importantly, relevant to the given context.

With the advancement of deep learning, sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014) have been adopted for dialog systems to encode conversational context and generate a response. However, they suffer from the problem of generic utterance generation, e.g., always generating "I don't know" (Serban et al., 2016; Li et al., 2016). One possible explanation (Wei et al., 2019) is the high uncertainty in dialog generation. A plausible response is analogous to a "mode" of a continuous distribution, and the response distribution is thus multimodal. However, the decoder of a Seq2Seq model is trained by cross-entropy loss, which is equivalent to minimizing the KL divergence between the target and predicted distributions. The asymmetric nature of the KL divergence makes the learned distribution wide-spreading, analogous to the mode-averaging problem for continuous variables.

Variational encoder-decoders (Serban et al., 2017; Bahuleyan et al., 2018; Zhao et al., 2017) and Wasserstein encoder-decoders (Bahuleyan et al., 2019) adopt probabilistic modeling to encourage diversity in responses. However, their decoders are also trained by cross-entropy loss against the target sequence, still making the model generate generic utterances.

In this paper, we propose an approach that uses adversarial learning in the latent space for dialog generation. We first train a variational autoencoder (VAE) (Kingma and Welling, 2014) on sentences, and then apply a generative adversarial network (GAN) on the latent space of the VAE. At inference time, we obtain the latent representation of the response from the generator of the GAN and decode it using the VAE's decoder. In this way, we can benefit from the mode-capturing property of GANs (Mao et al., 2019; Thanh-Tung et al., 2019). Also, our GAN is trained on the latent space, so techniques like Gumbel-Softmax and reinforcement learning (RL) are not necessary, which largely simplifies the training procedure. We further introduce a mean squared error (MSE) auxiliary loss to our adversarial module, which mitigates the mode-missing problem in GANs (Che et al., 2017), resulting in more relevant and diverse responses.

We evaluate our model on the deduplicated version (Bahuleyan et al., 2018) of the benchmark DailyDialog dataset (Li et al., 2017) and also on the Switchboard dataset (Godfrey et al., 1992). Results indicate that responses generated by our model are more relevant to the input query/context, and are more diverse and fluent than those of the existing baselines.

The main contributions of our paper are as follows.

1. We propose a two-step framework of latent-space adversarial learning for generating diverse and relevant responses.

2. We propose a combination of adversarial loss and an auxiliary mean squared loss to help the GAN converge faster and achieve better performance for dialog generation.

* Equal contribution. The code is available at https://github.com/vikigenius/conditional_text_generation

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Figure 1: The framework of our proposed two-step training procedure. (a) Step 1: variational autoencoder. (b) Step 2: adversarial network (dashed box). ⊕ denotes concatenation.

2 Approach

Figure 1 provides an overview of our proposed two-step approach.
Step 1:
We first train an autoencoder, which takes an utterance s (either a query or a response) as input, obtains its latent code z_s from the encoder, and then feeds it to a decoder for reconstruction. The autoencoder learns a real-valued vector representation of a generic sentence.
Step 2:

We train an adversarial network on the latent z space for learning dialog generation. Given a query-response pair (q, r) in the training set, we use the trained encoder from Step 1 to obtain their latent variables z_q and z_r. The query latent variable z_q is fed to a generator G that maps it to the corresponding response's latent variable ẑ_r. When training the generator, we aim to match z_r and ẑ_r through the adversarial loss combined with a mean squared error loss. Here, the adversarial loss further involves a discriminator that classifies the predicted response representation ẑ_r versus the encoded representation of the true response z_r, conditioned on the query z_q. The details of our approach are introduced in the rest of this section.

2.1 Step 1: Variational Autoencoder

In Step 1, our primary goal is to learn a continuous representation of all utterances in the dialog corpus. The mapping from a sentence to its continuous representation should ideally be invertible, so that our adversarial loss (in Step 2) can be applied to the continuous space to generate dialog responses.

In particular, we adopt a variational autoencoder (VAE; Kingma and Welling, 2014) for our first step. A VAE encodes an input sentence s into a probabilistic, latent continuous representation z, from which the input sentence s is reconstructed. We first impose a prior distribution on z, which is typically set to the standard normal p(z) = N(0, I). Given the sentence s, the VAE encodes a posterior distribution q_E(z|s) = N(µ, diag σ²), where µ and σ are predicted by the encoder of the VAE. The training objective is to minimize the expected reconstruction loss, penalized by a KL divergence term between the posterior and the prior:

J_AE(θ_Enc, θ_Dec) = −E_{q_E(z|s)}[log p(s|z)] + λ_KL · KL(q_E(z|s) ‖ p(z))    (1)

where λ_KL balances the two terms.

Compared with a deterministic autoencoder, the VAE learns a smoother latent space through its KL regularization. This is helpful during the second step, where a GAN is trained to predict the latent representation of a response for decoding.
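To make Eq. (1) concrete, the sketch below shows one way the Step-1 objective could be computed for a batch of sentences in PyTorch. It is only an illustrative sketch under assumed interfaces: the encoder is assumed to return the posterior parameters (µ, log σ²) and the decoder to return teacher-forced token logits; these names, signatures, and batching details are not taken from the authors' released code.

```python
import torch
import torch.nn.functional as F

def vae_step(encoder, decoder, tokens, kl_weight=0.15):
    """One Step-1 training objective (Eq. 1) for a batch of token-id sequences.

    Assumed interfaces (not the paper's actual code):
      encoder(tokens) -> (mu, log_var), the posterior parameters of q_E(z|s);
      decoder(z, tokens) -> per-token logits for teacher-forced reconstruction.
    """
    mu, log_var = encoder(tokens)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    logits = decoder(z, tokens)                      # (batch, seq_len, vocab)
    # Reconstruction term: -E_q[log p(s|z)], here as token-level cross-entropy
    recon = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # KL(q_E(z|s) || N(0, I)) in closed form, averaged over the batch
    kl = -0.5 * torch.mean(
        torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
    )

    # J_AE = reconstruction + lambda_KL * KL  (Eq. 1)
    return recon + kl_weight * kl
```

In practice, the KL weight is not fixed but annealed during training (see Appendix A).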
2.2 Step 2: Adversarial Network on the Latent Space

In Step 2, the main objective is to predict the representation of the response given the dialog context (such as the previous utterance). In this way, the predicted latent representation of the response can be fed to the trained decoder from Step 1 to generate the response utterance.

To predict the response representation, we re-use the encoder in Section 2.1 to capture the meaning of the context query as z_q. Then we use a two-layer perceptron (with a ReLU activation function in the hidden layer) to predict the representation of the utterance to be generated, denoted by ẑ_r = G(z_q).

For adversarial training, we also encode the representation of the ground-truth reply r as z_r with the encoder from Step 1. We train an adversarial discriminator D to classify whether a response representation is real or predicted. Such classification should be based on the context, because we would like to learn not only whether an utterance is appropriate as a reply, but also whether it is appropriate to a specific query. Therefore, we also feed the encoded context representation into the discriminator. The classification is denoted by D(z_r, z_q) or D(ẑ_r, z_q), where we essentially concatenate the representations of the response and the query before feeding them to a logistic regression layer.

The adversarial loss for training the latent space is given by

J_CGAN = min_G max_D V(D, G)    (2)

V(D, G) = E_{(z_q, z_r) ~ D_train} [ log D(z_r, z_q) + log(1 − D(G(z_q), z_q)) ]    (3)

where D_train is the training data. In other words, the discriminator D is trained by maximizing V(D, G) so as to distinguish the true representation of a response from the predicted response representation given the query, whereas the generator G is trained to fool the discriminator by minimizing V(D, G).

It should be emphasized that our model is different from adversarial autoencoders (Makhzani et al., 2015), because our discriminator takes the query into consideration. Our adversarial loss learns an implicit conditional distribution p(z_r | z_q), instead of a marginal distribution p(z_r) as in Zhao et al. (2018).

Additionally, we introduce an auxiliary mean squared error (MSE) loss to the objective function:

J_MSE = ‖ z_r − ẑ_r ‖²    (4)

The MSE loss on the generator helps stabilize the GAN training and mitigate the mode-missing problem of GANs (Che et al., 2017). In summary, the overall training objective is given by

J = J_CGAN + γ J_MSE    (5)

where γ is a tunable hyperparameter that moderates the effect of the MSE loss.

For inference, our model first uses the pretrained VAE from Step 1 to encode an unseen query q* as z_{q*}. This encoded representation is then passed to the generator G to predict the response latent code G(z_{q*}), which is finally fed to the decoder of the VAE from Step 1 to generate a response sentence.

In our experiments, we have two settings for dialog generation: single-turn and multi-turn. In the single-turn setting, we form query-response samples by extracting every pair of consecutive utterances of a conversation in the training data. In the multi-turn setting, we form query-response pairs by extracting every utterance with its preceding utterances in the entire conversation. The VAE in Step 1 remains the same, but we introduce another RNN to encode context. Specifically, it is built upon the VAE's encoded representation of each utterance, and yields a fixed-length vector representation of the entire context. During adversarial training, we concatenate the context vector with the query (immediately preceding utterance) representation before feeding them to the generator. In this way, our generator also takes the context into account when predicting the response latent code. A similar adjustment is applied during inference as well.
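As a concrete illustration of Step 2, the following PyTorch-style sketch implements a conditional discriminator and generator of the form used in Eqs. (2)-(3), together with the combined objective of Eq. (5). The layer sizes follow Appendix A, but the module names, the non-saturating form of the generator loss, and the default value of γ are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 128   # latent size from Appendix A


class Generator(nn.Module):
    """Two-layer perceptron G that maps a query code z_q to a predicted response code."""

    def __init__(self, dim=LATENT_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_q):
        return self.net(z_q)


class Discriminator(nn.Module):
    """D(z_r, z_q): scores a (response, query) pair; the two codes are concatenated."""

    def __init__(self, dim=LATENT_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),   # logit of "true response code" vs. "predicted"
        )

    def forward(self, z_r, z_q):
        return self.net(torch.cat([z_r, z_q], dim=-1))


def step2_losses(G, D, z_q, z_r, gamma=1.0):
    """Return (discriminator loss, generator loss) for one batch of latent codes."""
    z_r_hat = G(z_q)

    # Discriminator: maximize log D(z_r, z_q) + log(1 - D(G(z_q), z_q))  (Eq. 3)
    real_logit = D(z_r, z_q)
    fake_logit = D(z_r_hat.detach(), z_q)
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )

    # Generator: fool D (non-saturating variant of Eq. 3) plus the MSE term,
    # giving J = J_CGAN + gamma * J_MSE  (Eqs. 4-5)
    adv = F.binary_cross_entropy_with_logits(
        D(z_r_hat, z_q), torch.ones_like(real_logit)
    )
    g_loss = adv + gamma * F.mse_loss(z_r_hat, z_r)
    return d_loss, g_loss
```

At inference time, an unseen query is simply encoded by the Step-1 encoder, mapped through G, and decoded by the Step-1 decoder; the discriminator is only needed during training.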
[Table 1 placeholder. Columns: BLEU (Avg / Max / HM); Diversity (Intra-1 / Intra-2 / Inter-1 / Inter-2 / ASL / TTR); Fluency (PPL). Single-turn rows: Seq2Seq, WED-S, DialogWAE, VAE-M (ours), VAE-A (ours), VAE-AM (ours). Multi-turn rows: CVAE*, CVAE-CO*, VHCR*, DialogWAE, VAE-AM (ours). Surviving entries: VAE-AM single-turn BLEU Avg/Max/HM = 0.306/0.367/0.334; VAE-AM multi-turn BLEU Avg/Max/HM = 0.314/0.371/0.340, with further multi-turn entries 0.847, 0.98, 0.41, 0.73. All other cell values are omitted.]

Table 1: Results on the DailyDialog dataset. BLEU scores are computed by the average/maximum of 10 randomly sampled replies. HM is the harmonic mean of the average and maximum BLEU scores. Suffix A: adversarial loss; suffix M: MSE loss; suffix AM: both adversarial and MSE losses. * denotes results taken from Gu et al. (2019), whose training and test sets contain duplicate samples. Bold font shows the best performance on the de-duplicated dataset. The number in brackets in the ASL column is the ground-truth average sentence length (14.43 for DailyDialog).
[Table 2 placeholder. Columns as in Table 1, with a ground-truth ASL of 8.49 for Switchboard. Single-turn rows: Seq2Seq, WED-S, DialogWAE, VAE-M (ours), VAE-A (ours), VAE-AM (ours). Multi-turn rows: CVAE*, CVAE-CO*, VHCR*, DialogWAE, VAE-AM (ours). Surviving entries: VAE-AM BLEU Avg = 0.259 (single-turn) and 0.271 (multi-turn). All other cell values are omitted.]

Table 2: Results on the Switchboard dataset. * denotes the numbers taken from Gu et al. (2019).
3 Experiments

We conduct experiments on the DailyDialog dataset (Li et al., 2017), a manually labeled multi-turn dialog dataset, and the Switchboard dataset (Godfrey et al., 1992), a dialog dataset containing transcripts of telephonic conversations. For DailyDialog, we use the original splits after removing duplicates, following Bahuleyan et al. (2019). We use the AllenNLP framework (Gardner et al., 2018) to implement all our models. Appendix A presents more experimental details and hyperparameters.

We use the following baseline models for comparison:
• Seq2Seq. The standard sequence-to-sequence model based on LSTMs.

• WED-S. A stochastic Wasserstein encoder-decoder model (Bahuleyan et al., 2019).

• DialogWAE. A model based on adversarial regularization of autoencoders (Gu et al., 2019).

• HRED. A generalized Seq2Seq model that uses a hierarchical RNN encoder (Serban et al., 2016).

• CVAE. A conditional VAE model with KL annealing (Zhao et al., 2017).

• CVAE-CO. A collaborative conditional VAE model (Shen et al., 2018).
The results for the DailyDialog and Switchboard datasets are shown in Tables 1 and 2, respectively. The evaluation protocol follows Gu et al. (2019), with code at https://github.com/guxd/DialogWAE/blob/29f206af05bfe5fe28fec4448e208310a7c9258d/experiments/metrics.py. The generated responses are evaluated by the following criteria:
Overall quality.
We measure the quality of the generated responses by BLEU scores (Papineni et al., 2002), for which we adopt the smoothing techniques in Gu et al. (2019). For each query, we generate 10 responses and compute the average and maximum BLEU scores. We also compute the harmonic mean of the average and the maximum BLEU scores. Our model is either the best-performing model or highly competitive in terms of BLEU scores. The DialogWAE model also achieves high BLEU scores, while the Seq2Seq model is the worst-performing model.
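For reference, the per-query BLEU statistics described above can be computed along the following lines. This is a simplified sketch that uses NLTK's built-in smoothing as a stand-in for the exact smoothing of Gu et al. (2019); the function name and tokenization are assumptions.

```python
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_avg_max_hm(reference, hypotheses):
    """Per-query BLEU statistics over a set of sampled responses.

    reference: token list of the ground-truth reply;
    hypotheses: list of token lists, e.g., the 10 responses sampled for one query.
    """
    smooth = SmoothingFunction().method1   # stand-in for the smoothing of Gu et al. (2019)
    scores = [sentence_bleu([reference], hyp, smoothing_function=smooth)
              for hyp in hypotheses]
    avg, mx = mean(scores), max(scores)
    hm = 2 * avg * mx / (avg + mx) if (avg + mx) > 0 else 0.0   # harmonic mean of Avg and Max
    return avg, mx, hm
```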
Diversity.
We measure the diversity of dialog generation in two aspects:
• Intra-diversity. The intra-diversity score measures the proportion of distinct unigrams and bigrams in each response. It is similar for most models.
• Inter-diversity. The inter-diversity scores measure the proportion of distinct unigrams and bigrams across all 10 responses.

We note that our model performs the best across the inter-diversity metrics. We further use other diversity indicators, such as the average sentence length (ASL) of the responses. We see that the diversity scores of the Seq2Seq model are very high on the Switchboard dataset; however, it also has the lowest ASL. This observation is within expectations, and the Seq2Seq model does not generate diverse responses overall. DialogWAE generates longer responses on average; however, our model is closer to the ground-truth ASL (14.43 for DailyDialog and 8.49 for Switchboard). We also note that our model achieves good type-token ratio (TTR) scores, indicating diverse word choices compared with other models. TTR is computed at the corpus level, whereas Inter-n diversity is the average of the per-sample distinct n-gram ratio.
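The diversity measures above admit a straightforward implementation; the sketch below shows one possible version, assuming pre-tokenized responses and the standard distinct-n definitions (the exact evaluation script follows Gu et al., 2019).

```python
from itertools import chain

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def intra_distinct(response, n):
    """Intra-n: proportion of distinct n-grams within a single response."""
    grams = ngrams(response, n)
    return len(set(grams)) / max(len(grams), 1)

def inter_distinct(responses, n):
    """Inter-n: proportion of distinct n-grams across all responses sampled for one query."""
    grams = list(chain.from_iterable(ngrams(r, n) for r in responses))
    return len(set(grams)) / max(len(grams), 1)

def type_token_ratio(corpus_tokens):
    """Corpus-level type-token ratio over all generated tokens."""
    return len(set(corpus_tokens)) / max(len(corpus_tokens), 1)
```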
Fluency.

We compute the PPL scores of generated responses to measure fluency. We notice that our model achieves the best PPL scores, although DialogWAE is quite close. The Seq2Seq model also achieves low PPL, but this is mainly due to its short and generic responses. Interestingly, PPL scores are generally higher in the multi-turn setting, which may be attributed to the increased complexity of the output when more context is given.
Analysis of Losses.
Combining the MSE and adversarial losses leads to significant improvements across all metrics, including the BLEU scores, response diversity (Inter-1 and Inter-2), and fluency (PPL). In our experiments, we also notice that the MSE term leads to quicker and more stable convergence of the GAN (within 6 epochs), making training easier.

We present a human evaluation in Appendix B and a case study in Appendix C.
4 Conclusion

We propose an effective two-stage model for dialog generation. We make use of sentence representations learned by a VAE and train an adversarial network on the VAE's latent space to generate diverse responses given a query and context. We observe that our model outperforms existing state-of-the-art approaches by generating more diverse, fluent, and relevant sentences.

References
Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2018. Variational attention for sequence-to-sequence models. In COLING, pages 1672–1682.

Hareesh Bahuleyan, Lili Mou, Hao Zhou, and Olga Vechtomova. 2019. Stochastic Wasserstein autoencoder for probabilistic sentence generation. In NAACL-HLT, Volume 1, pages 4068–4076.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL, pages 10–21.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2017. Mode regularized generative adversarial networks. In ICLR.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP, Volume 1, pages 517–520.

Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, and Sunghun Kim. 2019. DialogWAE: Multimodal response generation with conditional Wasserstein auto-encoder. In ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, Volume 9, pages 1735–1780.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL-HLT, pages 110–119.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In IJCNLP, pages 986–995.

Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. 2019. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, pages 1429–1437.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3783.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.

Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. 2018. Improving variational encoder-decoders in dialogue generation. In AAAI, pages 5456–5463.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. 2019. Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984.

Bolin Wei, Shuai Lu, Lili Mou, Hao Zhou, Pascal Poupart, Ge Li, and Zhi Jin. 2019. Why do neural dialog systems generate short and meaningless replies? A comparison between dialog and translation. In ICASSP, pages 7290–7294.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, pages 654–664.

Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In ICML, pages 5902–5911.
A Hyperparameter Settings and Training
Single-turn.
In this setting, we first train a VAE on the entire corpus. We use a single-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder and a unidirectional LSTM layer for the decoder of the VAE. Both use a hidden size of 512. The dimension of our latent vectors is 128, and that of the word embeddings is 300. Further, we adopt KL annealing and word dropout from Bowman et al. (2016) to stabilize the VAE's training. We use a word dropout probability of 0.5 and a sigmoid annealing schedule that anneals the KL weight to 0.15 over 4500 iterations. The performance statistics of the VAE in Step 1 are shown in Table 3.

Model  KL    BLEU  Dist-1  Dist-2
VAE    18.8  0.18  0.32    0.49

Table 3: Performance of the VAE in Step 1 on the DailyDialog dataset. BLEU is the reconstruction BLEU-4, Dist-1 and Dist-2 are the proportions of distinct unigrams and bigrams in generated samples, and KL is the validation KL in the best epoch (by ELBO).

For the GAN, we use a 2-layer feed-forward network with a hidden layer of 256 units as the generator, along with batch normalization (Ioffe and Szegedy, 2015) and LeakyReLU activation (Maas et al., 2013). The discriminator shares a similar architecture. We use Adam (Kingma and Ba, 2015) to optimize all our networks.
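For illustration, the sigmoid KL-annealing schedule and the word dropout described above could look as follows; the sigmoid's slope and midpoint are assumed values, since only the target weight (0.15) and the annealing length (4500 iterations) are specified.

```python
import math
import random

def kl_weight(step, max_weight=0.15, total_steps=4500, slope=0.003):
    """Sigmoid KL-annealing schedule, ramping the KL weight from ~0 to max_weight.

    Only the target weight (0.15) and annealing length (4500 iterations) come from
    the paper; the slope and the midpoint placement are assumed values.
    """
    return max_weight / (1.0 + math.exp(-slope * (step - total_steps / 2)))

def word_dropout(tokens, unk_token="<unk>", p=0.5):
    """Randomly replace decoder-input tokens with <unk> (Bowman et al., 2016)."""
    return [unk_token if random.random() < p else tok for tok in tokens]
```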
Multi-turn.
In this setting, the VAE's architecture remains the same as in the single-turn setting. We introduce another BiLSTM encoder with a hidden size of 512, which is fed with the VAE-encoded representations of the context sentences. Other hyperparameters are kept the same. For implementation, our generator predicts the response representation at each turn, but we use teacher forcing, assuming the context consists of the actual previous utterances.
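A minimal sketch of such a context encoder is given below, assuming the latent size (128) and hidden size (512) from this appendix; how the final hidden states are pooled and projected is our own assumption.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """BiLSTM over per-utterance latent codes, yielding one fixed-length context vector.

    Sizes follow Appendix A (latent size 128, hidden size 512); pooling the final
    hidden states and projecting back to the latent size are assumed design choices.
    """

    def __init__(self, latent_dim=128, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, latent_dim)

    def forward(self, utterance_codes):
        # utterance_codes: (batch, num_context_utterances, latent_dim)
        _, (h_n, _) = self.rnn(utterance_codes)        # h_n: (2, batch, hidden)
        context = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward and backward final states
        return self.proj(context)                      # (batch, latent_dim)

# The generator input is then the concatenation [context ; z_q] rather than z_q alone.
```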
B Human Evaluation
In addition to the automatic metrics, we also evaluate our model and compare it with DialogWAE (Gu et al., 2019) using human evaluation. Five human judges rate the responses of each model on a scale of 1–5, according to two criteria: 1) relevance to the query, and 2) fluency of the generated response. As can be seen from Table 4, our model is competitive and achieves better scores on both criteria.

Model      Relevance  Fluency
VAE-AM     2.3625     3.2225
DialogWAE  2.3125     3.1775

Table 4: Human evaluation results.
C Case Study
We conduct a case study comparing VAE-AM's responses with DialogWAE's in the single-turn and multi-turn settings. For both settings, we show the best response out of 10 generated responses for each input. Table 5 shows that DialogWAE generates generic responses, although its sentences are longer. Our model produces more concrete concepts, such as "leaflets" and "money." Table 6 shows generated responses in the multi-turn setting. We see that both models are more informative when more context is available, compared with Table 5. However, DialogWAE repeats the same concepts, such as "a lot of" in Examples 2, 3, and 5, and "have to pay" in Examples 2 and 5. This observation is also consistent with the TTR results in Tables 1 and 2.
1 Query: o . k . well , go to google.com , and type in battle of gettysburg .
  VAE-AM: oh , there are a lot of things that i need to do .
  DialogWAE: then ill have to check the details . what can i do for you today ?

2 Query: yes . i m going to the untied states for travel . what places do you recommend ?
  VAE-AM: here are some leaflets detailing everything . i have nt seen you in ages .
  DialogWAE: it s very nice to see you again , but i think it s a good idea to do .

3 Query: medium one will be perfect , thank you
  VAE-AM: is that all ?
  DialogWAE: in that case , i have to be able to get some of them .

4 Query: the annual charge rate is NUM .
  VAE-AM: that s a lot of money . can you make it a little cheaper ?
  DialogWAE: then what about the

5 VAE-AM: ok , we will . how about this one ?
  DialogWAE: in that case , i have to be able to get some of them . i think i can get a new one .
Table 5: Generated responses in the single-turn setting.

1 Query: allan , i am really disappointed by your lack of effort on this project .
  VAE-AM: it seems you are not aware of the actual work split . i have done more than anyone else .
  DialogWAE: ok , it's not that bad . but it's not that bad . i don't want to have any other questions about it .

2 Query: when is he coming ?
  VAE-AM: next week . he is enjoying his month long vacation in miami .
  DialogWAE: sorry , he's not very serious . he's got a lot of money to do .

3 Query: i ll try my best to make a good impression on the interviewer .
  VAE-AM: do nt worry pal . only weak candidates should . just dress properly and try to make it alive there haha ...
  DialogWAE: that is ok , we ' ll be able to get a raise in the next few days . we will have to pay you a lot of money .

4 Query: can you tell me where the pots and pans are ?
  VAE-AM: absolutely not . i m not letting you cook today . you messed it real bad yesterday .
  DialogWAE: yes , of course .

5 Query: what kind of personal information ?
  VAE-AM: your name , height , sex , and contact . also tick this check box if you have faced this issue before .
  DialogWAE: not at the moment . but there is a lot of money in the city , but we have to pay a lot of money for the money .

Table 6: Generated responses in the multi-turn setting.