Condition-Transforming Variational AutoEncoder for Conversation Response Generation
Yu-Ping Ruan, Zhen-Hua Ling, Quan Liu, Zhigang Chen, Nitin Indurkhya
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China
iFLYTEK Research, Hefei, P.R. China
[email protected], {zhling, nitin}@ustc.edu.cn, {quanliu, zgchen}@iflytek.com

ABSTRACT
This paper proposes a new model, called condition-transforming variational autoencoder (CTVAE), to improve the performance of conversation response generation using conditional variational autoencoders (CVAEs). In conventional CVAEs, the prior distribution of the latent variable z follows a multivariate Gaussian distribution with mean and variance modulated by the input conditions. Previous work found that this distribution tends to become condition-independent in practical applications. In our proposed CTVAE model, the latent variable z is sampled by performing a non-linear transformation on the combination of the input conditions and samples from a condition-independent prior distribution N(0, I). In our objective evaluations, the CTVAE model outperforms the CVAE model on fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than the CVAE and Seq2Seq models at generating fluent, informative, and topic-relevant responses.

Index Terms — variational autoencoders, conversation, text generation
1. INTRODUCTION
There has been growing interest in neural-network-based end-to-end models for text generation tasks, including machine translation [1], text summarization [2], and conversation response generation [3, 4, 5]. Among these, the encoder-decoder framework has been widely adopted; it principally learns the mapping from an input sequence x to its target sequence y. Although this framework has achieved great success in machine translation, previous studies on generating responses for chit-chat conversations [5, 6] have found that ordinary encoder-decoder models tend to generate dull, repeated, and generic responses, such as "i don't know" and "that's ok", which lack diversity. One possible reason is the deterministic calculation of ordinary encoder-decoder models, which constrains them from learning the one-to-n mapping relationship, especially its semantic connections, between an input sequence and multiple potential target sequences. In the chit-chat conversation task, modeling and generating the diversity of responses is important because an input post or context may correspond to multiple responses with different meanings and language styles.

Many attempts have been made to alleviate these deficiencies of encoder-decoder models, such as utilizing extra features or knowledge as conditions to generate more specific responses [7, 8], and improving the model structure, the training algorithms, and the decoding strategies [9, 10, 11]. Additionally, conditional variational autoencoders (CVAEs), which were originally proposed for image generation [12, 13], have recently been applied to dialog response generation [14, 15].

(This work was partially funded by the National Natural Science Foundation of China, Grant No. U1636201.)
Variational generative models, including variational autoencoders (VAEs) and CVAEs, are suitable for learning the one-to-n mapping relationship due to their variational sampling mechanism for deriving latent representations.

This paper studies variational generative models for text generation in single-turn chit-chat conversations. The CVAE models used in previous work [14, 16, 12, 13, 15] all assumed that the prior distribution of the latent variable z followed a multivariate Gaussian distribution p_θ(z|x) whose mean and variance were estimated by a prior network using the condition x as input. However, previous studies on image generation [12, 17] found that samples of z from p_θ(z|x) tended to be independent of x given the estimated models, which implied that the effect of the condition x was constrained at the generation stage. In the conversation response generation task, the condition x is in the form of natural language. The semantic space of x in the training set is always sparse, which further increases the difficulty of estimating the prior network p_θ(z|x). To address this issue of CVAEs, we propose condition-transforming variational autoencoders (CTVAEs) in this paper. In contrast to CVAEs, which use prior networks to describe p_θ(z|x), a condition-independent prior distribution N(0, I) is adopted in CTVAEs. Then, another transformation network is built to derive the samples of z for decoding by transforming the combination of the condition x and samples from N(0, I).

Specifically, the contributions of this paper are two-fold. First, the subjective preference tests in this paper demonstrate that there is no significant performance gap between the ordinary CVAE model and a simplified CVAE model whose prior distribution is fixed as a condition-independent distribution, i.e., N(0, I), which implies that the effect of the condition-dependent prior distribution in CVAE is limited.
Second, a new model, called CTVAE, is proposed to enhance the effect of the conditions in CVAEs. This model samples the condition-dependent latent variable z by performing a non-linear transformation on the combination of the input condition and samples from a condition-independent Gaussian distribution. In our experiments on generating short text conversations, the CTVAE model outperforms CVAE on objective fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on objective diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than the CVAE and Seq2Seq models at generating fluent, informative, and topic-relevant responses.

Fig. 1. Graphical models of (a) CVAE and (b) CTVAE. In each subgraph, the left part shows the recognition process of the latent variable z during the training stage, and the right part shows the process of generating y during the testing stage. The dashed lines and the single solid lines represent the recognition network and the decoder network, respectively. The double solid line in (a) and the thick solid lines in (b) denote the prior network and the transformation network, respectively.
2. METHODOLOGY

2.1. From CVAE to CTVAE
Figure 1 shows directed graphical models of CVAE and CTVAE. In the single-turn short text conversation task, the condition x is the input post and y is the output response. As Figure 1(a) shows, a CVAE is composed of a prior network p_θ(z|x), a recognition network q_φ(z|x, y), and a decoder network p_θ(y|x, z). Both p_θ(z|x) and q_φ(z|x, y) are multivariate Gaussian distributions. The generative process of the response y at the testing stage is as follows: sample a point z from the prior distribution p_θ(z|x), then feed it into the decoder network p_θ(y|x, z). CVAEs can be efficiently trained within the stochastic gradient variational Bayes (SGVB) [18] framework by maximizing the lower bound of the conditional log-likelihood log p(y|x) as follows:

L_CVAE(θ, φ; x, y) = −KL(q_φ(z|x, y) || p_θ(z|x)) + E_{q_φ(z|x,y)}[log p_θ(y|x, z)] ≤ log p(y|x).   (1)

As shown in Figure 1(b), a CTVAE has no prior network but adopts N(0, I) as a transitional prior distribution p_θ(t) to generate y. Similarly, a CTVAE includes a recognition network q_φ(t|y) and a decoder network p_θ(y|x, z). Additionally, CTVAEs use a non-linear transformation network p_θ(z|x, t) to sample the latent variable z from the combination of x and samples of the transitional latent variable t. Following the training strategy for CVAEs, the model parameters of CTVAEs can be estimated by maximizing the lower bound of the conditional log-likelihood log p(y|x) as follows:

L_CTVAE(θ, φ; x, y) = −KL(q_φ(t|y) || p_θ(t)) + E_{p_θ(z|x,t) q_φ(t|y)}[log p_θ(y|x, z)] ≤ log p(y|x).   (2)

The model architecture of the CTVAE implemented in this paper is shown in Figure 2. Specifically, all the encoders and decoders are 1-layer recurrent neural networks with long short-term memory units.
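Because q_φ(t|y) is a diagonal Gaussian and p_θ(t) = N(0, I), the KL term in Eq. (2) has a well-known closed form. A minimal NumPy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def kl_diag_gaussian_vs_standard(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per batch item.

    mu and log_sigma have shape (batch, dim); log_sigma stores log(sigma),
    matching a recognition network that predicts mu and log(sigma).
    """
    # Per-dimension closed form: 0.5 * (sigma^2 + mu^2 - 1 - 2*log(sigma)),
    # summed over the latent dimensions.
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu ** 2 - 1.0 - 2.0 * log_sigma, axis=-1)
```

The KL is zero exactly when q_φ(t|y) matches N(0, I), which is the degenerate solution that KL annealing (Section 3) is used to avoid early in training.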
For an input post x = [x_1, x_2, ..., x_{l_x}] with l_x words, we can derive the corresponding output hidden states [h_1, h_2, ..., h_{l_x}] by sending its word embedding sequence X = [x_1, x_2, ..., x_{l_x}] into the Condition Encoder. Then, the mean pooling of the hidden states [h_1, h_2, ..., h_{l_x}] is used to represent the condition post, denoted as x̄. Similarly, we can derive a vector representation ȳ for the response y by inputting Y = [y_1, y_2, ..., y_{l_y}] into the Output Encoder.

The Recognition Network is a multi-layer perceptron (MLP), which in our implementation has a hidden layer with softplus activation and a linear output layer. The recognition network predicts µ and log(σ) from ȳ, which gives q_φ(t|y) = N(µ, σ²I). The samples of the transitional latent variable t generated from q_φ(t|y) are further used to derive the samples of the latent variable z for reconstructing y during training. To guarantee the feasibility of error backpropagation for model training, the reparameterization trick [18] is performed to generate the samples of t. To derive the samples of the latent variable z, the sampled t is concatenated with the condition x̄ and passed through a transformation network, which in our implementation is an MLP with two tanh hidden layers. The output of the transformation network is used as the sample of the latent variable z.

The initial hidden state of the Output Decoder is x̄. At each time step of the 1-layer LSTM-RNN, the input is composed of the word embedding from the previous time step and the encoding vector enc, which is the concatenation of x̄ and the z sample. According to Eq. (2), the objective function for training is the sum of the log-likelihood of reconstructing y from the Output Decoder and the negative KL divergence between q_φ(t|y) and the prior distribution of the transitional latent variable, p_θ(t) = N(0, I).

In the CVAE built for comparison, all encoders and decoders have structures identical to those in the CTVAE. Both its recognition network and prior network have the same structure as the recognition network in the CTVAE, except that the recognition network accepts the concatenation of x̄ and ȳ as input.

To evaluate the ability of different models to produce diverse responses, multiple responses for each post are generated at the testing stage. Specifically, for the CVAE/CTVAE models, we first generate multiple samples of z. Then, for each z sample, a beam search is adopted to return the best result.
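The training-time sampling path described above (reparameterize t from the recognition network's µ and log σ, then transform the concatenation of the condition representation and t into a z sample) can be sketched in NumPy. The toy dimensions and random weights are illustrative only, and the linear output layer of the transformation MLP is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(x_vec, mu, log_sigma, W1, b1, W2, b2, W3, b3, rng):
    """Draw one z sample: reparameterize t ~ q(t|y), then transform [x; t].

    x_vec: mean-pooled condition representation from the Condition Encoder.
    mu, log_sigma: outputs of the recognition network.
    W*, b*: parameters of the two-tanh-hidden-layer transformation MLP.
    """
    eps = rng.standard_normal(mu.shape)      # epsilon ~ N(0, I)
    t = mu + np.exp(log_sigma) * eps         # reparameterization trick
    h = np.concatenate([x_vec, t])           # combine condition and t
    h = np.tanh(W1 @ h + b1)                 # hidden layer 1 (tanh)
    h = np.tanh(W2 @ h + b2)                 # hidden layer 2 (tanh)
    return W3 @ h + b3                       # linear output = z sample

# Toy dimensions for illustration only (not the paper's settings).
dx, dt, dh, dz = 8, 4, 16, 4
x_vec = rng.standard_normal(dx)
mu, log_sigma = np.zeros(dt), np.zeros(dt)
W1, b1 = rng.standard_normal((dh, dx + dt)), np.zeros(dh)
W2, b2 = rng.standard_normal((dh, dh)), np.zeros(dh)
W3, b3 = rng.standard_normal((dz, dh)), np.zeros(dz)
z = sample_z(x_vec, mu, log_sigma, W1, b1, W2, b2, W3, b3, rng)
```

At testing time the same transformation is applied, except that t is drawn directly from N(0, I) instead of from the recognition network's posterior.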
The multiple responses for each post are reranked using a topic coherence discrimination (TCD) model, which is trained based on the ESIM model [19]. Specifically, we replace all BiLSTMs in the ESIM with 1-layer LSTMs and define the objective of the TCD model as judging whether a response is a valid response to a given post. To train the TCD model, all post-response pairs in the training set are used as positive samples, and negative samples are constructed by randomly shuffling the mapping between posts and responses. Finally, ranking scores are adopted to rerank all responses generated for one post. The scores are calculated as log p_θ(ỹ|c) + λ · log p_TCD(true|x, ỹ), where the first term is the log-likelihood of generating the response ỹ using the decoder network and c is the condition input to the decoder, i.e., [x, z] in the CVAE/CTVAE models. The second term is the log-likelihood output by the TCD model, and λ is the weight between the two terms.
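The reranking rule above can be sketched as follows. The candidate triples and λ values are illustrative, and we assume the TCD model outputs a probability whose log gives the second term:

```python
import math

def rerank(candidates, lam):
    """Rerank generated responses by log p(y|c) + lam * log p_TCD(true|x, y).

    candidates: list of (response, decoder_loglik, tcd_prob) triples, where
    tcd_prob is the TCD model's probability that the response fits the post.
    """
    scored = [(loglik + lam * math.log(p_tcd), resp)
              for resp, loglik, p_tcd in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return [resp for _, resp in scored]

# With a small lam the decoder likelihood dominates; a larger lam lets the
# TCD topic-coherence score override a fluent but generic candidate.
ranked = rerank([("generic reply", -5.0, 0.40),
                 ("on-topic reply", -9.0, 0.95)], lam=1.0)
```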
3. EXPERIMENTS

3.1. Dataset
The short text conversation (STC) dataset from NTCIR-12 [20] was used in our experiments. This dataset was crawled from Chinese Sina Weibo (https://weibo.com/). It was originally prepared for retrieval models and had no standard division for generative models, so we filtered the post-response pairs in the raw STC dataset according to word frequencies to build our own dataset. In the resulting dataset, one post corresponds to multiple responses on average; it therefore contains the one-to-n mapping relationship and is appropriate for studying diverse text generation methods. We randomly split the data into training, development, and test sets, with no overlapping posts among the three sets.

Fig. 2. The model architecture of the CTVAE implemented in this paper. ⊕ denotes the concatenation of input vectors. All the encoders and decoders are 1-layer LSTM-RNNs; both the recognition network and the transformation network are MLPs.

In our experiments, we compared
CTVAE with the following three baseline models: Seq2Seq, CVAE-simple, and CVAE. We did not include the models from the NTCIR-12 contest because they were all retrieval models.

• Seq2Seq: Following previous studies, we used an encoder-decoder neural network with attention as the baseline model [14, 15], similar to that used for machine translation [1, 21]. Both the encoder and decoder were 1-layer LSTM-RNNs, and the attention weights were obtained by the inner product of the hidden states.

• CVAE-simple & CVAE: The CVAE model has been described in Section 2.1. As described in Section 1, the prior distribution p_θ(z|x) in CVAEs was previously found to degrade to p_θ(z). To verify this, we manually removed the prior network p_θ(z|x) in the CVAE and fixed the prior distribution to p_θ(z) = N(0, I). This modified CVAE model is denoted as CVAE-simple.

We trained the models in our experiments with the following hyper-parameters. The word embeddings, the hidden layers of the recognition network, prior network, and transformation network, the hidden state vectors of the encoders and decoders, and the latent variables t in CTVAE and z in CVAE all had fixed dimensionalities. Each encoder and decoder had word embeddings of its own. All word embeddings and model parameters were initialized randomly with Gaussian-distributed samples. Adam [22] was adopted for optimization. When training the CVAEs and CTVAEs, the KL annealing strategy [23] was adopted to address the issue of latent variable vanishing: the model parameters were pre-trained without optimizing the KL divergence term. Additionally, we also adopted a training strategy that optimized the KLD loss term every 3 steps but optimized the reconstruction negative log-likelihood (NLL) loss term at every step.
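The KL annealing and alternating optimization just described can be sketched as follows. The linear ramp and the step counts are illustrative assumptions, not the paper's exact settings:

```python
def kl_weight(step, anneal_steps):
    """Linearly anneal the KL-term weight from 0 to 1 over anneal_steps updates."""
    return min(1.0, step / float(anneal_steps))

def optimize_kld_this_step(step, kld_every=3):
    """Alternating schedule: include the KLD loss term only every `kld_every`
    steps, while the reconstruction NLL term is optimized at every step."""
    return step % kld_every == 0

# At a given step, the training loss would then be assembled as:
#   loss = nll + kl_weight(step, anneal_steps) * kld   if optimize_kld_this_step(step)
#   loss = nll                                         otherwise
```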
Table 1. The objective fluency performance of different models.

               Seq2Seq  CVAE-simple  CVAE   CTVAE
PPL on LM      7.61     31.82        36.96  21.75
Matching (%)   92.58    8.12         10.51  19.10

Table 2. The objective diversity performance of different models.

                Seq2Seq  CVAE-simple  CVAE   CTVAE
Distinct-1 (%)  1.61     10.26        11.52  8.69
Distinct-2 (%)  5.26     41.23        42.60  33.44
Unique (%)      22.86    97.66        97.78  97.62

As described in Section 2, we generated multiple responses for each post. Specifically, for the CVAE and CTVAE models, the number of z samples was set to 50 and the beam search size was 20. For Seq2Seq, a beam search with beam size 50 was used to return multiple responses. The weight λ for reranking was set heuristically. The top-5 responses after reranking were used for evaluation in our experiments.

We trained an RNN language model (LM) [24] on the same STC dataset to evaluate the fluency of the generated responses by calculating their perplexities, denoted as PPL on LM here. Furthermore, the percentage of generated responses that exactly matched any response in the training set was counted. This matching percentage was used as a metric to evaluate a model's ability to generate fluent sentences with reasonable syntactic and semantic representations. For each model, 50 responses were generated for each unique post in the test set, and the responses were reranked using the method described in Section 2. The average LM perplexity and matching percentage of all top-5 responses were calculated for each model, and the results are presented in Table 1. It can be found that the Seq2Seq model achieved the lowest perplexity on the LM and the highest matching percentage because it tended to generate dull, generic, and repeated responses. The CVAE models performed worst on these two fluency metrics. The CTVAE performed much better than the CVAE models on both LM perplexity and matching percentage.

Table 3. Average preference scores (std.) (%) on fluency, topic relevance, and informativeness for three model pairs (P1: Seq2Seq vs. CVAE; P2: CVAE-simple vs. CVAE; P3: CVAE vs. CTVAE), where N/P stands for "no preference" and p denotes the p-value of a t-test between the two models.

                  Pair  Seq2Seq  CVAE-simple  CVAE       CTVAE  N/P        p
Fluency           P1                                                       > .05
                  P2    –        25.6(1.6)    32.4(1.7)  –      42.0(3.0)  > .05
                  P3    –        –            23.6(3.4)  (2.9)  35.2(6.0)  < .05
Topic relevance   P1    (3.2)    –                       –      19.6(2.9)  < .05
                  P2    –        28.0(1.9)    34.4(1.1)  –      37.6(2.5)  > .05
                  P3    –        –            29.6(2.4)  (1.9)  29.6(4.1)  < .05
Informativeness   P1    (3.1)    –                       –      14.0(2.7)  < .05
                  P2    –        28.4(2.3)    28.0(2.1)  –      43.6(4.2)  > .05
                  P3    –        –            32.4(2.2)  (1.1)  20.4(2.6)  < .05

The percentages of distinct unigrams and bigrams [6] in the generated top-5 responses, denoted as distinct-1 and distinct-2 respectively, were used to evaluate the diversity of the generated responses at the n-gram level. We also counted the percentage of unique response sentences, which evaluated the diversity of the responses at the sentence level. The results for the four models are presented in Table 2. It can be found that the Seq2Seq model had the worst diversity at both the n-gram level and the sentence level. CVAE performed slightly better than CVAE-simple, and both CVAE models achieved better diversity at the n-gram level than the CTVAE model, especially for distinct-2. As shown by the fluency results above, the CVAE models performed worst on fluency, which may lead to higher diversity at the surface-text level. For diversity at the sentence level, the percentage of unique responses achieved by CTVAE was close to that of the CVAE models.
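The diversity metrics above can be computed as below. This follows our reading of distinct-n [6], pooling n-grams across a model's generated responses before counting distinct ones (implementations vary on this detail):

```python
def distinct_n(responses, n):
    """distinct-n: number of distinct n-grams divided by total n-grams,
    pooled over a list of tokenized responses."""
    total, distinct = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

def unique_ratio(responses):
    """Fraction of generated response sentences that are unique."""
    return len({tuple(t) for t in responses}) / len(responses)

# Toy generated responses (tokenized) for illustration.
replies = [["i", "do", "not", "know"],
           ["i", "do", "not", "know"],
           ["the", "early", "bird", "catches", "the", "worm"]]
```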
It is difficult to evaluate the final performance of generated conversation responses using objective metrics such as BLEU; it has been argued that such machine translation metrics correlate only very weakly with human judgment in dialogue generation [25]. To evaluate the responses generated by our models in a more comprehensive and convincing manner, several groups of subjective ABX preference tests were conducted. We randomly chose 50 posts from the test set and generated the top-5 responses from each model for each post; the responses generated by two models were compared in each test. Five native Chinese speakers with rich Sina Weibo experience were recruited for the evaluation. For each test post, the pairs of top-5 responses generated by two models were presented in random order, and the evaluators were asked to judge which top-5 responses in each pair they preferred, or to indicate no preference, on three subjective metrics: fluency, topic relevance, and informativeness. Fluency evaluated the quality of the grammar and the semantic logic of responses. Topic relevance measured whether a response matched the topic of the given post. Informativeness measured how informative and interesting a response was. In addition to calculating the average preference scores, the p-value of a t-test was adopted to measure the significance of the difference between two models; p > .05 indicated no significant difference between two models. The subjective evaluation results are presented in Table 3.

According to the results for model pair P2 in Table 3, there is no significant difference on any of the three metrics between CVAE and CVAE-simple, whose prior latent distribution was condition-independent. This implies that the effect of the condition-dependent prior distribution in the CVAE model was limited. From model pair P1, it can be found that CVAE outperformed Seq2Seq significantly on all metrics except fluency. On the other hand, the results for model pair P3 show that CTVAE outperformed CVAE significantly on all three metrics, which confirms the effectiveness of our proposed CTVAE model. These results indicate that our CTVAE model can derive z samples with better condition-dependency than the CVAE model.

Case study: Figure 3 shows one typical example of the top-5 responses generated by the Seq2Seq, CVAE, and CTVAE models.

Fig. 3. An example of the top-5 responses generated by Seq2Seq, CVAE, and CTVAE for the post "It will be sunny after the rain in Beijing today. Night owls, please get up early today. (今天北京雨后天晴,熬夜人啊,早起吧。)".

top-1  Seq2Seq: 是什么软件? (What's this software?)  |  CVAE: 早安。。。 (Good morning...)  |  CTVAE: 早起的鸟儿有虫吃的 (The early bird catches the worm)
top-2  Seq2Seq: 还没睡呢 (Haven't slept yet?)  |  CVAE: 失眠的!! (Sleepless!!)  |  CTVAE: 早睡早起身体好好 (Early to bed and early to rise makes a man healthy)
top-3  Seq2Seq: 还没睡啊 (Haven't slept yet!)  |  CVAE: 今儿今天!! (Today and today!!)  |  CTVAE: 北京下雨了 (It's rainy in Beijing)
top-4  Seq2Seq: 我也在北京 (I'm in Beijing, too.)  |  CVAE: 早上起床了…… (Got up in the morning...)  |  CTVAE: 帝都的天气啊! (What a weather in the city of emperors!)
top-5  Seq2Seq: 我也想去北京 (I want to go to Beijing, too)  |  CVAE: 今天也迟到了吧? (Are you late today too?)  |  CTVAE: 北京天气很好的天气 (Good weather in Beijing)

We can see that the Seq2Seq model tended to generate dull and generic responses, and that the responses of CTVAE tended to be more topic relevant and informative than those of CVAE.
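The significance testing used in the preference evaluation can be sketched as follows. The paper does not specify which t-test variant was used, so this assumes Welch's unequal-variance form over per-evaluator preference scores (an assumption on our part):

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples of preference scores.

    Larger |t| means a more significant difference; the statistic would be
    compared against a t distribution to obtain a p-value.
    """
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)
```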
4. CONCLUSION
We have proposed a model named condition-transforming variational autoencoder (CTVAE) for diverse text generation. In this model, the samples of the latent variable z are derived by performing a non-linear transformation on the combination of the input condition and samples from a prior Gaussian distribution N(0, I). In our experiments on single-turn short text conversation, the CTVAE outperformed the Seq2Seq and CVAE models in both objective and subjective evaluations, which indicates that the CTVAE can derive z samples with better condition-dependency than CVAE models and can improve the consistency between the ground-truth and inference sampling distributions. Applying the proposed CTVAE model to multi-turn conversation response generation and pursuing controllable sampling of the latent variable z will be our future work.

5. REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[2] Alexander M. Rush, Sumit Chopra, and Jason Weston, "A neural attention model for abstractive sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[4] Lifeng Shang, Zhengdong Lu, and Hang Li, "Neural responding machine for short-text conversation," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1577–1586.
[5] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau, "Building end-to-end dialogue systems using generative hierarchical neural network models," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3776–3784.
[6] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, "A diversity-promoting objective function for neural conversation models," arXiv preprint arXiv:1510.03055, 2015.
[7] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma, "Topic aware neural response generation," in AAAI, 2017, pp. 3351–3357.
[8] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley, "A knowledge-grounded neural conversation model," arXiv preprint arXiv:1702.01932, 2017.
[9] Yu Wu, Wei Wu, Dejian Yang, Can Xu, Zhoujun Li, and Ming Zhou, "Neural response generation with dynamic vocabularies," arXiv preprint arXiv:1711.11191, 2017.
[10] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao, "Deep reinforcement learning for dialogue generation," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1192–1202.
[11] Ganbin Zhou, Ping Luo, Rongyu Cao, Fen Lin, Bo Chen, and Qing He, "Mechanism-aware neural machine for dialogue response generation," in AAAI, 2017, pp. 3400–3407.
[12] Kihyuk Sohn, Honglak Lee, and Xinchen Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[13] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee, "Attribute2Image: Conditional image generation from visual attributes," in European Conference on Computer Vision, Springer, 2016, pp. 776–791.
[14] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi, "Learning discourse-level diversity for neural dialog models using conditional variational autoencoders," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 654–664.
[15] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio, "A hierarchical latent variable encoder-decoder model for generating dialogues," in AAAI, 2017, pp. 3295–3301.
[16] Xiaopeng Yang, Xiaowen Lin, Shunda Suo, and Ming Li, "Generating thematic Chinese poetry with conditional variational autoencoder," arXiv preprint arXiv:1711.07632, 2017.
[17] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[18] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[19] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen, "Enhanced LSTM for natural language inference," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1657–1668.
[20] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao, "Overview of the NTCIR-12 short text conversation task," in NTCIR, 2016.
[21] Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[22] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio, "Generating sentences from a continuous space," in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 2016, pp. 10–21.
[24] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[25] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau, "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.