Stylized Text Generation Using Wasserstein Autoencoders with a Mixture of Gaussian Prior
Amirpasha Ghabussi, Lili Mou, Olga Vechtomova
University of Waterloo, University of Alberta
{aghabuss,ovechtom}@[email protected]

Abstract
Wasserstein autoencoders are effective for text generation. However, they do not provide any control over the style and topic of the generated sentences if the dataset has multiple classes and includes different topics. In this work, we present a semi-supervised approach for generating stylized sentences. Our model is trained on a multi-class dataset and learns the latent representation of the sentences using a mixture of Gaussian prior without any adversarial losses. This allows us to generate sentences in the style of a specified class or multiple classes by sampling from their corresponding prior distributions. Moreover, we can train our model on relatively small datasets and learn the latent representation of a specified class by adding external data with other styles/classes to our dataset. While a simple WAE or VAE cannot generate diverse sentences in this case, sentences generated with our approach are diverse, fluent, and preserve the style and the content of the desired classes.
Introduction

Probabilistic text generation is an important application of Natural Language Processing (NLP). The variational autoencoder (VAE) (Kingma and Welling, 2013) is a common and important method for sentence generation. VAE imposes a prior distribution on the latent space, which is typically set to standard normal. It regularizes the latent space by the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) while reconstructing a data sample. This is equivalent to maximizing the variational lower bound of the likelihood of data. VAE is very difficult to train due to the issue of KL collapse. This can be resolved by adding word dropout or KL annealing to the training process (Bowman et al., 2015). Another approach to text generation is Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). However, the GAN objective is not differentiable with respect to discrete outputs, so GANs have difficulties generating discrete sequences (Huszár, 2015); therefore VAE seems more appropriate for sentence generation.

Wasserstein autoencoders (WAE) (Tolstikhin et al., 2017) address the aforementioned problems. They regularize the latent space by pushing the aggregated posterior to the prior. This can be achieved by comparing empirical samples from the prior and the posterior distributions. Since WAE, unlike VAE, does not push the latent posterior to be close to the prior for every given input, this results in better reconstruction performance. Moreover, WAE is much easier to train since it does not use KL divergence to regularize the latent space.

Regular VAE and WAE both generate a sentence by learning a distribution for the latent space. At inference time, by sampling from this space, they can generate sentences similar to the distribution of the dataset they have been trained on. When the dataset has one class or topic, this produces satisfactory results. Yet, since they use a standard normal distribution as their prior, they tend to over-regularize the latent space in cases where the dataset consists of multiple classes with different styles or topics. This can be a major drawback of using VAE or WAE for style-specific text generation.

To solve this problem, we propose a WAE with a Gaussian mixture prior (GMP) with the number of mixture components set to the number of classes in the dataset. This allows us to generate samples with the style of a specified class by sampling only from the mixture component corresponding to this class. Moreover, since we share the same encoder and decoder over all of the classes, we can generate more diverse sentences by training our model on relatively small datasets. Lastly, this also allows us to generate sentences with a mixture of styles by using a weighted average of the latent vectors sampled from multiple Gaussian distributions.

In addition to over-regularizing the latent space, most neural networks depend on big datasets and perform poorly when trained on small datasets. However, achieving good results using small data is an important real-world challenge, and in most cases it is harder than solving a big data challenge. With our proposed approach we show that we can train our model on a small number of data samples by adding data points from different topics to our data. Our experiments show that this has very little effect on the style of the generated sentences.

To summarize, our main contributions are:
• Supervised multi-class sentence generation while preserving the content and style of specified classes
• Diverse sentence generation on relatively small datasets

To evaluate our approach we conduct several experiments.
We use the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018) to run all of our experiments. We perform style-conditioned and style-interpolated sentence generation. Our model produces the most diverse sentences among previous works. Moreover, we illustrate how our model can outperform others in fluency, diversity, and style accuracy by being trained on a small portion of the dataset.
Related Work

In natural language processing there is no unique definition of style. Different authors choose a variety of text characteristics as style. Sentiment, formality, genre, and authorship are common choices for representing the style of a sentence (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; John et al., 2018). There are different approaches to style transfer, stylized generation, and style-specific topic modeling.

One approach to stylized text generation is using style-specific embeddings for sentence generation. Vechtomova et al. (2018) use author-specific embeddings to generate stylized poetry, using multi-modal training data. By pretraining the embeddings with a CNN classifier they are able to generate creative data samples. Fu et al. (2018) propose two different approaches for style transfer: style-specific embeddings and style-specific decoders. By applying adversarial losses during training, they encourage the encoder to only include the content of the sentence in the latent space. They use sentiment as the style of a sentence.

Other works focus on learning separate latent representations of style and content for style transfer or stylized generation. Gao et al. (2019) use a structured latent space to generate stylized dialogue responses. Their model uses a sequence-to-sequence module and an autoencoder with a shared decoder. John et al. (2018) propose another approach and apply an adversarial loss to separate style from content. This approach is designed for style transfer, but it can be conditioned on a desired style and used for stylized generation as well.

A mixture of Gaussian prior was previously used for image clustering (Ben-Yosef and Weinshall, 2018). However, using a mixture of Gaussians for text generation differs from previous works both in terms of the training objective and the model structure. Previous work has used Gaussian mixture models as the prior distribution for several NLP tasks. Shen et al. (2019) use Gaussian mixtures for machine translation. Gu et al. (2018) use an autoencoder network with a GMP to learn the latent representation of sentence-level data points and jointly train a GAN to generate and discriminate in the same space; they use the Wasserstein distance to model dialogue responses. Wang et al. (2019) use an unsupervised approach with a VAE with a Gaussian mixture prior for topic modeling. They apply a training penalty to push the Gaussian distributions further apart in the latent space. However, their choice of bag-of-words for data point representation does not allow them to generate coherent sentences. Moreover, mixing new data points with their dataset of choice might completely change the topics of their model.

Our work is different from previous works in that we use a supervised approach with a GMM as our prior distribution and use labeled data for training. We refer to a specific topic/class as the style of a sentence, similar to Wang et al. (2019), but we propose a supervised approach using the Wasserstein distance. This allows us to have more control over the specific styles our model will learn, and it allows us to mix these styles at inference time. Moreover, we do not apply any penalty to push the Gaussian distributions further apart in the latent space, which makes our model easy to train.
Finally, we can expand our dataset and add new training samples to help our encoder effectively learn the latent representation of our desired classes, and to help our decoder generate much more diverse sentences.
Approach

In this section we describe our approach in detail. We use a stochastic WAE with an MMD penalty and a sequence-to-sequence neural network (Sutskever et al., 2014). Using a Gaussian mixture distribution as our prior, we are able to generate single- and multi-style conditioned sentences at inference time. We further explain our training process and the details of our model in this section.
Autoencoders (Baldi, 2012) encode an input into a latent representation, from which they reconstruct the input again. Usually, the input has a much higher dimension than its corresponding latent representation. However, in some cases, such as noise reduction and speech enhancement, the latent representation can have higher dimensions (Lu et al., 2013). Another important application of autoencoders is style transfer (Shen et al., 2017). Depending on the task, there are multiple design options for the encoder and the decoder networks of an autoencoder, which are chosen based on the input structure. In natural language processing, a common choice for these networks is a feed-forward neural network when the input format is bag-of-words (BOW) (Wallach, 2006). Another common choice for input sequences is Recurrent Neural Networks (RNN). In this work, we use Gated Recurrent Units (GRU) (Choi et al., 2016) as the RNN cell for our encoder and decoder.

Given that at time step $t$ the decoder predicts the next token $x_t$, the training loss of the autoencoder $J_{AE}$ is defined as:

$$J_{AE} = \sum_{i=1}^{N} \sum_{t=1}^{T} -\log p(x_t \mid h, x_1, x_2, \ldots, x_{t-1}) \quad (1)$$

where $h$ is the latent vector representation, $N$ is the number of training samples, and $T$ is the total number of decoding steps.

One approach to regularize the posterior is to impose a constraint that the aggregated posterior of $h$ should be similar to its prior (Tolstikhin et al., 2017). This constraint can be relaxed by penalizing the Wasserstein distance between $q(h)$ and $p(h)$. This can be computed as the Maximum Mean Discrepancy (MMD) between $Q(h)$ and $P(h)$:

$$\mathrm{MMD} = \left\| \int k(h, \cdot)\, dP(h) - \int k(h, \cdot)\, dQ(h) \right\|_{\mathcal{H}_k} \quad (2)$$

where $\mathcal{H}_k$ is the reproducing kernel Hilbert space defined by kernel $k$. We chose the inverse multi-quadratic kernel $k(x, y) = \frac{C}{C + \|x - y\|_2^2}$ in our experiments, which is a common choice.

The MMD penalty can be estimated from empirical samples as:

$$\widehat{\mathrm{MMD}} = \frac{1}{N(N-1)} \sum_{n \neq m} k(h^{(n)}, h^{(m)}) + \frac{1}{N(N-1)} \sum_{n \neq m} k(\tilde{h}^{(n)}, \tilde{h}^{(m)}) - \frac{2}{N^2} \sum_{n, m} k(h^{(n)}, \tilde{h}^{(m)}) \quad (3)$$

where $\tilde{h}^{(n)}$ is a sample from the prior $p$ and $h^{(n)}$ is a sample from the aggregated posterior $q$.

In this work we use a Gaussian mixture model as the distribution for our WAE prior. There are multiple benefits gained from this. First, many datasets are a combination of different styles and classes; therefore, the model structure should account for this in order to learn a good representation of these datasets. Moreover, separating the latent representation of a group of data samples allows the model to be trained on completely different data points at the same time and to learn multiple latent distributions independently. The final distribution of our latent space follows the Gaussian mixture model distribution:

$$P(z) = \sum_{i=1}^{N} w_i\, \mathcal{N}(\mu_i, \sigma_i^2) \quad (4)$$

where $N$ is the number of mixture distributions, $\sum_{i=1}^{N} w_i = 1$, and $w_i \geq 0$. If a dataset has $N$ classes with distinct styles, we use the same number of Gaussian distributions for our latent space and encode every sentence to its corresponding latent distribution. The latent vector representation is then defined as:

$$h = \sum_{i=1}^{N} w_i \times h_i \quad (5)$$

where $h_i$ denotes the vector sampled from the $i$-th Gaussian mixture component, $h_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, and $w_i$ is its corresponding weight.
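To make Equations (2)-(3) concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of the empirical MMD estimator with the inverse multi-quadratic kernel; the batch size, latent dimensionality, and the constant C are illustrative assumptions.

```python
import numpy as np

def imq_kernel(x, y, C=2.0):
    """Inverse multi-quadratic kernel k(x, y) = C / (C + ||x - y||^2).
    x: (n, d), y: (m, d). Returns an (n, m) kernel matrix."""
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return C / (C + sq_dist)

def mmd_estimate(h_post, h_prior, C=2.0):
    """Empirical MMD (Equation 3) between posterior samples h_post and
    prior samples h_prior, each of shape (N, d)."""
    n = h_post.shape[0]
    k_pp = imq_kernel(h_post, h_post, C)
    k_qq = imq_kernel(h_prior, h_prior, C)
    k_pq = imq_kernel(h_post, h_prior, C)
    off_diag = 1.0 - np.eye(n)  # exclude the n == m terms of the within-sample sums
    return ((k_pp * off_diag).sum() / (n * (n - 1))
            + (k_qq * off_diag).sum() / (n * (n - 1))
            - 2.0 * k_pq.sum() / (n ** 2))

# Toy usage: encoder outputs vs. samples from one Gaussian prior component.
rng = np.random.default_rng(0)
h_post = rng.normal(loc=0.5, size=(64, 100))   # posterior samples (placeholder)
h_prior = rng.normal(loc=0.0, size=(64, 100))  # prior samples (placeholder)
print(mmd_estimate(h_post, h_prior))
```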
At training time, each input sequence $(x_{i1}, x_{i2}, \ldots, x_{in})$ is mapped to its corresponding mean and variance vectors. We simultaneously learn multiple priors by pushing the encoded mean and variance vectors to their corresponding prior mean ($\mu_i$) and variance ($\sigma_i^2$) vectors. Since we use a stochastic WAE, we then sample from a normal distribution parameterized by the encoded mean and variance. We use the KL divergence to regularize the stochastic part of our model and produce more diverse sentences, based on the following objective:

$$J_{KL} = \sum_{i=0}^{N} \mathrm{KL}\!\left( \mathcal{N}\!\left(\mu_{post}, \mathrm{diag}(\sigma_{post})^2\right) \,\Big\|\, \mathcal{N}(\mu_{post}, I) \right) \quad (6)$$

To regularize our latent space and learn the prior distribution, we use the MMD penalty following Equation 3. The final training loss is the weighted sum of the KL loss, the MMD loss, and the reconstruction loss. Hence, it can be written as:

$$J_{WAE} = J_{AE} + \lambda_{KL} \cdot J_{KL} + \lambda_{MMD} \cdot \sum_{j=0}^{M} \widehat{\mathrm{MMD}}_j \quad (7)$$

where $M$ is the number of classes in our dataset. This is our objective during the training phase. Note that the gradient only back-propagates through the Gaussian distribution corresponding to the batch class.

During the training phase, we use mini-batches where the samples come from only one input class. This is a stochastic estimation of the actual gradient descent algorithm.
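As a sanity check on how these terms combine, here is a minimal NumPy sketch of Equations (6)-(7). Because the two Gaussians in Equation 6 share the same mean, the KL term reduces to a variance-only expression; the lambda values and function names below are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

def diag_gauss_kl_to_unit_var(sigma_post):
    """KL( N(mu, diag(sigma^2)) || N(mu, I) ) per sample (Equation 6).
    With identical means, only the variance term survives."""
    var = sigma_post ** 2
    return 0.5 * np.sum(var - np.log(var) - 1.0, axis=-1)

def wae_loss(recon_nll, sigma_post, mmd_per_class,
             lambda_kl=0.1, lambda_mmd=10.0):
    """Weighted sum of reconstruction, KL, and per-class MMD terms (Equation 7)."""
    j_kl = diag_gauss_kl_to_unit_var(sigma_post).sum()
    j_mmd = sum(mmd_per_class)  # one empirical MMD term per class
    return recon_nll + lambda_kl * j_kl + lambda_mmd * j_mmd

# Toy usage with placeholder values.
sigma_post = np.full((64, 100), 0.9)      # encoded std-devs for a batch (assumed)
print(wae_loss(recon_nll=250.0, sigma_post=sigma_post, mmd_per_class=[0.03]))
```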
Figure 1: Overview of our approach. The red arrows represent backpropagation through a single class. The black arrows represent the forward pass, where the hidden vector is sampled from two classes with equal weights.

Individual batches are biased towards a certain class, but with multiple batches sampled from all of the classes we estimate the actual training objective. For a sequence with class $i$, we set $w_i = 1$ and all other latent weights to zero. This allows us to back-propagate the reconstruction loss only through the $i$-th Gaussian distribution, and the MMD penalty will push $\mu_i$ and $\sigma_i$ towards the encoded vectors. Moreover, we use recurrent architectures for the encoder and decoder and the cross-entropy loss for reconstruction. Figure 1a shows an overview of the training process. The one-hot class vector provides the training weights for the mixture distributions, and the red arrows show the backpropagation through just one of the distributions.

Text generation with GMM-WAE is slightly different from the training process. By sampling from the latent space we can generate new sentences conditioned either on a single class or on multiple classes. To generate a sentence, we first sample from the latent space and produce the latent vector $h$ following Equation 5. Then we simply feed this vector to the recurrent decoder as its initial state and append it to the input of every time step. We use the standard inference decoder following Wu et al. (2016). Figure 1b shows an overview of the inference process. The classes contributing to the style of the final sentence are those with non-zero weights in the class vector.

Style-Conditioned Sentence Generation: In this setup, we generate sentences conditioned on a single style. This process is similar to the training approach. We set all $w_i$ to zero except for the weight of the class corresponding to the desired (target) style of the generated sentence. Hence, the latent vector is sampled from $P(z) = \mathcal{N}(\mu_k, \sigma_k^2)$, where $k$ is the target style the generated sentence is conditioned on. The sampled latent vector will only include features from the target class. This closely matches what the model sees at training time, so we expect strong results in this setting.

Style-Interpolated Sentence Generation: In this second setup, generated sentences are conditioned on an interpolation between two latent vector samples. To generate a sentence with more than one style, we simply interpolate between two samples from the mixture of Gaussian distributions. We set two of the $w_i$ to non-zero values while satisfying the condition $\sum_i w_i = 1$. This is equivalent to a weighted average of the latent vector samples. By changing the value of the weights for each distribution, we can control the contribution of each style to the final generated sentence. In our experiments, two of the $w_i$ weights are set to 0.5 and the rest are zero. This means the latent vector is sampled from $P(z) = \frac{1}{2}\left(\mathcal{N}(\mu_{k_1}, \sigma_{k_1}^2) + \mathcal{N}(\mu_{k_2}, \sigma_{k_2}^2)\right)$, with $k_1$ and $k_2$ being the desired classes.

To evaluate our approach we use the MNLI dataset. We use a sequence-to-sequence setup (Sutskever et al., 2014) with a maximum sequence length of 30. Our vocabulary size is 30,000, and our latent space has 100 dimensions for every Gaussian prior. During training, we append the encoded latent vector to every step of the decoder. For inference, we use the generated token at each time step as the input to the next RNN time step and append the sampled latent vector to all decoder time steps, similar to the training process.
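The following is a minimal sketch (our illustration, not the authors' code) of how the latent vector of Equation 5 can be drawn for the two generation modes described above; the class ordering and the placeholder prior parameters are assumptions.

```python
import numpy as np

def sample_latent(mus, sigmas, weights, rng=None):
    """Sample h = sum_i w_i * h_i with h_i ~ N(mu_i, sigma_i^2) (Equation 5).
    mus, sigmas: (num_classes, latent_dim); weights: (num_classes,)."""
    rng = rng or np.random.default_rng()
    h_i = rng.normal(loc=mus, scale=sigmas)        # one sample per mixture component
    return (weights[:, None] * h_i).sum(axis=0)    # weighted average of the samples

num_classes, latent_dim = 4, 100
rng = np.random.default_rng(0)
mus = rng.normal(size=(num_classes, latent_dim))   # learned prior means (placeholder)
sigmas = np.ones((num_classes, latent_dim))        # learned prior std-devs (placeholder)

# Style-conditioned: one-hot weights select a single class.
h_single = sample_latent(mus, sigmas, np.array([0.0, 0.0, 1.0, 0.0]), rng)

# Style-interpolated: equal 0.5 weights over two classes.
h_mixed = sample_latent(mus, sigmas, np.array([0.0, 0.5, 0.0, 0.5]), rng)
```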
We compare our results with the work of Vechtomova et al. (2018) and other baselines.

MNLI consists of 433k crowd-sourced sentences from five different genres: Slate, Telephone, Government, Fiction, and Travel. We ignore the Slate genre in our experiments since the sentences in this genre cover a diverse set of topics and it confuses our model. We run two experiments on MNLI to evaluate the performance of our model. Our first experiment uses all of the sentences available and compares the stylized generation performance of our approach with previous work and baselines. For our second experiment, we use a subset of 10240 sentences from each of the four classes mentioned above. We generate samples using separate WAEs trained on sentences from individual classes and using a WAE with a GM prior trained on all 40960 sentences. We describe the metrics used for our comparison in the results section.

Our model works best for generating diverse sentences and outperforms other models on most of the evaluation metrics. When the dataset is relatively small, a WAE or VAE does not capture enough features to generate diverse and fluent sentences. We use a WAE with GMP, train our model using ten percent of the data in MNLI, and compare its performance with other models. Our model outperforms all of the other models in this case.
In this section we discuss the different metrics we used to evaluate the performance of our model and compare it with baselines and previous work. We use the Jensen-Shannon divergence to evaluate the classification accuracy of style-interpolated sentence generation. We also use multiple measures of sentence diversity, and finally we use perplexity to validate the coherence and fluency of the generated sentences.
Jensen-Shannon Divergence:
Jensen-Shannon Divergence (JSD) (Lin, 1991) is our metric of choice to evaluate our multi-class sentence inference accuracy. We sample from two of the Gaussian distributions and average the sampled vectors with equal weights. Then, we feed this vector to the decoder. To determine the class of the inferred sentence, we use our pre-trained classifier. In the ideal situation, the output of the last layer of the classifier should be 0 for all of the classes and 0.5 for the two sampled classes. JSD quantifies the difference between the sampling probability distribution and the classification distribution of the sampled sentences.

D-1 ↑ | D-2 ↑ | Entropy ↑ | PPL ↓ | Classification ↑
Trained on MNLI
Lyrics VAE | 0.034 | 0.070 | 4.153 | 112.3 | 85.2
Disentangled VAE | 0.027 | 0.117 | 4.853 | — | —
Trained on 40960 samples from MNLI
Lyrics VAE(s) | 0.021 | 0.055 | 4.001 | 112.3 | 82.4
Disentangled VAE(s) | 0.021 | 0.096 | 4.392 | — | —
[remaining rows not recoverable]

Table 1: Comparison between our work and others. Note that the classification accuracy for separate WAE models is not a valid measure since each model is only trained on a single class; hence the accuracy should be 100% in theory. Models with (s) in their names are trained on a small subset of MNLI with 40960 training samples over four classes.
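To illustrate how the JSD metric above is computed, here is a minimal sketch of the divergence between the sampling weights and a classifier's output distribution; the example numbers are hypothetical.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (Lin, 1991)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Sampling distribution: equal 0.5 weights on two classes.
sampling = np.array([0.5, 0.0, 0.5, 0.0])
# Hypothetical classifier softmax output for a generated sentence.
classifier = np.array([0.42, 0.05, 0.48, 0.05])
print(jsd(sampling, classifier))
```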
Fiction
kramenin? he drew the question the last time that ’s happened?
man , apparently you do n’t think that the doctor ’s always alone.

Government
i provide guidance in determining the requirements of state agencies also used the additional databases.
there are no success of delivery in california , reducing in pm concentrations.

Travel
hong kong is now a fascinating fifth-century , walled architecture.
the greatest can be sensed in dublin and its surrounding farmlands , a full of historic buildings.

Telephone
uh-huh yeah i guess you ca n’t have our problem.
so they had to talk about it, um oh absolutely.

Government + Travel
so we ’ve talked to our children to pursue little observation from the standpoint that we ’re split in.
one provides opportunities for bargaining delivery system to link between gagas and research is helpful.

Government + Telephone
[examples not recoverable]

Travel + Telephone
given the book now i know like that capital or UNK egypt who came to conquer.
that i guess the remaining states that now is a more easily protected

Table 2: Sentences generated by our GMP WAE model.
Style Accuracy:
We follow the approach of previous work (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; John et al., 2018) and separately train a convolutional neural network (CNN) to classify sentences (Kim, 2014) by their classes. We use this classifier to classify sentences generated with our approach and compare our results with a separate WAE trained only on a specified class of the dataset. We also ran our experiments with separate VAEs, and the results are very close to those of the separate WAEs. The classification accuracy of the classifier over the original MNLI dataset is 98%. Table 1 compares the classification results. Using the WAE with a GM prior lowers the accuracy of the classifier. This is expected because we use a shared decoder over multiple class distributions, but the classifier is still easily able to identify the different classes.
Gov | Fic | Trav | Tel | JSD
[table values not recoverable]

Table 3: Classification accuracy and JSD values for GMM-WAE style-conditioned sentence generation using 40960 training samples from MNLI.
Gov | Fic | Trav | Tel | JSD
[table values not recoverable]

Table 4: Single- and multi-class sentence generation for 40960 sentences from MNLI using the WAE with GM prior. Each row shows the classifier's confidence for sentences sampled from two classes; the sampling weight is equal to 50% for both classes. In all cases the classifier classifies the single-class samples correctly, and for multi-class samples it always finds at least one of the correct classes; about half of the time it correctly finds the other class as well. This table is useful for understanding how a classification and sampling distribution translates into JSD values.
Perplexity:
We use a Kneser-Ney language model (Kneser and Ney, 1995) to evaluate the fluency of our sampled sentences. We measure the empirical distribution of trigrams in a corpus and compute the log-likelihood of a sentence. We train the language model on the original dataset and evaluate the fluency of our sampled sentences. Table 1 provides the fluency results. Our model achieves acceptable results compared to separate WAEs trained on only one class of sentences. This is again because we share the decoder across all of the GM priors, and thus the decoder sometimes uses rare words to generate a sentence for a specified class. These rare words might be common words in another class. Since the language model has not seen such a combination of words, it assigns these sentences a lower likelihood. Note that smaller values correspond to more fluent sentences.
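One way to run this evaluation is with NLTK's Kneser-Ney language model, sketched below; this is our illustration of the procedure, not the authors' implementation, and the toy corpus is a placeholder for the original MNLI training sentences.

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Toy tokenized training corpus standing in for the original dataset.
train_sents = [["the", "government", "report", "was", "released"],
               ["we", "talked", "about", "it", "on", "the", "phone"]]

order = 3
train_ngrams, vocab = padded_everygram_pipeline(order, train_sents)
lm = KneserNeyInterpolated(order)
lm.fit(train_ngrams, vocab)

# Perplexity of a generated sentence under the trigram model
# (lower is more fluent; unseen n-grams can yield infinite perplexity on toy data).
generated = ["the", "report", "was", "released"]
test_trigrams = list(ngrams(pad_both_ends(generated, n=order), order))
print(lm.perplexity(test_trigrams))
```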
Diversity:
We use distinct diversity metrics by computing the percentage of distinct unigrams and bigrams, following the work of Li et al. (2015) and Bahuleyan et al. (2017). Our model outperforms all of the baselines and previous models in terms of sentence diversity. This is because the decoder learns from a more diverse set of sentences when it is trained over multiple classes. The diversity results are provided in Table 1. For the question generation task, since the dataset is very small, neither of the models is successful at generating diverse sentences and they tend to generate the same set of questions over and over. However, the WAE with GM prior generates twice as diverse sentences compared to separate WAEs.
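A minimal sketch of the distinct-n and entropy computations reported in Table 1, under the usual definition of these metrics; the sample sentences are placeholders.

```python
import math
from collections import Counter

def distinct_n(sentences, n):
    """Distinct-n: fraction of unique n-grams among all generated n-grams."""
    grams = [tuple(toks[i:i + n])
             for toks in sentences
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def unigram_entropy(sentences):
    """Shannon entropy (in bits) of the unigram distribution over generated tokens."""
    counts = Counter(tok for toks in sentences for tok in toks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

samples = [["uh-huh", "yeah", "i", "guess"], ["so", "they", "had", "to", "talk"]]
print(distinct_n(samples, 1), distinct_n(samples, 2), unigram_entropy(samples))
```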
Conclusion

Compared to WAE and VAE, WAE with GMP provides control over the style of generated samples. Moreover, it generates fluent and diverse sentences while being capable of generating sentences with a mixture of styles. Additionally, since the GMP is powerful enough to capture the latent representation of the dataset, it is possible to add more data samples with other classes to small datasets and learn enough features to generate diverse samples with a desired style/class. The VAE and WAE are not capable of learning the representation of a dataset, nor can they learn a good language model, when the dataset has few training samples.

References
Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2017. Variational attention for sequence-to-sequence models. arXiv preprint arXiv:1712.08207.

Pierre Baldi. 2012. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, pages 37–49.

Matan Ben-Yosef and Daphna Weinshall. 2018. Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. arXiv preprint arXiv:1808.10356.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xiang Gao, Yizhe Zhang, Sungjin Lee, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2019. Structuring latent spaces for stylized response generation. arXiv preprint arXiv:1909.05361.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, and Sunghun Kim. 2018. DialogWAE: Multimodal response generation with conditional Wasserstein auto-encoder. arXiv preprint arXiv:1805.12352.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1587–1596. JMLR.org.

Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.

Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2018. Disentangled representation learning for text style transfer. arXiv preprint arXiv:1808.04339.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP, volume 1, pages 181–184. IEEE.

Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori. 2013. Speech enhancement based on deep denoising autoencoder. In Interspeech, pages 436–440.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. arXiv preprint arXiv:1902.07816.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. 2017. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.

Olga Vechtomova, Hareesh Bahuleyan, Amirpasha Ghabussi, and Vineet John. 2018. Generating lyrics with variational autoencoder and multi-modal artist embeddings. arXiv preprint arXiv:1812.08318.

Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM.

Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.