[PDF] Toward Controlled Generation of Text

Abstract

Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible natural language sentences, whose attributes are dynamically controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns highly interpretable representations from even only word annotations, and produces realistic sentences with desired attributes. Quantitative evaluation validates the accuracy of sentence and attribute generation.

Full PDF

TToward Controlled Generation of Text

Zhiting Hu

Zichao Yang Xiaodan Liang

Ruslan Salakhutdinov Eric P. Xing

Abstract

Generic generation and manipulation of text ischallenging and has limited success comparedto recent deep generative modeling in visual do-main. This paper aims at generating plausibletext sentences, whose attributes are controlled bylearning disentangled latent representations withdesignated semantics. We propose a new neu-ral generative model which combines variationalauto-encoders (VAEs) and holistic attribute dis-criminators for effective imposition of semanticstructures. The model can alternatively be seenas enhancing VAEs with the wake-sleep algo-rithm for leveraging fake samples as extra train-ing data. With differentiable approximation todiscrete text samples, explicit constraints on in-dependent attribute controls, and efﬁcient col-laborative learning of generator and discrimina-tors, our model learns interpretable representa-tions from even only word annotations, and pro-duces sentences with desired attributes of senti-ment and tenses. Quantitative experiments usingtrained classiﬁers as evaluators validate the accu-racy of short sentence and attribute generation.

1. Introduction

There is a surge of research interest in deep generativemodels (Hu et al., 2017), such as Variational Autoencoders(VAEs) (Kingma & Welling, 2013), Generative Adver-sarial Nets (GANs) (Goodfellow et al., 2014), and auto-regressive models (van den Oord et al., 2016). Despite theirimpressive advances in visual domain, such as image gen-eration (Radford et al., 2015), learning interpretable imagerepresentations (Chen et al., 2016), and image editing (Zhuet al., 2016), applications to natural language generationhave been relatively less studied. Even generating realis-tic sentences is challenging as the generative models are Carnegie Mellon University Petuum, Inc.. Correspondenceto: Zhiting Hu < [email protected] > . Proceedings of the th International Conference on MachineLearning , Sydney, Australia, PMLR 70, 2017. Copyright 2017by the author(s). required to capture complex semantic structures underly-ing sentences. Previous work have been mostly limitedto task-speciﬁc applications in supervised settings, includ-ing machine translation (Bahdanau et al., 2014) and imagecaptioning (Vinyals et al., 2015). However, autoencoderframeworks (Sutskever et al., 2014) and recurrent neuralnetwork language models (Mikolov et al., 2010) do not ap-ply to generic text generation from arbitrary hidden rep-resentations due to the unsmoothness of effective hiddencodes (Bowman et al., 2015). Very few recent attempts ofusing VAEs (Bowman et al., 2015; Tang et al., 2016) andGANs (Yu et al., 2017; Zhang et al., 2016) have been madeto investigate generic text generation, while their generatedtext is largely randomized and uncontrollable.In this paper we tackle the problem of controlled generationof text. That is, we focus on generating realistic sentences,whose attributes can be controlled by learning disentangledlatent representations. To enable the manipulation of gen-erated sentences, a few challenges need to be addressed.A ﬁrst challenge comes from the discrete nature of textsamples. The resulting non-differentiability hinders the useof global discriminators that assess generated samples andback-propagate gradients to guide the optimization of gen-erators in a holistic manner, as shown to be highly effectivein continuous image generation and representation model-ing (Chen et al., 2016; Larsen et al., 2016; Dosovitskiy &Brox, 2016). A number of recent approaches attempt to ad-dress the non-differentiability through policy learning (Yuet al., 2017) which tends to suffer from high variance dur-ing training, or continuous approximations (Zhang et al.,2016; Kusner & Hernndez-Lobato, 2016) where only pre-liminary qualitative results are presented. As an alterna-tive to the discriminator based learning, semi-supervisedVAEs (Kingma et al., 2014) minimize element-wise recon-struction error on observed examples and are applicable todiscrete visibles. This, however, loses the holistic view offull sentences and can be inferior especially for modelingglobal abstract attributes (e.g., sentiment).Another challenge for controllable generation relates tolearning disentangled latent representations. Interpretabil-ity expects each part of the latent representation to governand only focus on one aspect of the samples. Prior meth-ods (Chen et al., 2016; Odena et al., 2016) on structuredrepresentation learning lack explicit enforcement of the in- a r X i v : . [ c s . L G ] S e p oward Controlled Generation of Text dependence property on the full latent representation, andvarying individual code may result in unexpected variationof other unspeciﬁed attributes besides the desired one.In this paper, we propose a new text generative modelthat addresses the above issues, permitting highly disen-tangled representations with designated semantic structure,and generating sentences with dynamically speciﬁed at-tributes. We base our generator on VAEs in combinationwith holistic discriminators of attributes for effective im-position of structures on the latent code. End-to-end opti-mization is enabled with differentiable softmax approxima-tion which anneals smoothly to discrete case and helps fastconvergence. The probabilistic encoder of VAE also func-tions as an additional discriminator to capture variationsof implicitly modeled aspects, and guide the generator toavoid entanglement during attribute code manipulation.Our model can be interpreted as enhancing VAEs withan extended wake-sleep procedure (Hinton et al., 1995),where the sleep phase enables incorporation of generatedsamples for learning both the generator and discriminatorsin an alternating manner. The generator and the discrim-inators effectively provide feedback signals to each other,resulting in an efﬁcient mutual bootstrapping framework.We show a little supervision (e.g., 100s of annotated sen-tences) is sufﬁcient to learn structured representations.Besides efﬁcient representation learning and enabled semi-supervised training, another advantage of using discrimi-nators as learning signals for the generator, as comparedto conventional conditional reconstruction based meth-ods (Wen et al., 2015; Kingma et al., 2014), is that dis-criminators of different attributes can be trained indepen-dently. That is, for each attribute one can use separatelabeled data for training the respective discriminator, andthe trained discriminators can be combined arbitrarily tocontrol a set of attributes of interest. In contrast, recon-struction based approaches typically require every instanceof the training data to be labeled exhaustively with all tar-get attributes (Wen et al., 2015), or to marginalize out anymissing attributes (Kingma et al., 2014) which can be com-putationally expensive.As a showing case, we apply our model to generate sen-tences with controlled sentiment and tenses. Though toour best knowledge there is no text corpus with both senti-ment and tense labels, our method enables to use separatedatasets, one with annotated sentiment and the other withtense labels. Quantitative experiments demonstrate the ef-ﬁcacy of our method. Our model improves over previousgenerative models on the accuracy of generating speciﬁedattributes as well as performing classiﬁcation using gen-erated samples. We show our method learns highly dis-entangled representations from only word-level labels, andproduces plausible short sentences.

2. Related Work

Remarkable progress has been made in deep generativemodeling. Hu et al. (2017) provide a uniﬁed view of adiverse set of deep generative methods. Variational Au-toencoders (VAEs) (Kingma & Welling, 2013) consist ofencoder and generator networks which encode a data exam-ple to a latent representation and generate samples from thelatent space, respectively. The model is trained by maxi-mizing a variational lower bound on the data log-likelihoodunder the generative model. A KL divergence loss is mini-mized to match the posterior of the latent code with a prior,which enables every latent code from the prior to decodeinto a plausible sentence. Without the KL regularization,VAEs degenerate to autoencoders and become inapplicablefor the generic generation. The vanilla VAEs are incom-patible with discrete latents as they hinder differentiableparameterization for learning the encoder. Wake-sleep al-gorithm (Hinton et al., 1995) introduced for learning deepdirected graphical models shares similarity with VAEs byalso combining an inference network with the generator.The wake phase updates the generator with samples gener-ated from the inference network on training data, while thesleep phase updates the inference network based on sam-ples from the generator. Our method combines VAEs withan extended wake-sleep in which the sleep procedure up-dates both the generator and inference network (discrimi-nators), enabling collaborative semi-supervised learning.Besides reconstruction in raw data space, discriminator-based metric provides a different way for generator learn-ing, i.e., the discriminator assesses generated samples andfeedbacks learning signals. For instance, GANs (Good-fellow et al., 2014) use a discriminator to feedback theprobability of a sample being recognized as a real exam-ple. Larsen et al. (2016) combine VAEs with GANs forenhanced image generation. Dosovitskiy & Brox (2016);Taigman et al. (2017) use discriminators to measure high-level perceptual similarity. Applying discriminators to textgeneration is hard due to the non-differentiability of dis-crete samples (Yu et al., 2017; Zhang et al., 2016; Kusner& Hernndez-Lobato, 2016). Bowman et al. (2015); Tanget al. (2016); Yang et al. (2017) instead use VAEs withoutdiscriminators. All these text generation methods do notlearn disentangled latent representations, resulting in ran-domized and uncontrollable samples. In contrast, disen-tangled generation in visual domain has made impressiveprogress. E.g., InfoGAN (Chen et al., 2016), which resem-bles the extended sleep procedure of our joint VAE/wake-sleep algorithm, disentangles latent representation in an un-supervised manner. The semantic of each dimension isobserved after training rather than designated by users ina controlled way. Siddharth et al. (2017); Kingma et al.(2014) base on VAEs and obtain disentangled image rep-resentations with semi-supervised learning. Zhou & Neu- oward Controlled Generation of Text big (2017) extend semi-supervised VAEs for text transduc-tion. In contrast, our model combines VAEs with discrim-inators which provide a better, holistic metric compared toelement-wise reconstruction. Moreover, most of these ap-proaches have only focused on the disentanglement of thestructured part of latent representations, while ignoring po-tential dependence of the structured code with attributes notexplicitly encoded. We address this by introducing an in-dependency constraint, and show its effectiveness for im-proved interpretability.

3. Controlled Generation of Text

Our model aims to generate plausible sentences condi-tioned on representation vectors which are endowed withdesignated semantic structures. For instance, to controlsentence sentiment, our model allocates one dimension ofthe latent representation to encode “positive” and “nega-tive” semantics, and generates samples with desired sen-timent by simply specifying a particular code. Beneﬁtingfrom the disentangled structure, each such code is able tocapture a salient attribute and is independent with other fea-tures. Our deep text generative model possesses severalmerits compared to prior work, as it 1) facilitates effectiveimposition of latent code semantics by enabling global dis-criminators to guide the discrete text generator learning;2) improves model interpretability by explicitly enforcingthe constraints on independent attribute controls; 3) per-mits efﬁcient semi-supervised learning and bootstrappingby synthesizing variational auto-encoders with a tailoredwake-sleep approach. We ﬁrst present the overview of ourframework ( § § We build our framework starting from variational auto-encoders ( §

2) which have been used for text genera-tion (Bowman et al., 2015), where sentence ˆ x is generatedconditioned on latent code z . The vanilla VAE employs anunstructured vector z in which the dimensions are entan-gled. To model and control the attributes of interest in aninterpretable way, we augment the unstructured variables z with a set of structured variables c each of which targets asalient and independent semantic feature of sentences.We want our sentence generator to condition on the com-bined vector ( z , c ) , and generate samples that fulﬁll theattributes as speciﬁed in the structured code c . Conditionalgeneration in the context of VAEs (e.g., semi-supervisedVAEs (Kingma et al., 2014)) is often learned by recon-structing observed examples given their feature code. How-ever, as demonstrated in visual domain, compared to com-puting element-wise distances in the data space, computingdistances in the feature space allows invariance to distract-ing transformations and provides a better, holistic metric. 𝑧 𝑐 GeneratorDiscriminators 𝑥$ Encoder 𝑥 Figure 1.

The generative model, where z is unstructured latentcode and c is structured code targeting sentence attributes to con-trol. Blue dashed arrows denote the proposed independency con-straint (section 3.2 for details), and red arrows denote gradientpropagation enabled by the differentiable approximation. Thus, for each attribute code in c , we set up an individ-ual discriminator to measure how well the generated sam-ples match the desired attributes, and drive the generator toproduce improved results. The difﬁculty of applying dis-criminators in our context is that text samples are discreteand non-differentiable, which breaks down gradient prop-agation from the discriminators to the generator. We usea continuous approximation based on softmax with a de-creasing temperature, which anneals to the discrete case astraining proceeds. This simple yet effective approach en-joys low variance and fast convergence.Intuitively, having an interpretable representation wouldimply that each structured code in c can independentlycontrol its target feature, without entangling with other at-tributes, especially those not explicitly modeled. We en-courage the independency by enforcing those irrelevant at-tributes to be completely captured in the unstructured code z and thus be separated from c that we will manipulate.To this end, we reuse the VAE encoder as an additionaldiscriminator for recognizing the attributes modeled in z ,and train the generator so that these unstructured attributescan be recovered from the generated samples. As a result,varying different attribute codes will keep the unstructuredattributes invariant as long as z is unchanged.Figure 1 shows the overall model structure. Our completemodel incorporates VAEs and attribute discriminators, inwhich the VAE component trains the generator to recon-struct real sentences for generating plausible text, while thediscriminators enforce the generator to produce attributescoherent with the conditioned code. The attribute discrim-inators are learned to ﬁt labeled examples to entail desig-nated semantics, as well as trained to explain samples fromthe generator. That is, the generator and the discrimina-tors form a pair of collaborative learners and provide feed-back signals to each other. The collaborative optimizationresembles wake-sleep algorithm. We show the combinedVAE/wake-sleep learning enables a highly efﬁcient semi-supervised framework, which requires only a little supervi-sion to obtain interpretable representation and generation. oward Controlled Generation of Text We now describe our model in detail, by presenting thelearning of generator and discriminators, respectively.

Generator Learning

The generator G is an LSTM-RNN for generating tokensequence ˆ x = { ˆ x , . . . , ˆ x T } conditioned on the latent code ( z , c ) , which depicts a generative distribution: ˆ x ∼ G ( z , c ) = p G ( ˆ x | z , c )= (cid:89) t p (ˆ x t | ˆ x is the temperature normally set to .The unstructured part z of the representation is modeledas continuous variables with standard Gaussian prior p ( z ) ,while the structured code c can contain both continu-ous and discrete variables to encode different attributes(e.g., sentiment categories, formality) with appropriateprior p ( c ) . Given observation x , the base VAE includesa conditional probabilistic encoder E to infer the latents z : z ∼ E ( x ) = q E ( z | x ) . (3) Let θ G and θ E denote the parameters of the generator G and the encoder E , respectively. The VAE is then opti-mized to minimize the reconstruction error of observed realsentences, and at the same time regularize the encoder to beclose to the prior p ( z ) : L VAE ( θ G , θ E ; x ) = KL ( q E ( z | x ) (cid:107) p ( z )) − E q E ( z | x ) q D ( c | x ) [log p G ( x | z , c )] , (4) where KL ( ·(cid:107)· ) is the KL-divergence; and q D ( c | x ) is theconditional distribution deﬁned by the discriminator D foreach structured variable in c : D ( x ) = q D ( c | x ) . (5) Here, for notational simplicity, we assume only one struc-tured variable and thus one discriminator, though ourmodel speciﬁcation can straightforwardly be applied tomany attributes. The distribution over ( z , c ) factors into q E and q D as we are learning disentangled representa-tions. Note that here the discriminator D and code c arenot learned with the VAE loss, but instead optimized with the objectives described shortly. Besides the reconstruc-tion loss which drives the generator to produce realisticsentences, the discriminator provides extra learning signalswhich enforce the generator to produce coherent attributethat matches the structured code in c . However, as it isimpossible to propagate gradients from the discriminatorthrough the discrete samples, we resort to a deterministiccontinuous approximation. The approximation replaces thesampled token ˆ x t (represented as a one-hot vector) at eachstep with the probability vector in Eq.(2) which is differ-entiable w.r.t the generator’s parameters. The probabilityvector is used as the output at the current step and the inputto the next step along the sequence of decision making. Theresulting “soft” generated sentence, denoted as (cid:101) G τ ( z , c ) , isfed into the discriminator to measure the ﬁtness to the tar-get attribute, leading to the following loss for improving G : L Attr ,c ( θ G ) = − E p ( z ) p ( c ) (cid:104) log q D ( c | (cid:101) G τ ( z , c )) (cid:105) . (6) The temperature τ (Eq.2) is set to τ → as training pro-ceeds, yielding increasingly peaked distributions that ﬁ-nally emulate discrete case. The simple deterministic ap-proximation effectively leads to reduced variance and fastconvergence during training, which enables efﬁcient learn-ing of the conditional generator. The diversity of genera-tion results is guaranteed since we use the approximationonly for attribute modeling and the base sentence genera-tion is learned through VAEs.With the objective in Eq.(6), each structured attribute ofgenerated sentences is controlled through the correspond-ing code in c and is independent with other variables in thelatent representation. However, it is still possible that otherattributes not explicitly modeled may also entangle with thecode in c , and thus varying a dimension of c can yield unex-pected variation of these attributes we are not interested in.To address this, we introduce the independency constraintwhich separates these attributes with c by enforcing themto be fully captured by the unstructured part z . Therefore,besides the attributes explicitly encoded in c , we also trainthe generator so that other non-explicit attributes can becorrectly recognized from the generated samples and matchthe unstructured code z . Instead of building a new discrim-inator, we reuse the variational encoder E which servesprecisely to infer the latents z in the base VAE. The lossis in the same form as with Eq.(6) except replacing the dis-criminator conditional q D with the encoder conditional q E : L Attr ,z ( θ G ) = − E p ( z ) p ( c ) (cid:104) log q E ( z | (cid:101) G τ ( z , c )) (cid:105) . (7) Note that, as the discriminator in Eq.(6), the encoder now The probability vector thus functions to average over theword embedding matrix to obtain a “soft” word embedding ateach step. oward Controlled Generation of Text performs inference over generated samples from the prior,as opposed to observed examples as in VAEs.Combining Eqs.(4)-(7) we obtain the generator objective: min θ G L G = L VAE + λ c L Attr ,c + λ z L Attr ,z , (8) where λ c and λ z are balancing parameters. The varia-tional encoder is trained by minimizing the VAE loss, i.e., min θ E L VAE . Discriminator Learning

The discriminator D is trained to accurately infer the sen-tence attribute and evaluate the error of recovering the de-sired feature as speciﬁed in the latent code. For instance,for categorical attribute, the discriminator can be formu-lated as a sentence classiﬁer; while for continuous targeta probabilistic regressor can be used. The discriminatoris learned in a different way compared to the VAE encoder,since the target attributes can be discrete which are not sup-ported in the VAE framework. Moreover, in contrast to theunstructured code z which is learned in an unsupervisedmanner, the structured variable c uses labeled examples toentail designated semantics. We derive an efﬁcient semi-supervised learning method for the discriminator.Formally, let θ D denote the parameters of the discrimina-tor. To learn speciﬁed semantic meaning, we use a set oflabeled examples X L = { ( x L , c L ) } to train the discrimi-nator D with the following objective: L s ( θ D ) = − E X L [log q D ( c L | x L )] . (9) Besides, the conditional generator G is also capable of syn-thesizing (noisy) sentence-attribute pairs ( ˆ x , c ) which canbe used to augment training data for semi-supervised learn-ing. To alleviate the issue of noisy data and ensure ro-bustness of model optimization, we incorporate a minimumentropy regularization term (Grandvalet et al., 2004; Reedet al., 2014). The resulting objective is thus: L u ( θ D ) = − E p G (ˆ x | z , c ) p ( z ) p ( c ) (cid:2) log q D ( c | ˆ x ) + β H ( q D ( c (cid:48) | ˆ x )) (cid:3) , (10) where H ( q D ( c (cid:48) | ˆ x )) is the empirical Shannon entropy ofdistribution q D evaluated on the generated sentence ˆ x ; and β is the balancing parameter. Intuitively, the minimumentropy regularization encourages the model to have highconﬁdence in predicting labels.The joint training objective of the discriminator using bothlabeled examples and synthesized samples is then given as: min θ D L D = L s + λ u L u , (11) where λ u is the balancing parameter. Algorithm 1

Controlled Generation of Text

Input:

A large corpus of unlabeled sentences X = { x } A few sentence attribute labels X L = { ( x L , c L ) } Parameters: λ c , λ z , λ u , β – balancing parameters1: Initialize the base VAE by minimizing Eq.(4) on X with c sampled from prior p ( c ) repeat

3: Train the discriminator D by Eq.(11)4: Train the generator G and the encoder E by Eq.(8) andminimizing Eq.(4), respectively.5: until convergence Output:

Left:

The VAE and wake procedure, corresponding toEq.(4).

Right:

The sleep procedure, corresponding to Eqs.(6)-(7) and (10). Black arrows denote inference and generation; reddashed arrows denote gradient propagation. The two steps in thesleep procedure, i.e., optimizing the discriminator and the gener-ator, respectively, are performed in an alternating manner.

Summarization and Discussion

We have derived our model and its learning procedure. Thegenerator is ﬁrst initialized by training the base VAE on alarge corpus of unlabeled sentences, through the objectiveof minimizing Eq.(4) with the latent code c at this timesampled from the prior distribution p ( c ) . The full model isthen trained by alternating the optimization of the generatorand the discriminator, as summarized in Algorithm 1.Our model can be viewed as combining the VAE frame-work with an extended wake-sleep method, as illustrated inFigure 2. Speciﬁcally, in Eq.(10), samples are producedby the generator and used as targets for maximum like-lihood training of the discriminator. This resembles thesleep phase of wake-sleep. Eqs.(6)-(7) further leverage thegenerated samples to improve the generator. We can seethe above together as an extended sleep procedure basedon “dream” samples obtained by ancestral sampling fromthe generative network. On the other hand, Eq.(4) samples c from the discriminator distribution q D ( c | x ) on observa-tion x , to form a target for training the generator, whichcorresponds to the wake phase. The effective combinationenables discrete latent code, holistic discriminator metrics,and efﬁcient mutual bootstrapping.Training of the discriminators need supervised data to im-pose designated semantics. Discriminators for different at-tributes can be trained independently on separate labeledsets. That is, the model does not require a sentence to be oward Controlled Generation of Text annotated with all attributes, but instead needs only inde-pendent labeled data for each individual attribute. More-over, as the labeled data are used only for learning attributesemantics instead of direct sentence generation, we are al-lowed to extend the data scope beyond labeled sentencesto, e.g., labeled words or phrases. As shown in the experi-ments (section 4), our method is able to effectively lift theword level knowledge to sentence level and generate con-vincing sentences. Finally, with the augmented unsuper-vised training in the sleep phrase, we show a little supervi-sion is sufﬁcient for learning structured representations.

4. Experiments

We apply our model to generate short sentences (length ≤

15) with controlled sentiment and tense. Quantitative ex-periments using trained classiﬁers as evaluators show ourmodel gives improved generation accuracy. Disentangledrepresentation is learned with a few labels or only wordannotations. We also validate the effect of the proposedindependency constraint for interpretable generation.

DatasetsSentence corpus.

We use a large IMDB text corpus (Diaoet al., 2014) for training the generative models. This isa collection of 350K movie reviews. We select sentencescontaining at most words, and replace infrequent wordswith the token “ < unk > ”. The resulting dataset containsaround 1.4M sentences with the vocabulary size of 16K. Sentiment.

To control the sentiment (“positive” or “neg-ative”) of generated sentences, we test on the following la-beled sentiment data: (1) Stanford Sentiment Treebank-2(

SST-full ) (Socher et al., 2013) consists of 6920/872/1821movie review sentences with binary sentiment annotationsin the train/dev/test sets, respectively. We use the 2837training examples with sentence length ≤ , and evalu-ate classiﬁcation accuracy on the original test set. (2) SST-small.

To study the size of labeled data required in thesemi-supervised learning for accurate attribute control, wesample a small subset from SST-full, containing only 250labeled sentences for training. (3)

Lexicon.

We also in-vestigate the effectiveness of our model in terms of usingword-level labels for sentence-level control. The lexiconfrom (Wilson et al., 2005) contains 2700 words with senti-ment labels. We use the lexicon for training by treating thewords as sentences, and evaluate on the SST-full test set.(4)

IMDB.

We collect a dataset from the IMDB corpusby randomly selecting positive and negative movie reviews.The dataset has 5K/1K/10K sentences in train/dev/test.

Tense.

The second attribute is the tense of the main verbin a sentence. Though no corpus with sentence tense an-notations is readily available, our method is able to learnfrom only labeled words and generate desired sentences.

Model DatasetSST-full SST-small LexiconS-VAE 0.822 0.679 0.660Ours

Table 1.

Sentiment accuracy of generated sentences. S-VAE(Kingma et al., 2014) and our model are trained on the three sen-timent datasets and generate 30K sentences, respectively.

We compile from the TimeBank (timeml.org) dataset andobtain a lexicon of 5250 words and phrases labeled withone of { “past”, “present”, “future” } . The lexicon mainlyconsists of verbs in different tenses (e.g., “was”, “will be”)as well as time expressions (e.g., “in the future”).Note that our method requires only separate labeled coporafor each attribute. And for the tense attribute only anno-tated words/phrases are used. Parameter Setting

The generator and encoder are set as single-layer LSTMRNNs with input/hidden dimension of 300 and max samplelength of 15. Discriminators are set as ConvNets. Detailedconﬁgurations are in the supplements. To avoid vanishinglysmall KL term in the VAE module (Eq.4) (Bowman et al.,2015), we use a KL term weight linearly annealing from 0to 1 during training. Balancing parameters are set to λ c = λ z = λ u = 0 . , and β is selected on the dev sets. At testtime sentences are generated with Eq.(1). We quantitatively measure sentence attribute control byevaluating the accuracy of generating designated sentiment,and the effect of using samples for training classiﬁers.We compare with semi-supervised VAE (S-VAE) (Kingmaet al., 2014), one of the few existing deep models capableof conditional text generation. S-VAE learns to reconstructobserved sentences given attribute code, and no discrimi-nators are used. See § c ,and use the pre-trained sentiment classiﬁer to assign senti-ment labels to the generated sentences. The accuracy iscalculated as the percentage of the predictions that matchthe sentiment code c . Table 1 shows the results on 30Ksentences by the two models which are trained with SST-full, SST-small, and Lexicon, respectively. We see that ourmethod consistently outperforms S-VAE on all datasets. Inparticular, trained with only 250 labeled examples in SST-small, our model achieves reasonable generation accuracy,demonstrating the ability of learning disentangled repre- oward Controlled Generation of Text SST-full SST-small Lexicon IMDB A cc u r a c y Std H-Reg Ours S-VAE

Figure 3.

Test-set accuracy of classiﬁers trained on four sentimentdatasets augmented with different methods (see text for details).The ﬁrst three datasets use the SST-full test set for evaluation. sentations with very little supervision. More importantly,given only word-level annotations in Lexicon, our modelsuccessfully transfers the knowledge to sentence level andgenerates desired sentiments reasonably well. Compared toour method that drives learning by directly assessing gen-erated sentences, S-VAE attempts to capture sentiment se-mantics only by reconstructing labeled words, which is lessefﬁcient and gives inferior performance.We next use the generated samples to augment the sen-timent datasets and train sentiment classiﬁers. Whilenot aiming to build best-performing classiﬁers on thesedatasets, the classiﬁcation accuracy serves as an auxiliarymeasure of the sentence generation quality. That is, higher-quality sentences with more accurate sentiment attributecan predictably help yield stronger sentiment classiﬁers.Figure 3 shows the accuracy of classiﬁers trained on thefour datasets with different augmentations. “Std” is a Con-vNet trained on the standard original datasets, with thesame network structure as with the sentiment discriminatorin our model. “H-reg” additionally imposes the minimumentropy regularization on the generated sentences. “Ours”incorporates the minimum entropy regularization and thesentiment attribute code c of the generated sentences, asin Eq.(10). S-VAE uses the same protocol as our methodto augment with the data generated by the S-VAE model.Comparison in Figure 3 shows that our method consistentlygives the best performance on four datasets. For instance,on Lexicon, our approach achieves 0.733 accuracy, com-pared to 0.701 of “Std”. The improvement of “H-Reg”over “Std” shows positive effect of the minimum entropyregularization on generated sentences. Further incorporat-ing the conditioned sentiment code of the generated sam-ples, as in “Ours” and “S-VAE”, provides additional perfor-mance gains, indicating the advantages of conditional gen-eration for automatic creation of labeled data. Consistentwith the above experiment, our model outperforms S-VAE. We study the interpretability of generation and the explicitindependency constraint (Eq.7) for disentangled control. Table 2 compares the samples generated by models withand without the constraint term, respectively. In the leftcolumn where the constraint applies, each pair of sen-tences, conditioned on different sentiment codes, are highlyrelevant in terms of, e.g., subject, tone, and wording whichare not explicitly modeled in the structured code c while in-stead implicitly encoded in the unstructured code z . Vary-ing the sentiment code precisely changes the sentiment ofthe sentences (and paraphrases slightly to ensure ﬂuency),while keeping other aspects unchanged. In contrast, theresults in the right column, where the independency con-straint is unactivated, show that varying the sentiment codenot only changes the polarity of samples, but can alsochange other aspects unexpected to control, making thegeneration results less interpretable and predictable.We demonstrate the power of learned disentangled repre-sentation by varying one attribute variable at a time. Table 3shows the generation results. We see that each attributevariable in our model successfully controls its correspond-ing attribute, and is disentangled with other attribute code.The right column of the table shows meaningful variationof sentence tense as the tense code varies. Note that thesemantic of tense is learned only from a lexicon withoutcomplete sentence examples. Our model successfully cap-tures the key ingredients (e.g., verb “was” for past tense and“will be” for future tense) and combines with the knowl-edge of well-formed sentences to generate realistic sampleswith speciﬁed tense attributes. Table 4 further shows gen-erated sentences with varying code z in different settingsof structured attribute factors. We obtain samples that arediverse in content while consistent in sentiment and tense.We also occasionally observed failure cases as in Table 5,such as implausible sentences, unexpected variations ofirrelevant attributes, and inaccurate attribute generations.Improved modeling is expected such as using dilated con-volutions as decoder, and decoding with beam search, etc.Better systematic quantitative evaluations are also desired.

5. Discussions

We have proposed a deep generative model that learns in-terpretable latent representations and generates sentenceswith speciﬁed attributes. We obtained meaningful genera-tion with restricted sentence length, and improved accuracyon sentiment and tense attributes. In the future we wouldlike to improve the modeling and training as above, andextend to generate longer sentences/paragraphs and controlmore attributes with ﬁne-grained structures.Our approach combines VAEs with attribute discrim-inators and imposes explicit independency constraintson attribute controls, enabling disentangled latent code.Semi-supervised learning within the joint VAE/wake-sleep oward Controlled Generation of Text w/ independency constraint w/o independency constraint the ﬁlm is strictly routine ! the acting is bad .the ﬁlm is full of imagination . the movie is so much fun .after watching this movie , i felt that disappointed . none of this is very original .after seeing this ﬁlm , i ’m a fan . highly recommended viewing for its courage , and ideas .the acting is uniformly bad either . too blandthe performances are uniformly good . highly watchablethis is just awful . i can analyze this movie without more than three words .this is pure genius . i highly recommend this ﬁlm to anyone who appreciates music .

Table 2.

Samples from models with or without independency constraint on attribute control (i.e., Eq.7). Each pair of sentences aregenerated with sentiment code set to “negative” and “positive”, respectively, while ﬁxing the unstructured code z . The SST-full datasetis used for learning the sentiment representation. Varying the code of tense i thought the movie was too bland and too much this was one of the outstanding thrillers of the last decadei guess the movie is too bland and too much this is one of the outstanding thrillers of the all timei guess the ﬁlm will have been too bland this will be one of the great thrillers of the all time

Table 3.

Each triple of sentences is generated by varying the tense code while ﬁxing the sentiment code and z . Varying the unstructured code z (“negative”, “past”) (“positive”, “past”) the acting was also kind of hit or miss . his acting was impeccablei wish i ’d never seen it this was spectacular , i saw it in theaters twiceby the end i was so lost i just did n’t care anymore it was a lot of fun (“negative”, “present”) (“positive”, “present”) the movie is very close to the show in plot and characters this is one of the better dance ﬁlmsthe era seems impossibly distant i ’ve always been a big fan of the smart dialogue .i think by the end of the ﬁlm , it has confused itself i recommend you go see this, especially if you hurt (“negative”, “future”) (“positive”, “future”) i wo n’t watch the movie i hope he ’ll make more movies in the futureand that would be devastating ! i will deﬁnitely be buying this on dvdi wo n’t get into the story because there really is n’t one you will be thinking about it afterwards, i promise you Table 4.

Samples by varying the unstructured code z given sentiment (“positive”/“negative”) and tense (“past”/“present”/“future”) code. Failure cases the plot is not so original it does n’t get any better the other dance moviesthe plot weaves us into < unk > it does n’t reach them , but the stories lookhe is a horrible actor ’s most part i just think sohe ’s a better actor than a standup i just think ! Table 5.

Failure cases when varying sentiment code with other codes ﬁxed. oward Controlled Generation of Text framework is effective with little or incomplete supervi-sion. Hu et al. (2017) develop a uniﬁed view of a diverseset of deep generative paradigms, including GANs, VAEs,and wake-sleep algorithm. Our model can be alternativelymotivated under the view as enhancing VAEs with the ex-tended sleep phase and by leveraging generated samples.Interpretability of the latent representations not only allowsdynamic control of generated attributes, but also providesan interface that connects the end-to-end neural model withconventional structured methods. For instance, we can en-code structured constraints (e.g., logic rules or probabilisticstructured models) on the interpretable latent code, to in-corporate prior knowledge or human intentions (Hu et al.,2016a;b); or plug the disentangled generation model intodialog systems to generate natural language responses fromstructured dialog states (Young et al., 2013).Though we have focused on the generation capacity of ourmodel, the proposed collaborative semi-supervised learn-ing framework also helps improve the discriminators bygenerating labeled samples for data augmentation (e.g., seeFigure 3). More generally, for any discriminative task, wecan build a conditional generative model to synthesize ad-ditional labeled data. The accurate attribute generation ofour approach can offer larger performance gains comparedto previous generative methods.

Implementation

We have released code for an adaptedversion of the proposed algorithm at: https://github.com/asyml/texar/tree/master/examples/text_style_transfer .The implementation is based on

Texar (Hu et al., 2018), a general-purpose text generation toolkit.

Acknowledgments

This research is supported by NSFIIS1447676, ONR N000141410684, and ONR N000141712463.

References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio,Yoshua. Neural machine translation by jointly learningto align and translate. arXiv preprint arXiv:1409.0473 ,2014.Bowman, Samuel R, Vilnis, Luke, Vinyals, Oriol, Dai, An-drew M, Jozefowicz, Rafal, and Bengio, Samy. Gener-ating sentences from a continuous space. arXiv preprintarXiv:1511.06349 , 2015.Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John,Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Inter-pretable representation learning by information max-imizing generative adversarial nets. In

Advances inNeural Information Processing Systems , pp. 2172–2180,2016. Diao, Qiming, Qiu, Minghui, Wu, Chao-Yuan, Smola,Alexander J, Jiang, Jing, and Wang, Chong. Jointly mod-eling aspects, ratings and sentiments for movie recom-mendation (JMARS). In

Proceedings of the 20th ACMSIGKDD international conference on Knowledge dis-covery and data mining , pp. 193–202. ACM, 2014.Dosovitskiy, Alexey and Brox, Thomas. Generating im-ages with perceptual similarity metrics based on deepnetworks. arXiv preprint arXiv:1602.02644 , 2016.Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu,Bing, Warde-Farley, David, Ozair, Sherjil, Courville,Aaron, and Bengio, Yoshua. Generative adversarial nets.In

Advances in Neural Information Processing Systems ,pp. 2672–2680, 2014.Grandvalet, Yves, Bengio, Yoshua, et al. Semi-supervisedlearning by entropy minimization. In

NIPS , volume 17,pp. 529–536, 2004.Hinton, Geoffrey E, Dayan, Peter, Frey, Brendan J, andNeal, Radford M. The “wake-sleep” algorithm for un-supervised neural networks.

Science , 268(5214):1158,1995.Hu, Zhiting, Ma, Xuezhe, Liu, Zhengzhong, Hovy, Eduard,and Xing, Eric. Harnessing deep neural networks withlogic rules. In

ACL , 2016a.Hu, Zhiting, Yang, Zichao, Salakhutdinov, Ruslan, andXing, Eric P. Deep neural networks with massive learnedknowledge. In

EMNLP , 2016b.Hu, Zhiting, Yang, Zichao, Salakhutdinov, Ruslan, andXing, Eric P. On unifying deep generative models. arXivpreprint arXiv:1706.00550 , 2017.Hu, Zhiting, Shi, Haoran, Yang, Zichao, Tan, Bowen,Zhao, Tiancheng, He, Junxian, Wang, Wentao, Yu,Xingjiang, Qin, Lianhui, Wang, Di, Ma, Xuezhe, Liu,Hector, Liang, Xiaodan, Zhu, Wanrong, Sachan, Deven-dra Singh, and Xing, Eric. Texar: A modularized, versa-tile, and extensible toolkit for text generation. 2018.Kingma, Diederik P and Welling, Max. Auto-encodingvariational Bayes. arXiv preprint arXiv:1312.6114 ,2013.Kingma, Diederik P, Mohamed, Shakir, Rezende,Danilo Jimenez, and Welling, Max. Semi-supervisedlearning with deep generative models. In

Advances inNeural Information Processing Systems , pp. 3581–3589,2014.Kusner, Matt and Hernndez-Lobato, Jos. GANs for se-quences of discrete elements with the Gumbel-softmaxdistribution. arXiv preprint arXiv:1611.04051 , 2016. oward Controlled Generation of Text

Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae,and Winther, Ole. Autoencoding beyond pixels usinga learned similarity metric. In

ICML , 2016.Mikolov, Tomas, Karaﬁ´at, Martin, Burget, Lukas, Cer-nock`y, Jan, and Khudanpur, Sanjeev. Recurrent neu-ral network based language model. In

Interspeech , vol-ume 2, pp. 3, 2010.Odena, Augustus, Olah, Christopher, and Shlens, Jonathon.Conditional image synthesis with auxiliary classiﬁerGANs. arXiv preprint arXiv:1610.09585 , 2016.Radford, Alec, Metz, Luke, and Chintala, Soumith. Un-supervised representation learning with deep convolu-tional generative adversarial networks. arXiv preprintarXiv:1511.06434 , 2015.Reed, Scott, Lee, Honglak, Anguelov, Dragomir, Szegedy,Christian, Erhan, Dumitru, and Rabinovich, Andrew.Training deep neural networks on noisy labels with boot-strapping. arXiv preprint arXiv:1412.6596 , 2014.Siddharth, N., Paige, Brooks, Desmaison, Alban, Meent,Jan-Willem van de, Wood, Frank, Goodman, Noah D.,Kohli, Pushmeet, and Torr, Philip H.S. Learning disen-tangled representations in deep generative models. 2017.Socher, Richard, Perelygin, Alex, Wu, Jean Y, Chuang,Jason, Manning, Christopher D, Ng, Andrew Y, Potts,Christopher, et al. Recursive deep models for semanticcompositionality over a sentiment treebank. In

Proceed-ings of the conference on empirical methods in naturallanguage processing (EMNLP) , volume 1631, pp. 1642.Citeseer, 2013.Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Se-quence to sequence learning with neural networks. In

Advances in neural information processing systems , pp.3104–3112, 2014.Taigman, Yaniv, Polyak, Adam, and Wolf, Lior. Unsuper-vised cross-domain image generation. In

ICLR , 2017.Tang, Shuai, Jin, Hailin, Fang, Chen, and Wang, Zhaowen.Unsupervised sentence representation learning with ad-versarial auto-encoder. 2016.van den Oord, Aaron, Kalchbrenner, Nal, andKavukcuoglu, Koray. Pixel recurrent neural networks.In

ICML , 2016.Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Er-han, Dumitru. Show and tell: A neural image captiongenerator. In

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pp. 3156–3164, 2015. Wen, Tsung-Hsien, Gasic, Milica, Mrksic, Nikola, Su, Pei-Hao, Vandyke, David, and Young, Steve. Semanticallyconditioned lstm-based natural language generation forspoken dialogue systems. In

EMNLP , 2015.Wilson, Theresa, Wiebe, Janyce, and Hoffmann, Paul. Rec-ognizing contextual polarity in phrase-level sentimentanalysis. In

Proceedings of the conference on humanlanguage technology and empirical methods in natu-ral language processing , pp. 347–354. Association forComputational Linguistics, 2005.Yang, Zichao, Hu, Zhiting, Salakhutdinov, Ruslan, andBerg-Kirkpatrick, Taylor. Improved variational autoen-coders for text modeling using dilated convolutions. In

ICML , 2017.Young, Steve, Gaˇsi´c, Milica, Thomson, Blaise, andWilliams, Jason D. POMDP-based statistical spoken di-alog systems: A review.

Proceedings of the IEEE , 101(5):1160–1179, 2013.Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. Se-qGAN: sequence generative adversarial nets with policygradient. In

AAAI , 2017.Zhang, Yizhe, Gan, Zhe, and Carin, Lawrence. Generat-ing text via adversarial training. In

NIPS Workshop onAdversarial Training , 2016.Zhou, Chunting and Neubig, Graham. Multi-space varia-tional encoder-decoders for semi-supervised labeled se-quence transduction. In

ACL , 2017.Zhu, Jun-Yan, Kr¨ahenb¨uhl, Philipp, Shechtman, Eli, andEfros, Alexei A. Generative visual manipulation on thenatural image manifold. In