Unsupervised Text Style Transfer using Language Models as Discriminators
Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, Taylor Berg-Kirkpatrick
Carnegie Mellon University, DeepMind
{zichaoy, zhitingh, epxing, tberg}@cs.cmu.edu, [email protected]
Abstract
Binary classifiers are often employed as discriminators in GAN-based unsupervised style transfer systems to ensure that transferred sentences are similar to sentences in the target domain. One difficulty with this approach is that the error signal provided by the discriminator can be unstable and is sometimes insufficient to train the generator to produce fluent language. In this paper, we propose a new technique that uses a target domain language model as the discriminator, providing richer and more stable token-level feedback during the learning process. We train the generator to minimize the negative log likelihood (NLL) of generated sentences, evaluated by the language model. By using a continuous approximation of discrete sampling under the generator, our model can be trained using back-propagation in an end-to-end fashion. Moreover, our empirical results show that when using a language model as a structured discriminator, it is possible to forgo adversarial steps during training, making the process more stable. We compare our model with previous work that uses convolutional networks (CNNs) as discriminators, as well as a broad set of other approaches. Results show that the proposed method achieves improved performance on three tasks: word substitution decipherment, sentiment modification, and related language translation.
Introduction

Recently there has been growing interest in designing natural language generation (NLG) systems that allow for control over various attributes of generated text, for example, sentiment and other stylistic properties. Such controllable NLG models have wide applications in dialogue systems (Wen et al., 2016) and other natural language interfaces. Recent successes of neural text generation models in machine translation (Bahdanau et al., 2014), image captioning (Vinyals et al., 2015) and dialogue (Vinyals and Le, 2015; Wen et al., 2016) have relied on massive parallel data. However, for many other domains, only non-parallel data, i.e., collections of sentences from each domain without explicit correspondence, is available. Many text style transfer problems fall into this category. The goal of these tasks is to transfer a sentence with one attribute into a sentence with another attribute but the same style-independent content, trained using only non-parallel data.

Unsupervised text style transfer requires learning disentangled representations of attributes (e.g., negative/positive sentiment, plaintext/ciphertext orthography) and underlying content. This is challenging because the two interact in subtle ways in natural language, and it can be hard to disentangle them even with parallel data. The recent development of deep generative models like variational auto-encoders (VAEs) (Kingma and Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014) has made learning disentangled representations from non-parallel data possible. However, despite their rapid progress in computer vision, for example, generating photo-realistic images (Radford et al., 2015), learning interpretable representations (Chen et al., 2016b), and translating images (Zhu et al., 2017), their progress on text has been more limited. For VAEs, the problem of training collapse can severely limit effectiveness (Bowman et al., 2015; Yang et al., 2017b), and when applying adversarial training to natural language, the non-differentiability of discrete word tokens makes generator optimization difficult. Hence, most attempts use REINFORCE (Sutton et al., 2000) to finetune trained models (Yu et al., 2017; Li et al., 2017) or use professor forcing (Lamb et al., 2016) to match the hidden states of decoders.

Previous work on unsupervised text style transfer (Hu et al., 2017a; Shen et al., 2017) adopts an encoder-decoder architecture with style discriminators to learn disentangled representations. The encoder takes a sentence as input and outputs a style-independent content representation. The style-dependent decoder takes the content representation and a style representation and generates the transferred sentence. Hu et al. (2017a) use a style classifier to directly enforce the desired style in the generated text. Shen et al. (2017) leverage an adversarial training scheme in which a binary CNN-based discriminator evaluates whether a transferred sentence is real or fake, ensuring that transferred sentences match real sentences in terms of target style. However, in practice, the error signal from a binary classifier is sometimes insufficient to train the generator to produce fluent language, and optimization can be unstable as a result of the adversarial training step.

We propose to use an implicitly trained language model as a new type of discriminator, replacing the more conventional binary classifier.
The language model calculates a sentence's likelihood, which decomposes into a product of token-level conditional probabilities. In our approach, rather than training a binary classifier to distinguish real and fake sentences, we train the language model to assign a high probability to real sentences and train the generator to produce sentences with high probability under the language model. Because the language model scores sentences directly using a product of locally normalized probabilities, it may offer a more stable and more useful training signal to the generator. Further, by using a continuous approximation of discrete sampling under the generator, our model can be trained using back-propagation in an end-to-end fashion.

We find empirically that when using the language model as a structured discriminator, it is possible to eliminate adversarial training steps that use negative samples, a critical part of traditional adversarial training. Language models are implicitly trained to assign low probability to negative samples because of their normalization constant. By eliminating the adversarial training step, we found that training becomes more stable in practice.

To demonstrate the effectiveness of our new approach, we conduct experiments on three tasks: word substitution decipherment, sentiment modification, and related language translation. We show that our approach, which uses only a language model as the discriminator, outperforms a broad set of state-of-the-art approaches on the three tasks. We start by reviewing current approaches for unsupervised text style transfer (Hu et al., 2017a; Shen et al., 2017), and then go on to describe our approach in Section 3.

Unsupervised Text Style Transfer

Assume we have two text datasets X = {x^(1), x^(2), ..., x^(m)} and Y = {y^(1), y^(2), ..., y^(n)} with two different styles v_x and v_y, respectively. For example, v_x can be the positive sentiment style and v_y the negative sentiment style. The datasets are non-parallel: the data does not contain pairs (x^(i), y^(j)) that describe the same content. The goal of style transfer is to transfer data x with style v_x to style v_y and vice versa, i.e., to estimate the conditional distributions p(y | x) and p(x | y). Since text data is discrete, it is hard to learn the transfer function directly via back-propagation as in computer vision (Zhu et al., 2017). Instead, we assume the data is generated conditioned on two disentangled parts, the style v and the content z (Hu et al., 2017a).

Consider the following generative process for each style: 1) the style representation v is sampled from a prior p(v); 2) the content vector z is sampled from p(z); 3) the sentence x is generated from the conditional distribution p(x | z, v). This model suggests the following parametric form for style transfer, where q represents a posterior:

p(y | x) = ∫_{z_x} p(y | z_x, v_y) q(z_x | x, v_x) dz_x.

We drop subscripts wherever the meaning is clear. To transfer a sentence, we first encode x to obtain its content vector z_x, then switch the style label from v_x to v_y. Combining the content vector z_x and the style label v_y, we can generate a new sentence x̃ (we denote the transferred sentences as x̃ and ỹ).
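As a minimal illustration of this encode-swap-decode procedure, the following Python sketch makes the pipeline concrete; E and G are hypothetical callables standing in for the trained encoder and decoder, not the paper's implementation:

```python
# A sketch of the transfer step: encode x with its own style label to obtain
# the content vector z_x, then decode with the opposite style label v_y.
# E and G are hypothetical stand-ins for the trained networks.
def transfer(E, G, x, v_x, v_y):
    z_x = E(x, v_x)        # style-independent content, a sample from q(z_x | x, v_x)
    x_tilde = G(z_x, v_y)  # transferred sentence x~ from p(x~ | z_x, v_y)
    return x_tilde
```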
One unsupervised approach is to use an auto-encoder model. We first use an encoder model E to encode x and y into content vectors z_x = E(x, v_x) and z_y = E(y, v_y), and then use a decoder G to generate sentences conditioned on z and v. Together, E and G form an auto-encoder, and the reconstruction loss is:

L_rec(θ_E, θ_G) = E_{x∼X}[−log p_G(x | z_x, v_x)] + E_{y∼Y}[−log p_G(y | z_y, v_y)],

where v_x and v_y can be two learnable vectors representing the label embeddings. To make sure that z_x and z_y capture the content, and that we can deliver accurate transfer between the styles by switching the labels, we need to guarantee that z_x and z_y follow the same distribution. We can assume p(z) follows a prior distribution and add a KL-divergence regularization on z_x and z_y; the model then becomes a VAE. However, previous work (Bowman et al., 2015; Yang et al., 2017b) found a training collapse problem with VAEs for text modeling: the posterior distribution of z fails to capture the content of a sentence.

To better capture the desired styles in the generated sentences, Hu et al. (2017a) additionally impose a style classifier on the generated samples, and the decoder G is trained to generate sentences that maximize the accuracy of the style classifier. Such additional supervision with a discriminative model is also adopted by Shen et al. (2017), though in that work a binary real/fake classifier is instead used within a conventional adversarial scheme.
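The following PyTorch sketch shows one way to realize E, G, and the reconstruction loss L_rec with learnable label embeddings. It is illustrative only: the authors implement their model in TensorFlow with Texar, and all sizes and names here are assumptions.

```python
# Minimal sketch of the encoder-decoder with style label embeddings and L_rec.
import torch
import torch.nn as nn

class StyleAutoEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=500, n_styles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.style_embed = nn.Embedding(n_styles, hidden_dim)  # label embeddings v_x, v_y
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, style):
        # Encode the sentence into a content vector z (final hidden state).
        _, z = self.encoder(self.embed(tokens))                # (1, B, H)
        # Combine content z with the style embedding to initialize the decoder.
        h0 = z + self.style_embed(style).unsqueeze(0)          # (1, B, H)
        out, _ = self.decoder(self.embed(tokens[:, :-1]), h0)  # teacher forcing
        return self.proj(out)                                  # next-token logits

def reconstruction_loss(model, tokens, style):
    # One term of L_rec: -log p_G(x | z_x, v_x), averaged over tokens.
    logits = model(tokens, style)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

In the full objective this reconstruction term is combined with the regularization or discriminator losses described next.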
Adversarial Training: Shen et al. (2017) use adversarial training to align the z distributions. Not only do we want to align the distributions of z_x and z_y, but we also want the sentence x̃ transferred from x to resemble y, and vice versa. Several adversarial discriminators are introduced to align these distributions; each is a binary classifier distinguishing between real and fake. Specifically, the discriminator D_z aims to distinguish between z_x and z_y:

L^z_adv(θ_E, θ_{D_z}) = E_{x∼X}[−log D_z(z_x)] + E_{y∼Y}[−log(1 − D_z(z_y))].

Similarly, D_x distinguishes between x and ỹ, yielding an objective L^x_adv as above, and D_y distinguishes between y and x̃, yielding L^y_adv. Since the samples x̃ and ỹ are discrete and it is hard to train the generator end-to-end, professor forcing (Lamb et al., 2016) is used to match the distributions of the hidden states of the decoders. The overall training objective is a min-max game played among the encoder E, the decoder G, and the discriminators D_z, D_x, D_y (Goodfellow et al., 2014):

min_{E,G} max_{D_z,D_x,D_y} L_rec − λ(L^z_adv + L^x_adv + L^y_adv).

The model is trained in an alternating manner. In the first step, the losses of the discriminators are minimized so that they distinguish between z_x, x, y and z_y, x̃, ỹ, respectively; in the second step, the encoder and decoder are trained to minimize the reconstruction loss while maximizing the loss of the discriminators.

Language Models as Discriminators

In most past work, a classifier is used as the discriminator to distinguish whether a sentence is real or fake. We propose instead to use locally normalized language models as discriminators. We argue that an explicit language model with token-level, locally normalized probabilities offers a more direct training signal to the generator. If a transferred sentence does not match the target style, it will have high perplexity when evaluated by a language model trained on target domain data. A language model not only provides an overall evaluation score for the whole sentence; it can also assign a probability to each token, providing more information about which word is to blame if the overall perplexity is very high.

Figure 1: The overall model architecture consists of two parts: reconstruction and transfer. For transfer, we switch the style label and sample an output sentence from the generator that is evaluated by a language model.

The overall model architecture is shown in Figure 1. Suppose x̃ is the output sentence from applying style transfer to input sentence x, i.e., x̃ is sampled from p_G(x̃ | z_x, v_y) (and similarly for ỹ and y). Let p_LM(x) be the probability of a sentence x evaluated by a language model. The discriminator loss then becomes:

L^x_LM(θ_E, θ_G, θ_{LM_x}) = E_{x∼X}[−log p_{LM_x}(x)] + γ E_{y∼Y, ỹ∼p_G(ỹ|z_y,v_x)}[log p_{LM_x}(ỹ)],   (1)

L^y_LM(θ_E, θ_G, θ_{LM_y}) = E_{y∼Y}[−log p_{LM_y}(y)] + γ E_{x∼X, x̃∼p_G(x̃|z_x,v_y)}[log p_{LM_y}(x̃)].   (2)

Our overall objective becomes:

min_{E,G} max_{LM_x,LM_y} L_rec − λ(L^x_LM + L^y_LM).   (3)
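To make Equation 1 concrete, here is a small Python sketch of the language model discriminator loss; lm_nll is a hypothetical callable returning the negative log likelihood of a sentence under LM_x, and gamma is the negative-sample weight discussed next:

```python
# Sketch of the discriminator loss in Equation 1 (L^x_LM). The LM is trained
# to assign high likelihood to real sentences; transferred sentences enter
# with weight gamma and opposite sign, so their likelihood is pushed down.
# lm_nll is a hypothetical callable: sentence -> negative log likelihood.
def lm_discriminator_loss(lm_nll, real_batch, transferred_batch, gamma=1.0):
    loss_real = sum(lm_nll(x) for x in real_batch) / len(real_batch)
    # E[log p_LM(y~)] = -E[nll(y~)]; minimizing this term raises nll(y~),
    # i.e., lowers the probability the LM assigns to transferred sentences.
    loss_fake = -sum(lm_nll(y) for y in transferred_batch) / len(transferred_batch)
    # gamma = 0 recovers plain language model training on real data only.
    return loss_real + gamma * loss_fake
```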
Negative samples: Note that Equations 1 and 2 differ from the traditional way of training language models in that they include a term with negative samples. We train the LM adversarially by minimizing its loss on real sentences and maximizing its loss on transferred sentences. However, since the LM is a structured discriminator, we would hope that a language model trained on real sentences automatically assigns high perplexity to sentences not in the target domain, so negative samples from the generator may not be necessary. To investigate this, we add a weight γ to the loss of negative samples; γ adjusts the negative-sample loss when training the language models. If γ = 0, we simply train the language model on real sentences and fix its parameters, avoiding potentially unstable adversarial training steps. We investigate the necessity of using negative samples in the experiment section.

Training alternates between two steps. In the first step, we train the language models according to Equations 1 and 2. In the second step, we minimize the reconstruction loss as well as the perplexity of generated samples evaluated by the language model. Since x̃ is discrete, one can use the REINFORCE (Sutton et al., 2000) algorithm to train the generator:

∇_{θ_G} L^y_LM = E_{x∼X, x̃∼p_G(x̃|z_x,v_y)}[log p_LM(x̃) ∇_{θ_G} log p_G(x̃ | z_x, v_y)].   (4)

However, using a single sample to approximate the expected gradient leads to high variance in gradient estimates and thus unstable learning.

Continuous approximation: Instead, we propose to use a continuous approximation of the sampling process when training the generator, as demonstrated in Figure 2. Instead of feeding a single sampled word as the input to the next timestep of the generator, we use a sample from a Gumbel-softmax (Jang et al., 2016) distribution as a continuous approximation (see the code sketch below). Let u be a categorical distribution with probabilities π_1, π_2, ..., π_c. Samples from u can be approximated using:

p_i = exp((log π_i + g_i)/τ) / Σ_{j=1}^{c} exp((log π_j + g_j)/τ),

where the g_i are independent samples from Gumbel(0, 1).

Let the tokens of the transferred sentence be x̃ = {x̃_t}_{t=1}^{T}. Suppose the output logit at timestep t is v_{x_t}; then p̃_{x_t} = Gumbel-softmax(v_{x_t}, τ), where τ is the temperature. As τ → 0, p̃_{x_t} approaches the one-hot representation of token x̃_t. Using the continuous approximation, the output of the decoder becomes a sequence of probability vectors p̃_x = {p̃_{x_t}}_{t=1}^{T}.

Figure 2: Continuous approximation of the language model loss. The input is a sequence of probability distributions {p̃_{x_t}}_{t=1}^{T} sampled from the generator. At each timestep, we compute a weighted embedding as input to the language model and obtain the sequence of output distributions from the LM, {p̂_{x_t}}_{t=1}^{T}. The loss is the sum of cross entropies between each pair p̃_{x_t} and p̂_{x_t}.
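A minimal NumPy sketch of the Gumbel-softmax approximation above (illustrative; the paper's implementation is in TensorFlow):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Continuous approximation of sampling from softmax(logits).

    As tau -> 0, the returned vector approaches a one-hot sample.
    """
    rng = rng or np.random.default_rng()
    # g_i ~ Gumbel(0, 1), drawn via inverse transform sampling of uniforms.
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))
    y = (np.asarray(logits) + g) / tau
    y -= y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Example: a soft "sample" over a 5-word vocabulary at two temperatures.
logits = np.log([0.1, 0.2, 0.4, 0.2, 0.1])
print(gumbel_softmax(logits, tau=1.0))   # smooth distribution
print(gumbel_softmax(logits, tau=0.05))  # nearly one-hot
```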
With the continuous approximation of x̃, we can easily calculate the loss evaluated by a language model, as shown in Figure 2. At every step, we feed p̃_{x_t} to the language model of y (denoted LM_y) through the weighted average of the embeddings, W_e p̃_{x_t}; the output of LM_y is then a probability distribution over the vocabulary for the next word, p̂_{x_{t+1}}. The loss at the current step is the cross entropy between p̃_{x_{t+1}} and p̂_{x_{t+1}}: (p̃_{x_{t+1}})^⊤ log p̂_{x_{t+1}}. Note that when the decoder output distribution p̃_{x_{t+1}} aligns with the language model output distribution p̂_{x_{t+1}}, this loss achieves its minimum. By summing the loss over all steps and taking the gradient, we can use standard back-propagation to train the generator:

∇_{θ_G} L^y_LM ≈ E_{x∼X, p̃_x∼p_G(x̃|z_x,v_y)}[∇_{θ_G} Σ_{t=1}^{T} (p̃_{x_t})^⊤ log p̂_{x_t}].   (5)

The above equation is a continuous approximation of Equation 4 with the Gumbel-softmax distribution. In experiments, we use a single sample of p̃_x to approximate the expectation.

Note that the language model discriminator is used somewhat differently in the two types of training update steps because of the continuous approximation: we use discrete samples from the generators as negative samples when training the language model discriminators, while we use the continuous approximation when updating the generator according to Equation 5.
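A PyTorch sketch of this stepwise loss (again illustrative, not the authors' TensorFlow code), including a per-length normalization, one of the stabilization tricks discussed in the next paragraph; all parameter names are assumptions:

```python
import torch

def continuous_lm_loss(p_tilde, lm_embed, lm_gru, lm_proj, h0):
    """Length-normalized LM loss over soft generator outputs (Figure 2 / Eq. 5).

    p_tilde:  (B, T, V) probability vectors p~_{x_t} from the generator.
    lm_embed: (V, E) embedding matrix W_e of the language model.
    lm_gru:   a torch.nn.GRU(E, H, batch_first=True); lm_proj: Linear(H, V).
    h0:       (1, B, H) initial LM hidden state.
    """
    B, T, V = p_tilde.shape
    loss, h = p_tilde.new_zeros(()), h0
    for t in range(T - 1):
        inp = p_tilde[:, t, :] @ lm_embed        # weighted embedding W_e p~_t
        out, h = lm_gru(inp.unsqueeze(1), h)     # one LM step
        log_p_hat = torch.log_softmax(lm_proj(out.squeeze(1)), dim=-1)
        # Cross entropy between p~_{t+1} and p^_{t+1}.
        loss = loss - (p_tilde[:, t + 1, :] * log_p_hat).sum(-1).mean()
    return loss / (T - 1)                        # normalize by length
```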
Overcoming mode collapse: It is known that in adversarial training the generator can suffer from mode collapse (Arjovsky and Bottou, 2017; Hu et al., 2017b), where the samples from the generator cover only part of the data distribution. In preliminary experiments, we found that the language model prefers short sentences. To overcome this length bias, we use two tricks: 1) we normalize the loss in Equation 5 by length, and 2) we fix the length of x̃ to be the same as that of x. We find that these two tricks stabilize training and avoid collapsed, overly short outputs.

Experiments

To verify the effectiveness of our model, we experiment on three tasks: word substitution decipherment, sentiment modification, and related language translation. We mainly compare with the most comparable approach of Shen et al. (2017), which uses CNN classifiers as discriminators. Note that Shen et al. (2017) use three discriminators to align both z and the decoder hidden states, while our model uses only a single language model as a discriminator directly on the output sentences x̃, ỹ. We also compare with a broader set of related work (Hu et al., 2017a; Fu et al., 2017; Li et al., 2018) where appropriate. Our proposed model provides substantial improvements in most cases. We implement our model with the Texar (Hu et al., 2018b) toolbox based on TensorFlow (Abadi et al., 2016). We use the code from https://github.com/shentianxiao/language-style-transfer. (* We run the code open-sourced by the authors to obtain the results.)

Table 1: Word substitution decipherment results. LM + adv denotes that we use negative samples to train the language model.

Table 2: Sentiment modification results. X = negative, Y = positive. PPL_X denotes the perplexity of sentences transferred from positive sentences, evaluated by a language model trained on negative sentences, and vice versa.

Model                Accu   BLEU   PPL_X   PPL_Y
Shen et al. (2017)   79.5   12.4    50.4    52.7
Hu et al. (2017a)    87.7
Our results:
LM                   83.3   38.6
LM + Classifier

Word Substitution Decipherment

As the first task, we consider the word substitution decipherment task previously explored in the NLP literature (Dou and Knight, 2012). In word substitution decipherment we can control the amount of change to the original sentences, so as to systematically investigate how well the language model performs on a task that requires varying amounts of change. In a word substitution cipher, every token in the vocabulary is mapped to a cipher token, and the tokens in sentences are replaced with cipher tokens according to the cipher dictionary. The task of decipherment is to recover the original text without any knowledge of the dictionary.
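A small Python sketch of how such a cipher can be constructed; the helper names are hypothetical, and the rate parameter mimics the controllable amount of change mentioned above:

```python
import random

def make_cipher(vocab, rate=1.0, seed=0):
    """Map a fraction `rate` of vocabulary tokens to shuffled cipher tokens."""
    rng = random.Random(seed)
    chosen = rng.sample(list(vocab), int(rate * len(vocab)))
    shuffled = list(chosen)
    rng.shuffle(shuffled)
    mapping = dict(zip(chosen, shuffled))
    # Tokens outside `chosen` map to themselves (partial substitution).
    return {tok: mapping.get(tok, tok) for tok in vocab}

def encipher(sentence, cipher):
    return " ".join(cipher[tok] for tok in sentence.split())

vocab = ["the", "food", "was", "good", "bad", "service"]
cipher = make_cipher(vocab, rate=1.0)  # rate controls the amount of change
print(encipher("the food was good", cipher))
```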
Data: Following Shen et al. (2017), we sample 200K sentences from the Yelp review dataset as plaintext X, and sample another 200K sentences and apply a word substitution cipher to them to obtain Y. We use another 100K parallel sentences each as the development and test sets. Sentences of length more than 15 are filtered out. We keep all words that appear more than 5 times in the training set, giving a vocabulary size of about 10K; all words appearing less than 5 times are replaced with an "<unk>" token.
Sentiment Modification

Data: We use the same dataset as Shen et al. (2017). It contains 250K negative sentences (denoted X) and 380K positive sentences (denoted Y), of which 70% are used for training, 10% for development, and the remaining 20% as the test set. The preprocessing steps are the same as in the previous experiment, and we use a similar experimental configuration.

Evaluation: Evaluating the quality of transferred sentences is challenging, as there are no ground truth sentences. We follow previous papers in using model-based evaluation. We measure whether transferred sentences have the correct sentiment according to a pre-trained sentiment classifier, following both Hu et al. (2017a) and Shen et al. (2017) in using a CNN-based classifier. However, simply evaluating the sentiment of sentences is not enough, since the model could produce collapsed output such as the single word "good" for every negative-to-positive transfer and "bad" for every positive-to-negative transfer. We would like transferred sentences not only to preserve the content of the original sentences but also to be fluent. For these two aspects, we measure the BLEU score of transferred sentences against the original sentences, and the perplexity of transferred sentences to evaluate fluency. A good model should perform well on all three metrics.
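A sketch of this three-metric protocol in Python; the classifier and language model are hypothetical stand-ins for the pre-trained models, and NLTK's sentence-level BLEU is used for illustration, which may differ from the exact BLEU variant used in the paper:

```python
import math
from nltk.translate.bleu_score import sentence_bleu

def evaluate(pairs, target_label, classifier, lm_nll):
    """pairs: list of (source_tokens, transferred_tokens) tuples.

    classifier maps a token list to a sentiment label; lm_nll returns the
    total negative log likelihood of a token list under a target-domain LM.
    """
    # Transfer accuracy according to the pre-trained sentiment classifier.
    acc = sum(classifier(y) == target_label for _, y in pairs) / len(pairs)
    # Content preservation: BLEU of each transfer against its source sentence.
    bleu = sum(sentence_bleu([x], y) for x, y in pairs) / len(pairs)
    # Fluency: perplexity = exp of the mean per-token negative log likelihood.
    total_nll = sum(lm_nll(y) for _, y in pairs)
    total_tokens = sum(len(y) for _, y in pairs)
    ppl = math.exp(total_nll / total_tokens)
    return acc, bleu, ppl
```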
Results: We report the results in Table 2. As a baseline, the original corpus has perplexities of . and . for the negative and positive sentences, respectively. Comparing LM with Shen et al. (2017), we see that LM outperforms it in all three aspects: it achieves higher accuracy, preserves the content better, and is more fluent. This demonstrates the effectiveness of using an LM as the discriminator. Hu et al. (2017a) has the highest accuracy and BLEU score among the three models, but its perplexity is very high. This is not surprising: the classifier only modifies the features of the sentences that are related to sentiment, and there is no mechanism to ensure that the modified sentence is fluent, hence the high perplexity. We can obtain the best of both models by combining the loss of the LM and the classifier of Hu et al. (2017a): a classifier is good at modifying the sentiment, and an LM can smooth the modification to produce a fluent sentence. We find improvements in accuracy and perplexity for LM + classifier compared to the classifier alone (Hu et al., 2017a).

Comparing with other models: Recently, other models have been proposed specifically for the sentiment modification task, such as Li et al. (2018). Their method is feature based and consists of the following steps: (Delete) first, they use word frequency statistics to delete attribute words such as "good" and "bad" from the original sentences; (Retrieve) then, they retrieve the most similar sentences from the other corpus based on nearest neighbor search; (Generate) finally, the attribute words from the retrieved sentences are combined with the content words of the original sentences to generate the transferred sentences. The authors provide 500 human-annotated sentences as ground truth for the transferred sentences, so we measure the BLEU score against those sentences. The results are shown in Table 3. Our model has accuracy similar to that of DeleteAndRetrieve, but much better BLEU scores and slightly better perplexity.

We list some examples of transferred sentences in Table 5 in the appendix. We can see that Shen et al. (2017) does not keep the content of the original sentences well and changes their meaning. Hu et al. (2017a) changes the sentiment but uses improper words, e.g., "maintenance is equally hilarious". Our LM changes the sentiment of sentences, but sometimes exhibits an over-smoothing problem, replacing less frequent words with more frequent ones, e.g., changing "my goodness it was so gross" to "my food it was so good". In general, LM + classifier obtains the best results: it changes the sentiment while keeping the content, and the sentences are fluent.

Table 3: Sentiment modification results evaluated against the 500 human-annotated sentences.

Model                 ACCU   BLEU   PPL_X   PPL_Y
Shen et al. (2017)    76.2    6.8    49.4    45.6
Fu et al. (2017):
  StyleEmbedding       9.2   16.65   97.51  142.6
  MultiDecoder        50.9   11.24  111.1   119.1
Li et al. (2018):
  Delete              87.2   11.5    75.2    68.7
  Template            86.7   18.0   192.5   148.4
  Retrieval
  DeleteAndRetrieve   90.9   12.6   104.6    43.8
Our results:
  LM                  85.4   13.4    32.8    40.5
  LM + Classifier     90.0
Related Language Translation

In the final experiment, we consider a more challenging task: unsupervised related language translation (Pourdamghani and Knight, 2017). Related language translation is easier than translation between arbitrary language pairs, since the two languages are closely related. Note that we do not compare with more sophisticated unsupervised neural machine translation systems such as (Lample et al., 2017; Artetxe et al., 2017), whose models are much more complicated and use additional techniques such as back-translation; we simply compare the different types of discriminators in the context of a simple model.
Data: We choose Bosnian (bs) vs. Serbian (sr) and simplified Chinese (zh-CN) vs. traditional Chinese (zh-TW) as our language pairs. Since parallel data is lacking for these pairs, we build the datasets ourselves. For the bs-sr pair, we use the monolingual news data from the Leipzig Corpora Collection (http://wortschatz.uni-leipzig.de/en) and sample about 200K sentences of length less than 20 for each language, of which 80% are used for training, 10% for validation, and the remaining 10% for testing. For validation and testing, we obtain a parallel corpus using the Google Translation API. The vocabulary size is 25K for the sr-bs language pair. For the zh-CN and zh-TW pair, we use monolingual data from the Chinese Gigaword corpus, taking the news headlines as our training data; 300K sentences are sampled for each language. The data is partitioned and parallel data is obtained in the same way as for the sr-bs pair. We use a character-based model, with a total vocabulary size of about 5K. For evaluation, we directly measure the BLEU score against the references for both language pairs.

Note that the relationship between zh-CN and zh-TW is simple, mostly like a decipherment problem in which some simplified Chinese characters have corresponding traditional character mappings. The relationship between bs and sr is more complicated.

Results: The results are shown in Table 4. For sr-bs and bs-sr, the vocabularies of the two languages do not overlap at all, making this a very challenging task; we report the BLEU-1 metric since BLEU-4 is close to 0. We can see that our language model discriminator still slightly outperforms Shen et al. (2017). The zh-CN and zh-TW directions are much easier: simple copying already achieves a reasonable score of 32.3. Using our model, we improve this to 81.6 for zh-CN to zh-TW and 85.5 for zh-TW to zh-CN, outperforming Shen et al. (2017) by a large margin.

Table 4: Related language translation results, measured in BLEU. The results for sr vs. bs are measured in BLEU-1, while zh-CN vs. zh-TW is measured in BLEU.
Related Work

Non-parallel transfer in natural language: (Hu et al., 2017a; Shen et al., 2017; Prabhumoye et al., 2018; Gomez et al., 2018) are most relevant to our work. Hu et al. (2017a) aim to generate sentences with controllable attributes by learning disentangled representations. Shen et al. (2017) introduce adversarial training to unsupervised text style transfer; they apply discriminators both to the encoder representation and to the hidden states of the decoders, to ensure that they have the same distribution. These are the two models with which we mainly compare. Prabhumoye et al. (2018) use the back-translation technique in their model, which is complementary to our method and could be integrated into our model to further improve performance. Gomez et al. (2018) use a GAN-based approach to decipher shift ciphers. (Lample et al., 2017; Artetxe et al., 2017) propose unsupervised machine translation and use adversarial training to match the encoder representations of sentences from different languages; they also use back-translation to refine their models iteratively.
GANs: GANs have been widely explored recently, especially in computer vision (Zhu et al., 2017; Chen et al., 2016b; Radford et al., 2015; Sutton et al., 2000; Salimans et al., 2016; Denton et al., 2015; Isola et al., 2017). The progress of GANs on text is relatively limited because of the non-differentiability of discrete tokens. Many papers (Yu et al., 2017; Che et al., 2017; Li et al., 2017; Yang et al., 2017a) use REINFORCE (Sutton et al., 2000) to finetune a trained model to improve the quality of samples. There is also prior work that introduces more structured discriminators, for instance the energy-based GAN (EBGAN) (Zhao et al., 2016) and RankGAN (Lin et al., 2017). Our language model can be seen as a special energy function, but it is more complicated than the auto-encoder used in (Zhao et al., 2016) since it has a recurrent structure. Hu et al. (2018a) also propose to use structured discriminators in generative models and establish a connection with posterior regularization.
Computer vision style transfer: Our work is also related to unsupervised style transfer in computer vision (Gatys et al., 2016; Huang and Belongie, 2017). Gatys et al. (2016) directly use the covariance matrices of CNN features and align them to transfer style. Huang and Belongie (2017) propose adaptive instance normalization for arbitrary image styles. Zhu et al. (2017) use a cycle-consistency loss to ensure that the content of the images is preserved and that they can be translated back to the original images.
Language models for reranking: Previously, language models have been used to incorporate knowledge from monolingual data mainly by reranking sentences generated by a base model (Brants et al., 2007; Gulcehre et al., 2015; He et al., 2016). (Liu et al., 2017; Chen et al., 2016a) use a language model as training supervision for unsupervised OCR. Our model goes further, using language models as discriminators to distill the knowledge of monolingual data into a base model in an end-to-end way.
Conclusion

We showed that by using language models as discriminators we can outperform traditional binary classifier discriminators on three unsupervised text style transfer tasks: word substitution decipherment, sentiment modification, and related language translation. Compared with a binary classifier discriminator, a language model provides a more stable and more informative training signal for the generator. Moreover, we found empirically that it is possible to eliminate adversarial training with negative samples if a structured model is used as the discriminator, pointing to one possible direction for addressing the training difficulties of GANs. In the future, we plan to explore extensions of our model to semi-supervised learning.
References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.

J. Chen, P.-S. Huang, X. He, J. Gao, and L. Deng. Unsupervised learning of predictors from unpaired input-output samples. arXiv preprint arXiv:1606.04646, 2016a.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016b.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.

Q. Dou and K. Knight. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 266–275. Association for Computational Linguistics, 2012.

Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861, 2017.

L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.

A. N. Gomez, S. Huang, I. Zhang, B. M. Li, M. Osama, and L. Kaiser. Unsupervised cipher cracking using discrete GANs. arXiv preprint arXiv:1801.04883, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535, 2015.

D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.

Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596, 2017a.

Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. On unifying deep generative models. arXiv preprint arXiv:1706.00550, 2017b.

Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, and E. Xing. Deep generative models with learnable knowledge constraints. arXiv preprint arXiv:1806.09764, 2018a.

Z. Hu, Z. Yang, T. Zhao, H. Shi, J. He, D. Wang, X. Ma, Z. Liu, X. Liang, L. Qin, et al. Texar: A modularized, versatile, and extensible toolbox for text generation. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 13–22, 2018b.

X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.

P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.

E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

A. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609, 2016.

G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.

J. Li, R. Jia, H. He, and P. Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437, 2018.

K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165, 2017.

Y. Liu, J. Chen, and L. Deng. Unsupervised sequence classification using sequential output statistics. In Advances in Neural Information Processing Systems, pages 3550–3559, 2017.

N. Pourdamghani and K. Knight. Deciphering related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2513–2518, 2017.

S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black. Style transfer through back-translation. arXiv preprint arXiv:1804.09000, 2018.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833–6844, 2017.

R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.

T.-H. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, and S. Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.

Z. Yang, W. Chen, F. Wang, and B. Xu. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887, 2017a.

Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017b.

L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.

J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
A Training Algorithms
Algorithm 1: Unsupervised text style transfer.

Input: datasets of two different styles X, Y; weights λ and γ; temperature τ.
Initialize model parameters θ_E, θ_G, θ_{LM_x}, θ_{LM_y}.
repeat
    Update θ_{LM_x} and θ_{LM_y} by minimizing L^x_LM(θ_{LM_x}) and L^y_LM(θ_{LM_y}), respectively.
    Update θ_E, θ_G by minimizing L_rec − λ(L^x_LM + L^y_LM), using Equation 5.
until convergence
Output: a text style transfer model with parameters θ_E, θ_G.
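A compact PyTorch rendering of Algorithm 1, as one sketch of how the alternating updates could be organized; rec_loss and lm_loss are hypothetical callables wrapping the losses defined in the paper, and the temperature schedule follows Appendix B:

```python
import torch

def train(encoder, generator, lm_x, lm_y, loader,
          rec_loss, lm_loss, lam=0.5, epochs=20):
    """rec_loss(batch) -> L_rec; lm_loss(batch, tau) -> L^x_LM + L^y_LM.

    Both are hypothetical callables: lm_loss uses Equations 1-2 for the
    discriminator step and the continuous approximation of Equation 5 for
    the generator step.
    """
    opt_lm = torch.optim.Adam(list(lm_x.parameters()) + list(lm_y.parameters()))
    opt_gen = torch.optim.Adam(list(encoder.parameters()) +
                               list(generator.parameters()))
    for epoch in range(epochs):
        # Gumbel-softmax temperature: halved each epoch, floored at 0.001
        # (the annealing schedule described in Appendix B).
        tau = max(1.0 * 0.5 ** epoch, 0.001)
        for batch in loader:
            # Step 1: update the language model discriminators.
            loss_lm = lm_loss(batch, tau)
            opt_lm.zero_grad()
            loss_lm.backward()
            opt_lm.step()
            # Step 2: update encoder/generator on L_rec - lambda * LM losses.
            # (In practice the LM parameters would be held fixed here.)
            loss_gen = rec_loss(batch) - lam * lm_loss(batch, tau)
            opt_gen.zero_grad()
            loss_gen.backward()
            opt_gen.step()
```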
B Model Configurations
We use a model configuration similar to that of Shen et al. (2017) for a fair comparison. We use a one-layer GRU (Chung et al., 2014) as the encoder and as the decoder (generator). We set the word embedding size to , the GRU hidden size to , and v to be a vector of size . For the language model, we use the same architecture as the decoder; the parameters of the language model are not shared with other parts of the model and are trained from scratch. We use a batch size of 128, containing 64 samples each from X and Y. We use the Adam (Kingma and Ba, 2014) optimization algorithm to train both the language model and the auto-encoder, with the same learning rate for both. Hyper-parameters are selected on the validation set using grid search: the learning rate is selected from [1e−, 1e−, 1e−, 1e−], and λ, the weight of the language model loss, is selected from [1., ., .]. Models are trained for a total of 20 epochs. We use an annealing strategy for the temperature τ of the Gumbel-softmax approximation: the initial value of τ is set to 1.0, and it decays by half every epoch until reaching the minimum value of 0.001.

C Sentiment Transfer Examples