DRAG: Director-Generator Language Modelling Framework for Non-Parallel Author Stylized Rewriting
Hrituraj Singh, Adobe Research ([email protected])
Gaurav Verma*, Georgia Tech ([email protected])
Aparna Garimella, Adobe Research ([email protected])
Balaji Vasan Srinivasan, Adobe Research ([email protected])

*This work was carried out when the author was at Adobe Research.
Abstract
Author stylized rewriting is the task of rewriting an input text in a particular author's style. Recent works in this area have leveraged Transformer-based language models in a denoising autoencoder setup to generate author stylized text without relying on a parallel corpus of data. However, these approaches are limited by the lack of explicit control over target attributes and by being entirely data-driven. In this paper, we propose a Director-Generator framework to rewrite content in the target author's style, specifically focusing on certain target attributes. We show that our proposed framework works well even with a limited-sized target author corpus. Our experiments on corpora consisting of relatively small-sized texts authored by three distinct authors show significant improvements over existing works at rewriting input texts in the target author's style. Our quantitative and qualitative analyses further show that our model has better meaning retention and results in more fluent generations.
1 Introduction

With recent advances in language modeling techniques that have resulted in powerful language models (Radford et al., 2019; Devlin et al., 2018; Brown et al., 2020), along with an increased interest in stylized content generation (Hu et al., 2017; Shen et al., 2017; Subramanian et al., 2018; Fu et al., 2018; Niu and Bansal, 2018), large language models have been successfully tuned to achieve text stylization (Lample et al., 2018; Ziegler et al., 2019; Syed et al., 2020; Singh et al., 2020). Apart from transferring an input text to a target style, which has received recent interest from the community, understanding and measuring style have been persistently explored over the last few decades (Kessler et al., 1997; Garera and Yarowsky, 2009; Liu, 2012; Verma and Srinivasan, 2019). Lying at the intersection of style transfer enabled by advanced language models and a deep understanding of style as a nuanced combination of several linguistic concepts, problems like stylized generation and stylized rewriting have gained further traction. A large body of work in style transfer focuses on binary aspects such as positive-negative sentiment (Li et al., 2018; Ziegler et al., 2019), formal-informal (Jain et al., 2019), and sometimes a mixture of these attributes (Subramanian et al., 2018). To fuel this interest in binary stylization, datasets comprising text from the extreme ends of these spectrums have also emerged (e.g., positive-negative sentiment (Mathews et al., 2016), formal-informal (Rao and Tetreault, 2018)). As pointed out by Syed et al. (2020), author stylized rewriting does not directly fit under any of these variants, as the writing style of an author is an amalgamation of several such attributes and needs to be modeled in a fine-grained manner.

Apart from the distinction along style dimensions, prior works can also be categorized as supervised (using a parallel corpus (Jhamtani et al., 2017)) and unsupervised (Li et al., 2018; Syed et al., 2020; Niu and Bansal, 2018). In supervised frameworks, parallel data is used to tune sequence-to-sequence models for stylized rewriting. However, annotating such a parallel corpus is a tedious effort, and there is therefore increased interest in unsupervised style transfer, i.e., when no direct supervision or parallel data is available for training the models. In this work, we focus on such an unsupervised setting.

Existing approaches to unsupervised author stylized rewriting rely on implicitly learning the target stylistic attributes from data and do not allow finer control over generation (Syed et al., 2020). While this is a good starting point for author-stylized rewriting, it is desirable to further improve the rewriting model on certain aspects without compromising on other attributes that the model has already optimized; an example would be to retain the stylistic strengths while improving content retention, or vice versa. To this end, we propose the DiRecting A Generator (DRAG) framework. Our quantitative and qualitative experiments show the viability of the proposed approach. Experiments further indicate that the framework's setup allows it to operate efficiently in scarce-data settings and improves performance over the baseline models. Our contributions can be summarized as follows:
(1) We introduce a director-generator approach to rewrite an input text in a target author's style.
(2) We propose linguistic alignment scores, at both the local and the global level, and extend these to design thresholds for the generator and director.
(3) We present experimental results on texts written by three authors from the Gutenberg corpus with very distinct writing styles, and show that our approach outperforms prior works across content retention and style alignment metrics.
(4) We further identify and discuss shortcomings of our proposed approach, and present error analysis to aid future research in author stylized rewriting.
2 Related Work

With the rise of Transformer-based (Vaswani et al., 2017) language models, generative pretraining (Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020) has advanced the field of NLP significantly. Fine-tuning such large language models on specific tasks has become very prevalent (Sun et al., 2019; Lee et al., 2020; Lample and Conneau, 2019; Raffel et al., 2019; Liu et al., 2019). Pretraining infuses generic language knowledge into the language model, helping it learn specific tasks with relatively little supervision. In fact, recent approaches (Radford et al., 2019; Brown et al., 2020) show that often even such small supervision is not required, and a simple instruction can be used to solve specific tasks by utilizing the capabilities of large language models trained on very large datasets.

Pretraining of such models usually involves optimizing them on Masked Language Modelling (MLM) (Devlin et al., 2018), Causal Language Modelling (CLM) (Radford et al., 2019), or other similar objectives (Clark et al., 2020). While CLM is the task of autoregressively predicting the next word given the previous words/context, MLM is the task of recovering masked tokens from a given input. While these approaches mostly train only an encoder or only a decoder, Lample and Conneau (2019) explored initializing encoder-decoder frameworks using pre-trained encoders for cross-lingual translation. Such a technique, with appropriate modifications, has been shown to be successful in incorporating stylistic aspects of language as well (Conneau and Lample, 2019; Syed et al., 2020). All these works minimize a denoising autoencoder loss to induce style in language models in a reconstruction framework. For our explorations, we leverage these works to initialize our DRAG framework.

There is an increased interest in stylistic generation or rewriting of content. Most approaches define dimensions like formality-informality (Shen et al., 2017; Ficler and Goldberg, 2017; Jain et al., 2019; Sun et al., 2019) and achieve alignment along these dimensions. While some of these approaches rely on a parallel corpus (Ficler and Goldberg, 2017; Jhamtani et al., 2017), many focus on unsupervised frameworks (Li et al., 2018; Shen et al., 2017; Jain et al., 2019), where the model preserves the input content in the output while biasing the generations towards the target style. While some approaches utilize simple editing to achieve style along particular dimensions (Li et al., 2018), others achieve this through discriminators (Fu et al., 2018) or scorers (Jain et al., 2019). As mentioned before, since author style is an amalgamation of several such attributes, it requires much more than a discriminator or tuning along a single dimension to achieve stylization.

Due to the difficulty of understanding author style, and the fine-grained nature of that style even when understood, the problem of author stylized rewriting has not been explored much. While Jhamtani et al. (2017) address this problem for a specific author (Shakespeare), their approach is contingent on the availability of a parallel corpus. Since preparing a parallel corpus is a tedious and often intractable process, especially when dealing with multiple authors and multiple combinations of input and output styles, it is essential to focus on unsupervised solutions. Most recently, Syed et al. (2020) leverage the capabilities of large language models to solve this problem in an unsupervised manner.
3 Author Style

There has been significant work on understanding binary stylization along dimensions like formal-informal and positive-negative sentiment (Rao and Tetreault, 2018; Kessler et al., 1997; Pavlick and Tetreault, 2016; Collins-Thompson and Callan, 2005; Hovy, 1990; Inkpen and Hirst, 2006; Kantrowitz, 2003); however, there is limited work on understanding an author's writing style (McCarthy et al., 2006; Forgeard, 2008; Verma and Srinivasan, 2019). While style can be a mixture of several factors including, but not limited to, lexical preferences, syntactic/sentential choices, discourse structure, narrative style, and tone, we follow Syed et al. (2020) and consider an author's style at three levels:
Surface style is estimated using the frequencies of different surface elements, such as the number of commas, semicolons, colons, question marks, exclamation marks, and hyphens per paragraph, in a given author's text. We thus quantify the surface-style elements into a 6-dimensional vector.
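For concreteness, the surface vector reduces to simple counting; the following minimal Python sketch (the function names are ours, not from the paper) computes the per-paragraph vectors and averages them into an author-level estimate:

```python
from typing import List

# The six surface elements named above; their order fixes the vector dimensions.
SURFACE_MARKS = [",", ";", ":", "?", "!", "-"]

def surface_style_vector(paragraph: str) -> List[float]:
    """Per-paragraph counts of the six surface elements (6-dim vector)."""
    return [float(paragraph.count(mark)) for mark in SURFACE_MARKS]

def author_surface_style(paragraphs: List[str]) -> List[float]:
    """Average the per-paragraph vectors to estimate an author's surface style."""
    vectors = [surface_style_vector(p) for p in paragraphs]
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```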
Lexical style of an author is reflected in the author's choice of words. To describe the same concept, different authors may use different words. For instance, Rudyard Kipling, known for his classics in children's literature, tended to use more concrete words (e.g., gongs, rockets, torch), while Abraham Lincoln, being a political writer, used more abstract words (e.g., freedom, patriotism). We enumerate lexical style categories as subjective, objective, literary, colloquial, abstract, and concrete (Brooke and Hirst, 2013). We use lexicons for each of these categories (Brooke and Hirst, 2013), and define the lexical style alignment of each word in the vocabulary to a given style category as the average normalized point-wise mutual information (PMI) between that word and the seed words in the lexicon for that style category. The lexical style alignment for each word is thus a 6-dimensional vector. We use the EmoBank corpus (Buechel and Hahn, 2017) to compute the co-occurrence statistics for the PMI computations. The inclination of a word towards a style category is positive if its normalized PMI score is positive with respect to the given category. The inclination of an author towards a style category is then estimated by the fraction of words in their text that have a positive inclination towards the category.
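A sketch of this computation follows; it is our illustrative reading of the description above, assuming sentence-level co-occurrence windows over EmoBank (the paper does not specify the window) and hypothetical function names:

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List

def cooccurrence_counts(sentences: List[List[str]]):
    """Unigram and pairwise co-occurrence counts over sentence-level windows."""
    uni, pair = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        uni.update(words)
        pair.update(frozenset(p) for p in combinations(sorted(words), 2))
    return uni, pair, len(sentences)

def npmi(w1: str, w2: str, uni, pair, n: int) -> float:
    """Normalized point-wise mutual information, in [-1, 1]."""
    p_xy = pair[frozenset((w1, w2))] / n
    if p_xy == 0.0:
        return -1.0
    if p_xy == 1.0:
        return 1.0
    pmi = math.log(p_xy / ((uni[w1] / n) * (uni[w2] / n)))
    return pmi / -math.log(p_xy)

def lexical_alignment(word: str, lexicons: Dict[str, List[str]],
                      uni, pair, n: int) -> List[float]:
    """6-dim vector: average normalized PMI of `word` against the seed
    words of each of the six style-category lexicons."""
    return [sum(npmi(word, seed, uni, pair, n) for seed in seeds) / len(seeds)
            for seeds in lexicons.values()]

def author_inclination(words: List[str], lexicons, uni, pair, n) -> List[float]:
    """Fraction of the author's words positively inclined to each category."""
    scores = [lexical_alignment(w, lexicons, uni, pair, n) for w in words]
    return [sum(s[i] > 0 for s in scores) / len(words)
            for i in range(len(lexicons))]
```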
Syntactic style of an author is indicated by the nature of the sentences used, and we estimate the distribution of different types of sentences in an author's text. Sentence types may range from complex, as seen in philosophical writings, to simple, as observed in children's storybooks. We use five categories of sentence styles: (i) simple, (ii) compound, (iii) complex, (iv) complex-compound, and (v) others (Feng et al., 2012; Verma and Srinivasan, 2019; Syed et al., 2020). Sentences are categorized into one of these types using the algorithm proposed by Feng et al. (2012). The resulting 5-dimensional probability distribution vector is used as the estimate of syntactic style. These vectors are estimated at the corpus level, unlike those for lexical and surface style, which are computed at the paragraph level.
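Since the Feng et al. (2012) categorizer is used as-is, the syntactic estimate reduces to a normalized histogram over sentence types. A minimal sketch, treating the categorizer as a black box:

```python
from collections import Counter
from typing import Callable, List

SENTENCE_TYPES = ["simple", "compound", "complex", "complex-compound", "other"]

def syntactic_style(sentences: List[str],
                    classify: Callable[[str], str]) -> List[float]:
    """Corpus-level distribution over the five sentence types; `classify`
    stands in for the Feng et al. (2012) categorizer."""
    counts = Counter(classify(s) for s in sentences)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in SENTENCE_TYPES]
```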
4 DiRecting A Generator for Stylized Rewriting

Our proposed framework, DRAG, which aims to rewrite a given piece of text in a specific target author's style, consists of three main stages:

(1) Pretraining a language model to infuse general linguistic knowledge into the model;
(2) Adapting the pre-trained language model towards the target author's writing style by further pretraining it on text written by this author (Syed et al., 2020); and
(3) Using a director-generator framework (as discussed later) to fine-tune this biased language model to improve its style transfer capabilities even further while fixing content preservation issues.

It is worth noting that we do not rely on the availability of parallel data for any of our experiments.
4.1 Language Model Pretraining

To infuse general linguistic knowledge into a language model, we leverage Transformer-based pretrained language models (Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020) due to their recent success in text processing tasks (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020). Similar to Conneau and Lample (2019), we first train a Transformer-based encoder on the Masked Language Modelling (MLM) task, with 15% of the tokens masked (Devlin et al., 2018), on a generic text corpus. We initialize an encoder-decoder framework, as shown in Figure 1, with this language model.
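For concreteness, a minimal sketch of the MLM corruption step, assuming a plain 15% [MASK] substitution (the 80/10/10 replacement refinement of Devlin et al. (2018) is omitted for brevity) and a hypothetical MASK_ID:

```python
import random
from typing import List, Tuple

MASK_ID = 4         # hypothetical vocabulary id of the [MASK] token
MASK_PROB = 0.15    # 15% of tokens are masked

def mlm_corrupt(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels): labels hold the original id at masked
    positions and -100 (ignored by the loss) everywhere else."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)
    return inputs, labels
```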
4.2 Adapting to the Target Author's Style

To adapt the pretrained LM for author stylized rewriting, Syed et al. (2020) initialize an encoder-decoder framework with the pretrained LM, as shown in Figure 1. This is followed by optimizing it on a denoising autoencoder (DAE) loss (Lample et al., 2018; Lample and Conneau, 2019) only over the target author's corpus. Syed et al. (2020) use the DAE loss to infuse an author's linguistic style into the reconstruction model; we refer to this framework as StyleLM.

Figure 1: Language model pretraining using Masked Language Modelling, followed by encoder-decoder initialization using the pretrained models. This process still leaves the encoder-decoder attention parameters uninitialized; these can be initialized using the denoising autoencoder training depicted in the figure.
The fine-tuning using the DAE loss on a target author's corpus encourages recovering actual paragraphs from their noisy versions (Lample and Conneau, 2019). For a paragraph $g$ in corpus $G$ and its noisy version $C(g)$ ($C(\cdot)$ being the noise function), the DAE loss is given by

$$\mathcal{L}_{\text{DAE}}(\theta_e, \theta_{ed}, \theta_d) = -\frac{1}{|G|} \sum_{g \sim G} \log P\big(g \mid C(g);\ \theta_e, \theta_{ed}, \theta_d\big) \tag{1}$$

where $P$ is the probability of reconstruction for given encoder parameters $\theta_e$, decoder parameters $\theta_d$, and encoder-decoder attention parameters $\theta_{ed}$. Note that $\theta_{ed}$ does not refer to any additional layer, but to the parameters already present in the Transformer that are responsible for encoder-decoder attention. In our setup, the noise function $C(\cdot)$ introduces two kinds of noise: (a) random dropping of words with 10% probability, and (b) word masking, replacing a word with the [MASK] token, with 10% probability. Given a noisy input, the encoder fills the [MASK] tokens with suitable replacements (based on the knowledge from its MLM pretraining), thus creating a pseudo-generic input for the decoder, whose target sequence is aligned to the target author's style.
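A minimal sketch of this noise function, assuming drop and mask are mutually exclusive per-token events (the paper does not state whether the two corruptions are applied jointly or sequentially):

```python
import random
from typing import List

P_DROP = 0.1   # probability of dropping a word
P_MASK = 0.1   # probability of masking a word

def corrupt(tokens: List[str]) -> List[str]:
    """The noise function C(.): drop or mask each token independently."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < P_DROP:
            continue                   # (a) random word dropping
        elif r < P_DROP + P_MASK:
            noisy.append("[MASK]")     # (b) word masking
        else:
            noisy.append(tok)
    return noisy
```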
However, we identify and verify experimentally two issues with this approach:

(1) It requires a large target author corpus to achieve meaningful content preservation. This is evident from its very low content preservation scores (as discussed in Section 5.2) when trained on authors with relatively small corpora. Even with large corpora, the model still suffers from exposure bias to texts written only by the target author, leading to spurious outputs for unseen inputs.

(2) The masking results in a significant emphasis on lexical style aspects, with a lesser focus on surface and syntactic preferences. Since the model is completely data-driven, there is no way to explicitly add emphasis on additional style aspects.

One of the primary reasons behind (1) is the lack of explicit initialization of the encoder-decoder attention parameters in StyleLM, which are instead initialized randomly. The model therefore needs a large corpus of author data to stabilize these parameters. To fix this, we propose to train the entire encoder-decoder language model using the DAE loss over the same generic corpus used for pretraining. The resulting model remains in the generic language space (English, in our case) and is henceforth referred to as VanillaLM. We further fine-tune VanillaLM on the author corpus with the DAE loss to arrive at an improved version of StyleLM, which we call iStyleLM. This offers better encoder-decoder attention initialization and also removes the exposure bias of StyleLM, resulting in a more resilient and stable model with improved content preservation abilities (as demonstrated in Section 5.2).

However, at this point we note that iStyleLM still fails to address (2), and its content preservation ability is also sub-optimal, as the target author's style aspects infused at the later stage of training override some of the general linguistic knowledge. To further improve on iStyleLM, we introduce a Director-Generator component to our training framework in the next section.
4.3 Director-Generator Finetuning

For the Director-Generator finetuning, we find inspiration in standard RL strategies (Rennie et al., 2017; Ranzato et al., 2015), where the nearby space is explored and certain actions are rewarded more highly than others, consequently being encouraged in the future. We, however, find direct rewarding unstable for our problem. Hence, we generate potential directives during exploration and accept or reject them on the basis of thresholds. A directive, in our context, is an output paragraph generated from an input by a director model, which is fixed and has been initialized using iStyleLM. Specifically, we create two copies of the iStyleLM: the Director and the Generator.

Figure 2: Both the director and the generator, initialized using iStyleLM, work together to improve the final outputs. While the director remains in the space of author style, generating and exploring potential directives, the generator keeps changing its thresholds as it improves its content and style capabilities. The directives above the average thresholds for the same example are accepted, while the rest are rejected.
As the names indicate, for each input, the director proposes n potential directives (paragraphs), while the generator generates n thresholding outputs (paragraphs), as shown in Figure 2. We generate the potential directives using nucleus sampling (Holtzman et al., 2019) with a softmax temperature of 1.2, while the thresholding outputs are generated with a softmax temperature of 0.8 (the same value is used at inference time as well). We score the director and generator outputs on various content and style attributes. For content preservation, we use the BLEU score between input and output as the content score. For lexical style, the mean squared error is calculated between the 6-dimensional lexical alignment vector of the directives/generator outputs (computed as the averaged sum of the alignments of the words in the proposal) and the average lexical alignment of paragraphs in the target author corpus. The mean squared error for surface style is calculated analogously. The scores L and S for lexical and surface style, respectively, are then calculated as the reciprocals of the mean squared errors (with $\epsilon$ added in the denominator to avoid division by zero). For syntactic choices, since we wish to match the probability distribution of different types of sentences at the corpus level, we calculate the syntactic score as

$$SX = \frac{\mathrm{sum}(P_p \circ P_t)}{\mathrm{sum}(P_p)}$$

where $P_p$ denotes the frequency distribution of different types of sentences in a directive/generator output, $P_t$ the probability distribution of different types of sentences in the target author corpus, and $\circ$ the Hadamard product. All three scores are summed to calculate the style score for the directives (and the generator outputs).
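The scoring described above can be sketched as follows; EPS is the $\epsilon$ from our implementation details, sentence_bleu stands in for the (unspecified) BLEU implementation, and the function names are ours:

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu

EPS = 0.05  # the epsilon added to avoid division by zero

def content_score(inp: list, out: list) -> float:
    """BLEU between the (tokenized) input and the candidate output."""
    return sentence_bleu([inp], out)

def reciprocal_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Lexical score L / surface score S: reciprocal of the MSE against
    the target author's average style vector."""
    return 1.0 / (float(np.mean((pred - target) ** 2)) + EPS)

def syntactic_score(p_p: np.ndarray, p_t: np.ndarray) -> float:
    """SX = sum(P_p o P_t) / sum(P_p): overlap between the candidate's
    sentence-type frequencies and the target distribution."""
    return float(np.sum(p_p * p_t) / np.sum(p_p))

def style_score(lex, lex_t, surf, surf_t, p_p, p_t) -> float:
    """The three style scores are summed into a single style score."""
    return (reciprocal_mse(lex, lex_t)
            + reciprocal_mse(surf, surf_t)
            + syntactic_score(p_p, p_t))
```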
The iStyleLM model already captures certain stylistic aspects of the target author. We want our model to leverage this understanding and improve on aspects where iStyleLM does not perform well. To capture this, we compute the content and style scores of all the potential directives and generator outputs, and retain only those directives that have both content and style scores better than the averages of the generator's outputs' scores. The accepted directives become real directives for the generator and are used to train it with the teacher-forcing cross-entropy loss. Note again that the director remains frozen at iStyleLM during the entire training process. When multiple potential directives are better than the generator's outputs' averages, the cross-entropy loss for each directive is weighted by its marginal difference from the generator's average score on the style dimension; i.e., if the style score for a directive is D_s and the average style score of the generator's outputs is G_s, its weight during cross-entropy training is D_s - G_s. This objective is similar to the one used in SCST (Rennie et al., 2017), except that only accepted directives are encouraged and nothing is explicitly discouraged.

To stabilize the Director-Generator finetuning framework, we use (a) a fixed director and (b) a moving generator. Contrary to the natural expectation of exploring better directives by also training the director, the fixed or frozen director prevents catastrophic degradation in case the training biases the model towards specific choices. It is a known phenomenon in RL frameworks that the model quickly learns to bias towards specific choices that are more rewarding. Specifically, we observe that training the director as well leads to overfitting to a limited set of stylistic choices, resulting in the exploration of sub-optimal potential directives that seldom cross the required thresholds, especially the content preservation ones. With a moving (i.e., trained at each step) generator, its output scores account for the current state of the model against a fixed, stable director, and hence only those directives get accepted which are better than the current capabilities (thresholds) of the generator. With a fixed generator, directives that are worse than the current capabilities of the model, but better than the capabilities of the fixed generator, would also get accepted, thus training the model in the wrong direction. The Director-Generator finetuned iStyleLM yields our proposed DRAG framework. At inference time, we drop the director and use the generator as our final rewriting model.
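Putting the pieces together, one Director-Generator update can be sketched as below. The director/generator interface (sample, cross_entropy), the style_score_of wrapper over the scoring functions above, and the optimizer are hypothetical stand-ins; only the acceptance rule, the D_s - G_s weighting, and the frozen director are taken from the description above:

```python
from statistics import mean

N = 8  # number of directives / thresholding outputs per input

def drag_step(inp, director, generator, optimizer):
    """One Director-Generator update: the director is frozen, and only
    the generator is trained, only on accepted directives."""
    directives = [director.sample(inp, temperature=1.2) for _ in range(N)]  # nucleus sampling
    outputs = [generator.sample(inp, temperature=0.8) for _ in range(N)]

    # Thresholds: the averages of the generator's own outputs' scores.
    avg_content = mean(content_score(inp, o) for o in outputs)
    avg_style = mean(style_score_of(o) for o in outputs)

    loss = None
    for d in directives:
        d_style = style_score_of(d)
        # Accept only directives beating BOTH generator averages.
        if content_score(inp, d) > avg_content and d_style > avg_style:
            weight = d_style - avg_style          # the D_s - G_s weighting
            ce = generator.cross_entropy(inp, d)  # teacher-forced CE loss
            loss = weight * ce if loss is None else loss + weight * ce

    if loss is not None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```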
5 Experiments

5.1 Implementation Details

We use a Transformer encoder with 512 hidden units, 16 heads, a dropout rate of 0.1, and learned positional embeddings during MLM training. The model is trained using the Adam optimizer with a learning rate of − . The batch size is 32 with a stream of 256 tokens, and the whole setup is trained until the validation perplexity shows no further improvement. The Transformers used in the encoder-decoder setup have the same parameters, and are initialized using the above encoder before training on further objectives. The gating in the Transformer layers, shown in Figure 1, is as proposed by Parisotto et al. (2019). During DAE-loss training, we use the same hyperparameters as Conneau and Lample (2019) and Syed et al. (2020), and set p_drop and p_blank to 0.1. During director-generator training, we use n = 8 and ε = 0.05; the learning rate used in this case is − . In all models, we use Byte Pair Encoding (Sennrich et al., 2015) with 80k codes learnt over the entire generic corpus.

We use the 2,857 books written by 142 authors in the Gutenberg corpus (Lahiri, 2014), as used in Syed et al. (2020), along with Wikipedia articles, to form a corpus of about 4.6M passages. We refer to this corpus as generic throughout our experiments, since it infuses only generic linguistic knowledge into the models. While MLM and VanillaLM are trained on the generic corpus, we select three authors with the most distinct writing styles, namely Albert Einstein, Michael Faraday, and John Stuart Mill, as the target authors for author-specific stylized rewriting; distinctiveness is measured by comparing their lexical alignments with the average lexical alignment of the Gutenberg corpus. Note that the choice of authors is made on a purely statistical basis: these three authors show the maximum lexical style difference, on the style vectors described earlier, when compared with the lexical style of the entire generic corpus. For evaluation, we use the Opinosis corpus (Ganesan et al., 2010) as well as a mixed-author Gutenberg subset (with five passages from all the authors except the target author), which we refer to as Generic (Test).

5.2 Quantitative Results

Table 1 shows the results averaged over the three selected authors. The experiments are conducted on the Opinosis and Generic (Test) datasets, using the following four models:

• VanillaLM is initialized using the MLM-trained encoder and decoder and fine-tuned on the generic corpus using the DAE loss.

• StyleLM, proposed by Syed et al. (2020), is also initialized using the MLM-trained encoder and decoder, but fine-tuned only on the target author corpus (instead of the generic corpus).

• iStyleLM, an improved and stronger baseline compared to StyleLM, is initialized with VanillaLM and then fine-tuned on the target author corpus.

• DRAG is our proposed model. We use iStyleLM to initialize both the director and the generator as described above, and then fine-tune them using inputs from the generic corpus.
While the Generic (Test) corpus is predominantly literary due to the nature of its source, Opinosis covers everyday language. As shown in Table 1, StyleLM improves on the style alignment scores, but at a great cost to content preservation when the target author corpus is small. (Note that the StyleLM code is not publicly available; the results shown in the table use our own implementation.) This is possibly due to the random initialization of the encoder-decoder attention parameters in the DAE training over the target corpus, as reflected in the superior performance of iStyleLM. We also note that while the approach proposed in StyleLM (Syed et al., 2020) improves lexical scores significantly, it fails to bring the same level of improvement in surface and syntactic alignment, perhaps due to the rare chance of less frequent punctuation symbols being masked during DAE training, even more so when the target author corpus is not large enough to cover all possible masks; similar reasoning explains the syntactic alignment issues. The DRAG approach, however, improves on both surface and syntactic alignment along with content preservation scores, even though this comes at a marginal cost in lexical alignment. Please note that the purpose of VanillaLM is to provide an estimate of the upper limit on the content preservation scores; it is not to be treated as a baseline, given the simplicity of its task (just copying the input tokens).

Dataset | Model | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | BLEU ↑ | Lexical (RMSE) ↓ | Surface (RMSE) ↓ | Syntactic (JSD) ↓
Opinosis | VanillaLM | 75.23 | 56.12 | 74.28 | 59.46 | 0.232 | 2.74 | 0.132
Opinosis | StyleLM (Syed et al., 2020) | – | – | – | – | – | – | –
Opinosis | iStyleLM | – | – | – | – | – | – | –
Opinosis | DRAG | 57.23 | 36.12 | 56.98 | 37.53 | – | – | –
Generic (Test) | VanillaLM | 72.34 | 54.65 | 71.93 | 56.46 | 0.218 | 2.48 | 0.120
Generic (Test) | StyleLM (Syed et al., 2020) | – | – | – | – | – | – | –
Generic (Test) | iStyleLM | – | – | – | – | – | – | –
Generic (Test) | DRAG | 52.39 | 30.66 | 51.98 | 33.28 | – | – | –

Table 1: ↑ indicates higher scores are better, while ↓ indicates the opposite. Apart from lexical alignment, where StyleLM performs marginally better, DRAG outperforms prior approaches. VanillaLM performs best at content preservation but lacks any stylization.

Input | Albert Einstein | Michael Faraday | John Stuart Mill
The accuracy at this point is very good | The experimental definitions developed is very clearly | The point is very wonderful | The physical for this , is very very pretty .
The estimated time to arrival does not seem to calculate the travelling time accurately | The estimated time relative to the leading existence does not seem likely to calculate the travelling time exactly | The discovery of ascertaining time ; indeed , do not not show accuracy to the time to angles | The total time is to infer that arrival is not verified but often clearly a , accurately .

Table 2: Qualitative outputs for three different authors for the same inputs.

Input | StyleLM (Syed et al., 2020) | DRAG (Ours)
but after that it is very easy and quite accurate to use. | But for all that it is very After question about this and quite measured with consideration. | But after all it is very accurate and quite illustrious to the use of events .
Leather seats are very comfortable. | come on very very should we have any replaced. | This moving hypothetical seats are very comfortable.
I am not real fond of the electric seat and I find it is not as comfortable as my F150 pickup on trips | I am not real and use of the electric position and I find that it is not as well may's for the very hardly small have we led train | I am not real fond of the electric seat , and I find it is not as comfortable as my physical relative on railway investigations .

Table 3: Comparison between StyleLM and DRAG for Albert Einstein.
5.3 Qualitative Results

We also qualitatively compare outputs for different authors and different models. In Table 2, we show the outputs of DRAG for the same inputs and different target authors. Evidently, our model produces changes at both the lexical and the surface level. The word 'good' in the first input is replaced by words like 'clearly', 'wonderful', and 'pretty', depending on the author. Some words do not replace any word but still get added, changing the syntactic structure of the sentences; for example, the appearance of the word 'relative' starts a comparison to the 'leading existence', making the sentence a bit more complex. Sometimes surface-level changes, like the appearance of ';', also change the complexity of the sentences.

We also show a comparison between StyleLM and our proposed DRAG outputs for the same inputs, with Albert Einstein as the target author, in Table 3. Evidently, while both models try to achieve stylistic alignment, StyleLM ends up distorting the input sentence too much, resulting in poor content preservation. Words like 'measured', 'hypothetical', and 'physical relative' reflect the objective approach used in Albert Einstein's writings.
6 Discussion

While advancements in language generation are happening at a very high pace, the notion of style, and the ability of models to rewrite the same content in different styles, is still far from solved. One of the most important observations, made by Lample et al. (2018), is that it is very difficult to separate content from style; in fact, previous approaches that worked on the principle of disentangling style from content were found not to disentangle style so much after all (Lample et al., 2018). The notion of style is still very far from being defined and concretized. While some psycholinguistic concepts can be defined to some extent (formality, sentiment, etc.), defining style at the level of an author is very difficult due to the manifestation of style at different levels, as enumerated by Verma and Srinivasan (2019). Even with such enumeration at various levels, the picture is far from exhaustive, and our approach therefore still requires a more granular understanding of style to closely emulate a target author's style.

Our evaluation uses automatic metrics for style due to the difficulty of conducting human evaluation for author attribution tasks (Syed et al., 2020). The skill needed to identify an author's style is highly specialized, making human evaluation very costly. A more granular and detailed study of how humans interpret an author's style is required to design a proper feedback mechanism; this is, however, outside the scope of this work.
7 What Did Not Work

In this section, we discuss some of our explorations that did not work as expected, to aid future research in author stylization. Note that our experiments and observations here are limited to the problem of author stylized rewriting. We experimented with various reinforcement learning setups, as these were a natural choice once we had scoring engines for rewards. Using VanillaLM as the policy, we explored Self-Critical Sequence Training (SCST) (Rennie et al., 2017; Ranzato et al., 2015) and Proximal Policy Optimization (Schulman et al., 2017). However, all these setups were unstable in various ways for our problem.

SCST aims to bring the advantages of reinforcement learning to sequence-level problems. A model (or policy) generates/explores outputs (or episodes) using both multinomial sampling and greedy sampling. If the greedily sampled episode's reward is r_b and the non-greedily sampled episode's reward is r, the whole setup is trained using REINFORCE (Sutton and Barto, 2018), with r as the actual reward and r_b as the baseline reward. We found this to limit exploration: our problem is considerably harder than the tasks on which SCST has been successful, since our target metric is of an exact value. It therefore resulted in no improvement in either style or content scores. We consequently shifted to a modified version to encourage exploration, where we generated multiple episodes for each input and averaged their scores to use as the baseline reward r_b, training the setup on all generated episodes using REINFORCE (Sutton and Barto, 2018); a minimal sketch of this variant is shown below. We found this approach to be effective at style incorporation but not generalizable at all: the model learned to repeat certain patterns with poor content preservation. We tried to balance it with occasional denoising autoencoder loss training, but that only delayed the overfitting and did not solve it. We also attempted Proximal Policy Optimization in the same setup as Sun et al. (2019), but it resulted in even worse outputs due to the critic's failure to approximate complex value functions for our objectives.

As discussed already, we only accept directives whose scores are above the thresholds. We also tried a variant that used the directives that do not score above the thresholds: we scored them negatively, resulting in a framework somewhat similar to SCST. Within a few steps, however, we found more negative scores than positive ones due to bad content preservation, pushing the model away from a bad state towards some undefined state and resulting in spurious and inconsistent outputs.
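For reference, the multi-sample-baseline REINFORCE variant described above can be sketched as follows (the policy interface and the reward function are hypothetical stand-ins):

```python
from statistics import mean

def reinforce_step(inp, policy, reward_fn, optimizer, n=8):
    """Modified SCST: sample n episodes, use their mean reward as the
    baseline r_b, and reinforce every episode by its advantage."""
    episodes = [policy.sample(inp) for _ in range(n)]
    rewards = [reward_fn(inp, e) for e in episodes]
    r_b = mean(rewards)                      # averaged baseline reward

    # REINFORCE: minimize -(r - r_b) * log p(episode)
    loss = sum((r_b - r) * policy.log_prob(inp, e)
               for e, r in zip(episodes, rewards))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```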
8 Conclusion

In this work, we addressed the shortcomings of prior approaches to author stylized rewriting and overcame them with DRAG, a Director-Generator approach. We showed the effectiveness of our proposed approach for stylized rewriting on three different authors from the Gutenberg corpus. Furthermore, we discussed the limitations of our approach and some of its failure cases to aid future research. While our DRAG approach is able to stabilize training while improving the content preservation abilities of the model, a standard reinforcement learning approach, when stabilized, has the potential to improve these scores much further. Improved understanding of author style with a human in the loop, and stabilizing RL with Transformer models, are subjects of future research.
References
Julian Brooke and Graeme Hirst. 2013. A multi-dimensional bayesian approach to lexical style. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 673–679.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics.
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
Kevyn Collins-Thompson and Jamie Callan. 2005. Predicting reading difficulty with statistical language models. Journal of the American Society for Information Science and Technology, 56(13):1448–1462.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Characterizing stylistic elements in syntactic structure. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1522–1533.
Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633.
Marie Forgeard. 2008. Linguistic styles of eminent writers suffering from unipolar and bipolar mood disorder. Creativity Research Journal, 20(1):81–92.
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.
Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions.
Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 710–718.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Eduard H Hovy. 1990. Pragmatics and natural language generation. Artificial Intelligence, 43(2):153–197.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. arXiv preprint arXiv:1703.00955.
Diana Inkpen and Graeme Hirst. 2006. Building and using a lexical knowledge base of near-synonym differences. Computational Linguistics, 32(2):223–262.
Parag Jain, Abhijit Mishra, Amar Prakash Azad, and Karthik Sankaranarayanan. 2019. Unsupervised controllable text formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6554–6561.
Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shakespearizing modern language using copy-enriched sequence-to-sequence models. arXiv preprint arXiv:1707.01161.
Mark Kantrowitz. 2003. Method and apparatus for analyzing affect and emotion in text. US Patent 6,622,140.
Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. 1997. Automatic detection of text genre. arXiv preprint cmp-lg/9707002.
Shibamouli Lahiri. 2014. Complexity of word collocation networks: A preliminary structural analysis. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 96–105, Gothenburg, Sweden. Association for Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text rewriting. In International Conference on Learning Representations.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437.
Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.
Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E Peters, and Noah A Smith. 2019. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855.
Alexander Patrick Mathews, Lexing Xie, and Xuming He. 2016. SentiCap: Generating image descriptions with sentiments. In Thirtieth AAAI Conference on Artificial Intelligence.
Philip M McCarthy, Gwyneth A Lewis, David F Dufty, and Danielle S McNamara. 2006. Analyzing writing styles with Coh-Metrix. In FLAIRS Conference, pages 764–769.
Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389.
Emilio Parisotto, H Francis Song, Jack W Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. 2019. Stabilizing transformers for reinforcement learning. arXiv preprint arXiv:1910.06764.
Ellie Pavlick and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics, 4:61–74.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. arXiv preprint arXiv:1803.06535.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.
Hrituraj Singh, Gaurav Verma, and Balaji Vasan Srinivasan. 2020. Incorporating stylistic lexical preferences in generative language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1074–1079.
Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552.
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.
Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
Bakhtiyar Syed, Gaurav Verma, Balaji Vasan Srinivasan, Anandhavelu Natarajan, and Vasudeva Varma. 2020. Adapting language models for non-parallel author-stylized rewriting. In AAAI, pages 9008–9015.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Gaurav Verma and Balaji Vasan Srinivasan. 2019. A lexical, syntactic, and semantic perspective for understanding style in text. arXiv preprint arXiv:1909.08349.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.