Stylized Dialogue Response Generation Using Stylized Unpaired Texts
Yinhe Zheng, Zikai Chen, Rongsheng Zhang, Shilei Huang, Xiaoxi Mao, Minlie Huang
Department of Computer Science and Technology, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
Fuxi AI Lab, NetEase Inc., Hangzhou, China
Samsung Research China - Beijing (SRC-B), Beijing, China
[email protected], [email protected]
{zhangrongsheng, huangshilei, maoxiaoxi}@[email protected]

Abstract
Generating stylized responses is essential to build intelligent and engaging dialogue systems. However, this task is far from well-explored due to the difficulties of rendering a particular style in coherent responses, especially when the target style is embedded only in unpaired texts that cannot be directly used to train the dialogue model. This paper proposes a stylized dialogue generation method that can capture stylistic features embedded in unpaired texts. Specifically, our method can produce dialogue responses that are both coherent to the given context and conform to the target style. In this study, an inverse dialogue model is first introduced to predict possible posts for the input responses, and then this inverse model is used to generate stylized pseudo dialogue pairs based on these stylized unpaired texts. Further, these pseudo pairs are employed to train the stylized dialogue model with a joint training process, and a style routing approach is proposed to intensify stylistic features in the decoder. Automatic and manual evaluations on two datasets demonstrate that our method outperforms competitive baselines in producing coherent and style-intensive dialogue responses.
Building a conversational agent that can produce stylized and coherent responses has been one of the major challenges in dialogue systems (Huang et al., 2020). Such an agent can not only yield more vivacious dialogues but also deliver more engaging conversations by taking advantage of the linguistic style matching phenomenon (Niederhoffer and Pennebaker, 2002), which suggests that people tend to adjust their linguistic style during communication to pursue higher engagement.

∗ Corresponding Author: [email protected]
[Figure 1: A pipelined approach to produce formal dialogue responses. For a post x, a response y is first produced using a dialogue model, and then it is transferred to a formal response ỹ using a text style transfer model. Example: post "Thanks for helping my husband." → response "nevermind XD" → transferred either to "The pleasure is all mine, my lady." (✓) or to "The pleasure is all mine, sir." (✗).]

Generating stylized dialogue responses has been investigated in various studies, where the definition of styles covers a variety of subtle concepts, such as sentiment (Shen et al., 2017b), emotion (Zhou et al., 2018), or persona (Li et al., 2016b). Despite the success, previous studies are generally conducted in a fully supervised setting that requires dialogue pairs in the target style. However, in most cases, the stylistic features we want to capture are embedded in unpaired texts that cannot be directly utilized by these supervised models (Gao et al., 2019).

Few studies for dialogue modeling have been proposed to capture the stylistic features embedded in unpaired texts. Specifically, Niu and Bansal (2018) employ a style-aware reinforce loss, and Gao et al. (2019) resort to a joint continuous latent space. However, despite the reported feasibility, we argue that due to the discrete nature of texts and the subtle definition of text styles, it is hard to produce coherent and style-specific responses by relying on sparse reinforce signals or by controlling continuous representations.

Note that we can also implement a straightforward stylized dialogue generation pipeline with the help of an unsupervised text style transfer model (Hu et al., 2017), which can be trained using stylized unpaired texts. Specifically, for a post x, a non-stylized dialogue response y is first generated using a regular dialogue model, and then y is transferred to a stylized response ỹ using a text style transfer model.
However, this approach may hurt the coherence between x and ỹ since the style transferring process is unaware of x and thus may introduce improper contents. As shown in Figure 1, the style transfer model generates a strong stylistic word "sir" to emphasize the formality of ỹ. However, this makes ỹ incoherent with x since x is most likely to be issued by a female.

In this paper, we propose to build a stylized dialogue generation model that can capture stylistic features embedded in a set of unpaired texts D_s. Specifically, in order to tackle the problem of lacking stylized dialogue pairs, an inverse dialogue model is built to predict posts based on the responses, and a set of stylized pseudo dialogue pairs is constructed by producing pseudo posts for texts in D_s. Then a stylized dialogue model is trained using these pseudo pairs, and a joint training process is introduced to enhance the coherency between the post and the resulting responses. Moreover, our dialogue models are parameterized using the Transformer-based encoder-decoder framework and initialized with the pretrained GPT weights (Radford et al., 2018). A style routing approach is devised to fuse a style embedding into each decoder block of the stylized dialogue model to intensify the stylistic features in the decoding process.

We evaluate our method on two datasets with two distinct writing styles: 1) Jinyong novels in Chinese (Jinyong is a famous Chinese writer who wrote many Kung Fu novels), and 2) formality in English writing. Automatic and human evaluations show that our method significantly outperforms competitive baselines by a large margin in generating coherent dialogue responses while rendering stronger stylistic features.

Our contributions can be summarized as:

1) A novel method is proposed to build a stylized dialogue model that can capture stylistic features embedded in unpaired texts. Specifically, an inverse dialogue model is introduced to generate stylized pseudo dialogue pairs, which are further utilized in a joint training process.

2) An effective style routing approach is devised to further intensify the stylistic features in the decoder.

3) Automatic and human evaluations on two datasets show that our method outperforms competitive baselines by a large margin in producing stylized and coherent dialogue responses.
Stylized dialogue generation has attracted considerable attention in recent years (Gao et al., 2019; Niu and Bansal, 2018). With a rather wide definition of styles, various studies that focus on controllable dialogue generation have been categorized as "stylized" dialogue generation, such as generating personalized (Li et al., 2016b) or emotional (Zhou et al., 2018) dialogues. However, the training process of these dialogue models usually requires dialogue pairs in the target style, whereas our study aims to capture stylistic features embedded in unpaired texts.

Moreover, the styles defined in most previous studies are deeply fused with the text contents (Tikhonov et al., 2019). Enforcing these styles may limit the expressive ability of the dialogue model because there are contradictions between certain semantic contents and style categories. For example, it is hard, if not impossible, for a service agent to yield comforting contents when enforcing a negative sentiment. Unlike most previous works, our study investigates modeling the writing styles that are "orthogonal" to the text semantics, so that the contents we want to deliver will not be constrained by the style we intend to render.
Text style transfer is a related but different task compared to our work. Specifically, text style transfer models aim to preserve the style-agnostic contents of the input text (Fu et al., 2018). In contrast, our study aims to produce coherent responses rather than to preserve the contents of the posts. Early works on this task focus on disentangling the representations of styles and contents (Hu et al., 2017; Shen et al., 2017a; Prabhumoye et al., 2018). However, recent studies question the effectiveness of such disentanglement (Lample et al., 2019) and propose to revise the latent codes using classifiers (Liu et al., 2020; Wang et al., 2019). Some works also propose to render the target styles by replacing stylistic words (Wu et al., 2019a,b).

We have also noticed a recent work that considers a contextual constraint in the text style transferring process (Cheng et al., 2020). However, although being feasible, the training of this model requires style-labelled parallel data. This hinders us from directly employing this model in our study since these parallel data are usually unavailable.

[Figure 2: Overall framework. The inverse dialogue module (inverse encoder ê and inverse decoder d̂) maps a response y to a pseudo post x; the stylized dialogue module (encoder e and decoder d) maps a post x and a style label S_i to a response y.]
Back translation is a popular approach that has been widely employed in various NLP tasks such as machine translation (Sennrich et al., 2016), dialogue data augmentation (Su et al., 2020), and text style transfer (Zhang et al., 2018; Lample et al., 2019; Dai et al., 2019). This approach is similar to the inverse dialogue model introduced in our study. However, different from previous approaches that focus on modeling the one-to-one mapping between the source and target languages, our inverse dialogue model tries to capture the one-to-many mappings between the responses and posts with the help of the proposed joint training process. In our study, the diversity of the generated pseudo posts is enhanced using a sampling approach.
In this study, we propose to build a stylized dialogue model without utilizing dialogue pairs in the target style. Specifically, our method takes as input two sets of data in the training stage: 1) M unpaired texts D_s = {t_1, ..., t_M} in the writing style S_1; 2) N dialogue pairs D_p = {⟨x_1, y_1⟩, ..., ⟨x_N, y_N⟩} with style S_0, where x_i and y_i are the post and response, respectively. Our stylized dialogue model aims to generate a response y that is coherent to a given post x while exhibiting a certain style S_i (i = 0, 1):

    y = argmax_{y'} p(y' | x, S_i).    (1)

Our model consists of two mirrored sub-modules (Figure 2): (1) a stylized dialogue module (i.e., e and d in Figure 2) that can produce a stylized response y based on a given post x and a style label S_i (i = 0, 1); a style routing approach is devised to incorporate stylistic features in d; (2) an inverse dialogue module (i.e., ê and d̂ in Figure 2) that aims to produce pseudo posts x based on an input response y. Note that the inverse dialogue model is introduced to tackle the problem of lacking dialogue pairs in style S_1, i.e., we can regard the texts in D_s as possible dialogue responses and use the predicted pseudo posts to construct pseudo dialogue pairs in style S_1. Therefore, we omit the style label in the inverse decoder d̂ to encourage it to focus more on the semantic aspect of the dialogue.

[Figure 3: Architecture of the stylized dialogue model. In each decoder block, a masked multi-head attention over the shifted-right response and an un-masked multi-head attention over the encoded post are averaged, and a style embedding is added to the result before the feed-forward and layer-norm sub-layers.]

The dialogue modules in our study are parameterized using the Transformer-based encoder-decoder architecture (Vaswani et al., 2017) and are initialized using pretrained GPT (Radford et al., 2019) weights. Further, we also follow previous works (Golovanov et al., 2019) in sharing the weights of the encoder and decoder from the same sub-module to save memory. Particularly, the weights of e and d are shared, and the weights of ê and d̂ are shared.

Moreover, to better capture the one-to-many phenomenon and alleviate the problem of producing trivial posts in the inverse dialogue model, a top-k sampling scheme is employed to sample multiple pseudo posts for each stylized text in D_s, and all these posts are utilized in the training process. Further, a joint training process is also introduced to train these two sub-modules in an iterative fashion to enhance the coherency of the response.

3.3 Style Routing

There exist various approaches to condition the decoder d on the style label, for example, employing a special style token as the start token (Lample et al., 2019) or adding a style embedding to each word embedding (Zheng et al., 2020).
However, these approaches only incorporate the style representation in the input layer of the decoder, whereas the higher layers are not affected.

In this study, a style routing approach is devised to enhance existing approaches to stylize d in the stylized dialogue model (see Figure 3). Specifically, in each decoder block, we first fuse the representations of the post x and the previously decoded token sequence y_p using the attention routing mechanism (Zheng et al., 2020), i.e., two sequences of representations, R_prev, R_post ∈ R^{l×h}, are first calculated:

    R_prev = MMHA[e_w(y_p), e_w(y_p), e_w(y_p)],    (2)
    R_post = MHA[e_w(y_p), e(x), e(x)],    (3)

where e_w(y_p) ∈ R^{l×h} denotes the embedding of y_p and is used as the query in MMHA and MHA, which represent the masked and un-masked multi-head attention operations, respectively; l is the length of y_p, h is the hidden size, and e(x) is the output of the encoder. A sequence of fused representations R_avg is obtained as:

    R_avg = (R_prev + R_post) / 2.    (4)

Then for a given style S_i, a style embedding e_s(S_i) ∈ R^{1×h} is allocated, and e_s(S_i) is routed into R_avg by adding it to each time step of the sequence:

    R_merge = R_avg + e_s(S_i).    (5)

Also note that the fusion operations in Eqs. 4 and 5 are similar to some previous studies that try to incorporate additional contexts in a Transformer-based decoder (Golovanov et al., 2019). However, different from these approaches that focus on modeling sequential contexts, the styles modeled in our study are categorical, and more priority is allocated to the style representation in our model. Moreover, we are the first to use such a style routing approach in the stylized dialogue generation task.

The training of our model involves the following losses: 1) standard maximum log-likelihood losses
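For concreteness, Eqs. 2–5 can be sketched as a single decoder-block module in PyTorch. This is an illustrative sketch, not the authors' released code: the class name, tensor shapes, and the omission of the layer-norm and feed-forward sub-layers are our simplifications.

```python
import torch
import torch.nn as nn

class StyleRoutingBlock(nn.Module):
    """Attention routing of Eqs. 2-5: average a masked self-attention route
    and a cross-attention route, then add a per-style embedding e_s(S_i)."""

    def __init__(self, hidden: int, n_heads: int, n_styles: int = 2):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.style_emb = nn.Embedding(n_styles, hidden)  # e_s(S_i), one row per style

    def forward(self, y_p, enc_x, style_id, causal_mask):
        # Eq. 2: masked multi-head self-attention over the decoded prefix y_p.
        r_prev, _ = self.masked_attn(y_p, y_p, y_p, attn_mask=causal_mask)
        # Eq. 3: un-masked multi-head attention from y_p onto the encoded post e(x).
        r_post, _ = self.cross_attn(y_p, enc_x, enc_x)
        # Eq. 4: average the two routes.
        r_avg = (r_prev + r_post) / 2
        # Eq. 5: route the style embedding into every time step.
        return r_avg + self.style_emb(style_id).unsqueeze(1)

# Hypothetical shapes: batch 2, prefix length l = 5, post length 7, hidden h = 64.
block = StyleRoutingBlock(hidden=64, n_heads=4)
y_p = torch.randn(2, 5, 64)
enc_x = torch.randn(2, 7, 64)
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
r_merge = block(y_p, enc_x, torch.tensor([0, 1]), mask)  # shape (2, 5, 64)
```

Conditioning every block this way, rather than only the input layer, is what lets the style signal reach the higher decoder layers.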
Algorithm 1
Joint training process
Input: M unpaired texts D_s = {t_i}_{i=1}^{M} in style S_1; N dialogue pairs D_p = {⟨x_i, y_i⟩}_{i=1}^{N} in style S_0.
Output: A stylized dialogue model.
1:  Init the stylized and inverse dialogue models e, d, ê, d̂
2:  while not converged do
3:    Sample n_d dialogue pairs D_bp = {⟨x_i, y_i⟩}_{i=1}^{n_d} ⊂ D_p
4:    Train e and d by optimizing L_{p→r} (Eq. 6) on D_bp
5:    Train ê and d̂ by optimizing L_{r→p} (Eq. 7) on D_bp
6:    if current step > N_f then
7:      D_pp ← empty set
8:      Sample n_s stylized texts D_bs = {t_i}_{i=1}^{n_s} ⊂ D_s
9:      for each t_i ∈ D_bs do
10:       Decode m posts {x'_{ij}}_{j=1}^{m} from p_d̂(x | ê(t_i))
11:       D_pp ← D_pp ∪ {⟨x'_{ij}, t_i⟩}_{j=1}^{m}
12:     end for
13:     Train e and d by optimizing L_inv (Eq. 8) on D_pp
14:   end if
15: end while

evaluated on dialogue pairs from D_p:

    L_{p→r} = E_{⟨x,y⟩∼D_p} [−log p_d(y | e(x), S_0)],    (6)
    L_{r→p} = E_{⟨x,y⟩∼D_p} [−log p_d̂(x | ê(y))].    (7)

The losses L_{p→r} and L_{r→p} are used to train the stylized dialogue model and the inverse dialogue model, respectively; 2) an inverse dialogue loss evaluated on texts from D_s:

    L_inv = E_{t∼D_s, x'∼p_d̂(x|ê(t))} [−log p_d(t | e(x'), S_1)],    (8)

in which x' is a pseudo post sampled from the inverse dialogue model.

Note that the gradient back-propagation through the loss L_inv is intractable due to the non-differentiable sampling process in Eq. 8. In this study, we approximate the ideal back-propagation process through L_inv by truncating the gradients associated with the sampling operation. Specifically, when optimizing L_inv, the parameters of the inverse dialogue model are fixed, and the stylized dialogue model is trained with pseudo posts x' that are sampled from the inverse dialogue model. Similar approaches have been proven to be effective in other NLP tasks (Lample et al., 2018; He et al., 2020). However, unlike previous works that use the greedy decoding scheme, our study employs the top-k sampling scheme with beam search to produce x', since the mapping between dialogue responses and posts is not unique. The greedy decoding scheme may limit the diversity of the decoded pseudo posts and lead to sub-optimal performance.

Dataset    Split   Train (D_p)   Train (D_s)   Test (D_t)
WDJN       Size    300.0K        95.13K        2.0K + 2.0K
           Style   Weibo         Jinyong       Weibo / Jinyong
TCFC       Size    217.2K        500.0K        0.97K + 0.97K
           Style   Informal      Formal        Informal / Formal
Table 1: Statistics of datasets
To facilitate the learning with the above gradient approximation approach, a joint training process is introduced to train the model iteratively. Specifically, in each training iteration, we first update the stylized and inverse dialogue models by optimizing the losses L_{p→r} and L_{r→p} using a batch of dialogue pairs sampled from D_p. Further, a batch of stylized sentences D_bs is sampled from D_s. For each sentence t_i ∈ D_bs, m pseudo posts x'_{i1}, ..., x'_{im} are sampled from the inverse dialogue model, and m pseudo dialogue pairs ⟨x'_{ij}, t_i⟩ (j = 1, ..., m) in the style S_1 are constructed. These pseudo pairs are used to train the stylized dialogue model with the loss L_inv. Moreover, to avoid corrupted pseudo posts at the beginning of the training process, we pre-train the inverse dialogue model on L_{r→p} for N_f steps before using it to decode pseudo posts. The detailed training process is summarized in Algorithm 1.

Our method is evaluated on two datasets with two distinct styles (see statistics in Table 1).
1) WDJN: We collected 300K Weibo dialogues (style S_0) as D_p and sampled 95.1K stylized texts from Jinyong's novels (style S_1) as D_s. Moreover, we also extracted 2K dialogue pairs from Jinyong's novels with hand-designed rules. These dialogues are used as the test set D_t together with 2K additional Weibo dialogues. Note that all the Weibo dialogues in our WDJN dataset (both training and testing) are manually inspected and filtered by annotators.

Also note that to prevent the model from copying stylistic phrases in the post when producing Jinyong-style responses in the testing phase, we erase the stylistic features related to Jinyong's writing from the posts in these 2K Jinyong-style dialogues in D_t using the back translation approach (Zhang et al., 2020). Moreover, all the resulting posts are manually checked and revised to ensure the stylistic features related to style S_1 are erased. More details about the WDJN dataset can be found in Appendix A. The WDJN dataset will be released for public use.
2) TCFC (Wu et al., 2020): This dataset focuses on the formality in English writing. We sampled 217.2K informal dialogue pairs (style S_0) as D_p and 500.0K formal texts (style S_1) as D_s from the original dataset, and used the test data in the original dataset as our test set D_t, which contains 1,956 manually-crafted dialogue pairs (978 informal pairs and 978 formal pairs).

For experiments on the WDJN and TCFC datasets, we used the pre-trained CDial-GPT (Wang et al., 2020) and DialoGPT (size 345M) (Zhang et al., 2019) models to initialize the dialogue modules, respectively. The top-k sampling process in Algorithm 1 employs k = 20 and a beam size of 4 (WDJN) or 2 (TCFC). The value of N_f is set to 300. The training of our model stops after 10 iteration epochs on D_p (WDJN) or after 8,000 steps of updates (TCFC). See Appendix B for more reproduction details.

We choose two groups of baselines.

The first group contains dialogue models with different style modeling schemes: S2S (Golovanov et al., 2019): a strong Transformer-based dialogue model that is only trained on D_p; this baseline can only produce responses in style S_0. SLM: the "Fusion" model proposed by Niu and Bansal (2018), in which an independent stylized language model is trained on D_s, and the distributions decoded from the S2S baseline and the stylized LM are fused when producing responses in style S_1. SRL: the "RL" model proposed by Niu and Bansal (2018), in which a reinforce signal produced by a style classifier is used to enforce the style S_1. SFusion (Gao et al., 2019): a fused latent space is built using a multi-task training scheme; specifically, for each post, six responses are sampled, and two classifiers are used to rank these responses for the styles.

The second group of baselines is built using the pipelined approach, i.e., different unsupervised text style transfer models are trained on texts from D_s and D_p, and responses produced by the S2S baseline (in style S_0) are transferred to exhibit the target style S_1 using these models: S2S+BT: a back-translation-based text style transfer model (He et al., 2020). S2S+CT: a model that tries to entangle the latent codes for styles and contents (Wang et al., 2019). S2S+PTO: a model that renders the target style by replacing stylistic words (Wu et al., 2019a).

Note that for the baselines SLM, SRL, and all the baselines in the second group, the responses generated by the S2S baseline are used as their responses for the style S_0, since they can only produce responses in S_1 once trained. Moreover, for fair comparisons, we implemented baselines 1-3 using the same architecture and hyper-parameters as in our model. For baselines 4-7, we used the official code released by the authors. Note that it is non-trivial to utilize the pre-trained GPT model in the baseline SFusion since it handles fixed-length latent codes.

[Table 2: Automatic and manual evaluation results for responses with style S_1 (columns for each of the WDJN and TCFC datasets: BLEU-1,2, Dist., BERT, SVM, Flu., Coh., Style, HAvg.). All differences between our model and the baselines are significant. Most entries were lost in extraction; the surviving Human reference row reads: WDJN — N/A, 49.3, 80.1, 85.4, 1.93, 1.60, 1.53, 1.67; TCFC — N/A, 62.7, 89.6, 85.8, 1.91, 1.18, 1.83, 1.56.]

[Table 3: Automatic and manual evaluation results for responses with style S_0 (same columns as Table 2). All differences between our model and the baselines are significant. Most entries were lost in extraction; the surviving Human reference row reads: WDJN — N/A, 56.4, 97.9, 94.4, 1.89, 1.86, 1.98, 1.91; TCFC — N/A, 72.6, 72.0, 72.1, 1.76, 1.19, 1.76, 1.52.]
We first used automatic metrics to evaluate the response quality of our model: 1) BLEU (Papineni et al., 2002) was used to measure the n-gram (n = 1, 2) overlap between the generated responses and the reference responses; 2) Distinct (Dist.) (Li et al., 2016a) measures the proportion of unique n-grams in the generated responses (n = 2). To evaluate the style intensity of each model, we first trained two text style classifiers (i.e., BERT (Devlin et al., 2019) and SVM) and then calculated the style intensity score as the proportion of generated responses that conform to the target style based on these classifiers. In our study, the texts from D_p and D_s were used to train the classifiers for the WDJN experiments, and the formal/informal texts from the GYAFC dataset (Rao and Tetreault, 2018) were used to train the classifiers for the TCFC experiments. The accuracy of the BERT and SVM classifiers on the holdout test set was 98.52% and 94.20%, respectively, for the WDJN experiments, and 93.98% and 89.57%, respectively, for the TCFC experiments (see Appendix C for more details).

Results:
[Figure 4: Averaged style intensity scores for responses with different coherency scores, on (a) the WDJN dataset and (b) the TCFC dataset (models: S2S, SRL, S2S+PTO, SLM, SFusion, S2S+BT, Ours).]

[Table 4: Automatic and manual evaluation results of ablation models for responses with styles S_0 and S_1 (columns as in Table 2). All differences between our model and the ablation models are significant; the numeric entries were lost in extraction.]

We separately evaluated the responses in style S_1 (Table 2) and style S_0 (Table 3). Note that the baseline S2S is not included in Table 2 since it cannot produce responses in style S_1. Similarly, only the baselines S2S and SFusion are contained in Table 3. Significance tests are performed between the results of our model and all the baselines using the t-test with bootstrap resampling (Koehn, 2004).

As can be seen from the automatic results, our method outperforms all the baselines by large margins when generating dialogue responses in style S_1 (Table 2), and achieves competitive performance when producing responses in style S_0 (Table 3). This indicates that our model can produce high-quality responses that are both coherent to the given context and consistent with the target style. We can further observe that: 1) The pipelined approaches achieve lower BLEU scores compared to our method. This verifies our claim that the response coherency is affected by the style transferring process. Similar results are also observed in the manual evaluation. 2) The high diversity (i.e., Dist. scores) of the baselines on the TCFC dataset comes along with a dramatic decrease in the BLEU scores. This is because these baselines overfit to the diverse colloquial phrases in the informal responses and fail to render responses in style S_1, which are more formal and less diverse.

Also note that the style intensity scores for human-generated responses (last rows in Tables 2 and 3) do not match the accuracy of our style classifiers. This is because the training data of these classifiers involve non-conversational texts, which leads to mismatches when testing on conversational responses. To alleviate this mismatch, we performed manual evaluations to corroborate our analysis.

For a given post, dialogue responses with different styles were generated using our model and all the baselines. Three annotators were recruited from a crowd-sourcing platform to evaluate these responses from three aspects: 1) Fluency (Flu.): whether the response is fluent and free from grammar errors; 2) Coherency (Coh.): whether the response is coherent with the dialogue context; 3) Style Intensity (Style): whether the response conforms to the given style. Each metric is rated among {0, 1, 2}, in which 0 means worst and 2 best. Moreover, the Harmonic Average (i.e., HAvg.) of the above measures is also reported.
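Assuming HAvg. is the standard harmonic mean of the three manual scores (the formula is not spelled out in the text, but this reading reproduces the surviving Human rows, e.g. Flu. 1.93, Coh. 1.60, Style 1.53 giving HAvg. 1.67), it can be computed as:

```python
def harmonic_average(flu: float, coh: float, style: float) -> float:
    """Harmonic mean of the three manual scores; assumes positive values
    (per-response 0/1/2 ratings are averaged over many annotations first)."""
    return 3.0 / (1.0 / flu + 1.0 / coh + 1.0 / style)

# Human row of Table 2 on WDJN: Flu. 1.93, Coh. 1.60, Style 1.53
print(round(harmonic_average(1.93, 1.60, 1.53), 2))  # → 1.67
```

Compared to the arithmetic mean, the harmonic mean penalizes a model that scores well on two aspects but poorly on the third, which suits a setting where fluency, coherency, and style must all be high.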
Results:
We sampled 300 posts from D t foreach of these two datasets. Fleiss’s kappa κ (Ran-dolph, 2005) was used to measure the annotationagreement between annotators. Specifically, for Flu. , Coh. , and
Style , the κ value was 0.69, 0.50,0.86, respectively on the WDJN dataset (indicatingsubstantial, moderate, and substantial agreement),and 0.44, 0.31, 0.42, respectively on the TCFCdataset (indicating moderate, fair, and moderateagreement).As shown in Table 2, our model surpasses all thebaselines significantly on style intensity (except forS2S+CT on the WDJN dataset, which comes withdramatic decreases on the fluency and coherencyscores) when producing responses in style S , andit achieves competitive or higher fluency and co-herency scores. This verifies the superiority of ourmethod in producing coherent and style intensifieddialogue responses. Moreover, results in table 3also shows that our model achieved competitiveperformance when generating responses in style S .Another interesting observation from the resultsin Table 2 and 3 is the trade-off between the co-herency and style intensity when generating styl-ized dialogue responses, i.e., the high style intensityusually comes at the cost of a low coherency. Forexample, the model S2S+CT achieves the best styleintensity score on the WDJN dataset (1.50) whenproducing responses in style S , but it obtains the DJN dataset TCFC datasetPost Haven’t eaten hot pot in a long time ( 好 久 没 吃 火 锅 了 ) It’s only 9:57 pm and I’m already falling asleep. S S2S I haven’t eaten hot pot in a long time too ( 我 也 好 久 没 吃 火 锅 ) You’re not falling asleep yet, lolSFusion I also want to eat, just started ( 我 也 想 吃 , 刚刚 开 始 ) dude same here, my friend has a reason at nightOurs I also want to eat ( 我 也 想 吃 了 ) it’s almost 9 am here and i just got up... S SLM With that said, I want to eat too ( 这 么一 说 , 我 也 想 吃 了 ) I have a headache and I can not stop drinking.SRL I haven’t eaten in a long time. 
I really want to eat ( 好 久 没 吃 了 , 好 想 吃 啊 ) isn’t it 5:30 in the morning?SFusion I’m almost done ( 我 已 经 快 好 了 ) Same here but I think it’s gna say hello!S2S+BT We haven’t eaten hot pot in a long time ( 我 们 好 久 没 吃 火 锅 ) She is not falling asleep yet.S2S+CT I have no problem for a long time too. I went to the hot pot butunfortunately they didn’t ( 我 也 好 久 没 问 题 , 老 衲 去 打 了 火 锅 可 惜 他们 没 ) That is not falling asleep then Maguties out forriddle.S2S+PTO I haven’t eaten hot pot in a long time too ( 我 也 好 久 没 吃 火 锅 ) / ’ re not falling asleep yetOurs Pretty good, but hero, you are hungry for a whole day. Let’s eatfirst! ( 不 错 , 大 侠 饿 了一 天 , 现 下 先 吃 饭 吧 ! ) Yes, it is 9:06 pm here, and I am still on thecouch. Table 5: Example responses produced by our model and the baselines on the TCFC and WDJN datasets. worst coherency (0.19) score. This phenomenon isalso observed in various previous studies (Niu andBansal, 2018; Zheng et al., 2020). Nevertheless,our model achieves a competitive coherency whileproducing style-intensive responses.Also note that the baselines SFusion, SRL, andS2S+CT generally yield low
Avg. scores on both datasets. This verifies our claim that it is hard to generate stylized and coherent responses relying on the sparse reinforcement signals (i.e., SRL) or continuous latent codes (i.e., SFusion and S2S+CT).

The superiority of our method in generating stylized dialogue responses is further demonstrated by analyzing the style intensity scores of responses with different levels of coherency. Specifically, all the annotated responses that were generated with a designated style S were collected and categorized into three groups based on the coherency scores (i.e., 0, 1, or 2) they received. The averaged style intensity score for each group is shown in Figure 4. Our model achieves the highest style intensity score in all coherency groups, which further demonstrates that the responses produced by our method are more style-intensive than those produced by the baselines.

Ablation studies were performed to verify the effect of each component in our method. Specifically, the following variants were tested: 1) Without the style routing approach (w/o Rout.), i.e., the style embedding is not explicitly incorporated in each decoder block as in Eq. 5. Instead, the decoder d is stylized by employing a style token as the start token and adding a style embedding to each word embedding; 2) Without the joint training process (w/o JointT), i.e., an inverse dialogue model is first trained, and then a fixed set of pseudo pairs is generated and used to train the stylized dialogue model. Note that this variant uses the same amount of pseudo pairs to optimize the loss L_inv as Algorithm 1 does; 3) Without the top-K sampling scheme when producing pseudo posts (w/o Samp.), i.e., pseudo pairs are decoded greedily; 4) Without the pre-trained GPT weights (w/o PreT).

As shown in Table 4, our model achieves the highest BLEU and Coh. scores among all the ablation models. We can further observe that: 1) Almost all our variants surpass the baselines by a large margin on the style intensity score, which verifies the feasibility of our framework in capturing stylistic features; 2) Removing the joint training process (w/o JointT) or the top-K sampling scheme (w/o Samp.) makes the dialogue models over-fit to rendering stylistic features while failing to achieve high BLEU and Coh. scores. However, we argue that since our stylized decoder is already strong in capturing stylistic features, it is critical to utilize the proposed joint training and top-K sampling scheme to improve the response coherency; 3) The pre-training approach significantly improves the diversity and coherency of the generated responses.
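The contrast probed by the w/o Samp. variant, between greedy decoding and top-K sampling when the inverse model emits pseudo posts, can be illustrated with a minimal sketch. This is plain Python over a toy next-token distribution; the vocabulary and probabilities are hypothetical, not taken from the paper's model:

```python
import random

def top_k_sample(probs, k, rng):
    """Sample the next token from the k most probable entries, renormalized."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    acc = 0.0
    for tok, p in top:
        acc += p
        if r <= acc:
            return tok
    return top[-1][0]

def greedy(probs):
    """Greedy decoding always picks the single most probable token."""
    return max(probs, key=probs.get)

# Hypothetical next-token distribution from an inverse dialogue model.
probs = {"how": 0.40, "why": 0.25, "what": 0.20, "when": 0.10, "who": 0.05}
rng = random.Random(0)
samples = {top_k_sample(probs, k=3, rng=rng) for _ in range(200)}
```

Greedy decoding collapses every response to a single pseudo post, whereas top-K sampling yields varied pseudo posts for the same response, which is consistent with the observation that removing sampling encourages over-fitting.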
Table 5 shows some dialogue responses generated by our model and the baselines on the two datasets. We can observe that the models that directly manipulate the continuous latent space (i.e., SFusion and S2S+CT) yield non-fluent responses. This is because it is hard to build a smooth latent space for discrete texts. Moreover, pipelined approaches either fail to convert the inputs to the target style (i.e., S2S+PTO on the WDJN dataset) or hurt the coherency between the response and the post (i.e., S2S+BT, S2S+CT, and S2S+PTO on the TCFC dataset).

In addition, we sampled some of the pseudo pairs generated by the inverse dialogue model in the training phase (Table 6). It can be seen that these pseudo pairs are generally of high quality in both fluency and coherency.

TCFC dataset
Pseudo Post: Are you enjoying the new album?
Text in S: Yes, I am. I am loving her last cd.
Pseudo Post: I'm so tired of golf.
Text in S: What is the point of golf?
Pseudo Post: Hey, are you going to the game tonight?
Text in S: Hardly, I live up north. Maybe next time.

WDJN dataset
Pseudo Post: I am very, very sad today (今天的我，伤心的不得了)
Text in S: Are you hurt? (你受伤了么？)
Pseudo Post: Did anyone come to see me today? (今天有人来看我么？)
Text in S: Brother, someone is coming. (大哥，有人来啦。)
Pseudo Post: I'm going to kill you today (今天我要杀了你)
Text in S: Dude, you could have killed me, but you didn't. (老兄，刚才你本可杀我，没有下手。)

Table 6: Example pseudo pairs generated by the inverse dialogue model in the training process.
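The data flow that produces pairs like those in Table 6 can be sketched as follows. This is an illustrative stub only: `inverse_model` stands in for the paper's trained transformer-based inverse dialogue model, and the lambda below is a hypothetical placeholder, not the real generator:

```python
def build_pseudo_pairs(stylized_texts, inverse_model):
    """Pair each stylized unpaired text with a predicted pseudo post.

    The resulting (pseudo_post, stylized_response) pairs provide extra
    supervision for the forward stylized dialogue model.
    """
    pairs = []
    for response in stylized_texts:
        # The inverse model reads a response and predicts a plausible post.
        pseudo_post = inverse_model(response)
        pairs.append((pseudo_post, response))
    return pairs

# Stand-in inverse model for illustration only; the paper decodes pseudo
# posts from a trained generator with top-K sampling.
stub_inverse = lambda resp: "<pseudo post for: " + resp + ">"
pairs = build_pseudo_pairs(["Yes, I am. I am loving her last cd."], stub_inverse)
```

Under joint training, this pairing step is repeated as the inverse model improves, so the pseudo pairs are refreshed rather than fixed once, as in the w/o JointT ablation.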
In this paper, we present a stylized dialogue generation method that can produce coherent and style-intensive responses by utilizing stylized unpaired texts. An inverse dialogue model is introduced in our method to produce stylized pseudo dialogue pairs, which are used in a joint training process to train the stylized dialogue model. Further, a style routing approach is introduced to intensify stylistic features in the decoding process. We demonstrate our method on two datasets with two different styles: Chinese Jinyong novels and formality in English writing. Automatic and manual evaluations show that our method outperforms competitive baselines in producing coherent and style-intensive responses. As future work, we will extend this method to other stylized text generation tasks.
References
Yu Cheng, Zhe Gan, Yizhe Zhang, Oussama Elachqar, Dianqi Li, and Jingjing Liu. 2020. Contextual text style transfer. ArXiv, abs/2005.00136.

Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5997–6007, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xiang Gao, Yizhe Zhang, Sungjin Lee, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2019. Structuring latent spaces for stylized response generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1814–1823.

Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf. 2019. Large-scale transfer learning for natural language generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6053–6058.

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1587–1596. JMLR.org.

Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in building intelligent open-domain dialog systems. ACM Transactions on Information Systems.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.

Dayiheng Liu, Jie Fu, Yidan Zhang, Chris Pal, and Jiancheng Lv. 2020. Revision in continuous space: Unsupervised text style transfer without adversarial learning. In AAAI Conference on Artificial Intelligence.

Kate G. Niederhoffer and James W. Pennebaker. 2002. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337–360.

Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876, Melbourne, Australia. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Blog.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater K[free]): An alternative to Fleiss' fixed-marginal multirater kappa. Online Submission.

Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017a. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017b. A conditional variational framework for dialog generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 504–509, Vancouver, Canada. Association for Computational Linguistics.

Hui Su, Xiaoyu Shen, Sanqiang Zhao, Zhou Xiao, Pengwei Hu, Randy Zhong, Cheng Niu, and Jie Zhou. 2020. Diversifying dialogue generation with non-conversational text. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7087–7097, Online. Association for Computational Linguistics.

Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, and Ivan P. Yamshchikov. 2019. Style transfer for texts: Retrain, report errors, compare with rewrites. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3927–3936.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Ke Wang, Hang Hua, and Xiaojun Wan. 2019. Controllable unsupervised text attribute transfer via editing entangled latent representation. In Advances in Neural Information Processing Systems, pages 11034–11044.

Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020. A large-scale Chinese short-text conversation dataset. In NLPCC.

Chen Wu, Xuancheng Ren, Fuli Luo, and Xu Sun. 2019a. A hierarchical reinforced sequence operation method for unsupervised text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4873–4883.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019b. Mask and infill: Applying masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5271–5277. International Joint Conferences on Artificial Intelligence Organization.

Yu Wu, Yunli Wang, and Shujie Liu. 2020. A dataset for low-resource stylized sequence-to-sequence generation. In AAAI.

Yi Zhang, Tao Ge, and Xu Sun. 2020. Parallel data augmentation for formality style transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation.

Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2018. Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894.

Yinhe Zheng, Rongsheng Zhang, Xiaoxi Mao, and Minlie Huang. 2020. A pre-training based personalized dialogue generation model with persona-sparse data. In AAAI.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence.