GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer
Yukai Shi, Sen Zhang, Chenxing Zhou, Xiaodan Liang, Xiaojun Yang, Liang Lin
Yukai Shi*, Sen Zhang*, Chenxing Zhou, Xiaodan Liang, Xiaojun Yang, Liang Lin

Abstract—Non-parallel text style transfer has attracted increasing research interest in recent years. Despite successes in transferring the style based on the encoder-decoder framework, current approaches still lack the ability to preserve the content and even the logic of original sentences, mainly due to the large unconstrained model space or overly simplified assumptions on the latent embedding space. Since language itself is an intelligent product of humans with certain grammars and has a limited rule-based model space by its nature, relieving this problem requires reconciling the model capacity of deep neural networks with the intrinsic model constraints from human linguistic rules. To this end, we propose a method called Graph-Transformer based Auto-Encoder (GTAE), which models a sentence as a linguistic graph and performs feature extraction and style transfer at the graph level, to maximally retain the content and the linguistic structure of original sentences. Quantitative experimental results on three non-parallel text style transfer tasks show that our model outperforms state-of-the-art methods in content preservation, while achieving comparable performance on transfer accuracy and sentence naturalness.
Index Terms—Text style transfer, graph neural network, natural language processing
I. INTRODUCTION
As one of the important forms of content in social multimedia, language draws considerable attention and has produced a series of text-related systems [1], [2], [3], [4], [5]. However, text style has long been neglected in the above applications and systems. Although style transfer technologies have made significant progress in the field of computer vision [6], [7], [8], [9], their success in natural language processing is much more limited. Recently some preliminary works have emerged aiming at this task for text corpora [10], [11], [12], where a sentence is transferred to a target style attribute (e.g. sentiment, gender, opinion, etc.) while preserving the same content except for the style part. The manipulation of sentence attributes has wide applications in dialog systems and many other natural language fields. Different from related tasks like machine translation [13] and generic text generation [14], corpora for text style transfer are usually non-parallel, and thus the training process is performed in an unsupervised manner. Lacking paired text corpora poses a great challenge to preserving the style-independent content while transferring the style guided by other datasets.

Current approaches mainly fall into the encoder-decoder paradigm. An encoder is used to extract latent embeddings of the sentence. Content and style embeddings are then disentangled in this space, so that style transfer can be achieved by manipulating the style feature vector, followed by a decoder to generate style-transferred sentences while keeping the content embeddings unchanged [10]. However, in practice these two components are often deeply entangled and it is difficult to separate them in the latent space empirically. Besides, early variational autoencoder [15] based methods take a Gaussian distribution as the prior of latent features, which is too simplified an assumption for natural language.

* The first two authors share equal authorship.
Attempts to alleviate these problems include generating the prior with a learned neural network to replace the naive Gaussian assumption [16], adopting a cross-alignment strategy [11], and translating the sentence to the embedding space of another language under the assumption that translation will remove style-related features automatically [17], [18], [19]. Nevertheless, due to the large search space of models for natural language processing tasks and the lack of paired training corpora, it is non-trivial to retain semantic information in the learned content features. In practice, these approaches can still suffer from the large model space of natural language and fail to preserve the content and even the logic of original sentences.

However, language itself is an intelligent product of human beings, and by its nature should be limited to a much smaller model space based on the linguistic rules which make up a language. Without parallel text corpora to restrict the content and the logic of transferred sentences, leveraging the linguistic knowledge provided by language grammars and logic rules could be an effective approach to prevent semantic information loss during the unsupervised training process. Inspired by this observation, we propose the Graph-Transformer based Auto-Encoder (GTAE) for non-parallel text style transfer, which models the linguistic constraints as graphs explicitly, and performs feature extraction and style transfer at the graph level, so as to maximally retain the content and the linguistic structure of original sentences. An illustration is provided in Fig. 1. In particular, we design two modules, i.e. a Self-Graph Transformer (SGT) that updates the latent node embeddings with the desired style attribute and a Cross-Graph Transformer (CGT) that retrieves the semantic information of input sentences, to replace the encoder and the decoder respectively. A linguistic graph is constructed using a publicly available semantic dependency parser based on learned grammar rules [20].
Both SGT and CGT modules perform latent graph transformations constrained by the same linguistic graph structure for each input sentence. Then a simple RNN rephraser is used to generate final sentences from the transferred graphs.

To evaluate the performance of our proposed method, we conduct experiments on three text style transfer datasets for sentiments, political slants and title types, where our
Fig. 1. An illustration of the constructed linguistic graph from the dependency tree parser and the style transfer process. The input sentiment is transferred from positive to negative while preserving style-irrelevant contents in the original sentence. Values in the adjacency matrix represent different dependency types determined by grammar rules, which are then used as the imposed linguistic constraints.

model outperforms state-of-the-art models in terms of content preservation as expected, while achieving highly comparable transfer intensities and naturalness. The main contributions of this work are:

(1) We propose to restrict the model space of natural language by its linguistic rules, rather than training without constraints or making overly simplified assumptions, to mimic the natural language generation process of humans.

(2) We introduce a graph structure to model the linguistic constraints and propose a framework named Graph-Transformer based Auto-Encoder to manipulate the latent space at the graph level. Our model is also highly extendable to other tasks and model designs.

(3) Empirical studies demonstrate the ability of our text style transfer system in preserving the semantic content. We also show that feature transformation in the graph-constrained model space does not harm transfer intensity or naturalness.

II. RELATED WORK

a) Image Style Transfer:
Style transfer has made great progress in computer vision. [6] extracts content and style features and then constrains the synthesized image to be close to both the content of the source image and the target style. [8] further exploits a perceptual loss that achieves better optimization efficiency. Besides, several works have built their models upon generative adversarial networks [21], [9] for cross-domain image transformation. These works provide insightful analysis and solutions for style transfer; however, it is non-trivial to adapt these methods to the text domain due to the essential difference between images and natural language.

b) Non-Parallel Text Style Transfer:
Style transfer with non-parallel text corpora has attracted increasing research interest in recent years. [10] used a variational autoencoder (VAE) to generate sentences with controllable attributes by disentangling latent representations. [12] further introduced adversarial training to learn style-irrelevant content embeddings. To relieve the simplistic Gaussian prior of latent embeddings in VAE [15], [11] proposed a cross-aligned strategy to directly match transferred text distributions with two adversarial discriminators, assuming a shared latent content distribution for both styles. As a pioneering work in text style transfer, [22] employs a reinforced model on non-parallel text data to perform style transfer. [23] uses pseudo-parallel data and neural machine translation to achieve language style transfer. [24] proposed a dual-agent reinforcement algorithm for text style transfer. Similarly, [25] incorporates a Gaussian kernel layer to finely control the sentiment intensity for fine-grained sentiment transfer. IMaT [26] constructs a pseudo-parallel dataset by aligning semantically similar sentences from pair-wise corpora to implement text attribute transfer. [27] presents a Style Transformer that employs the attention mechanism of the Transformer to achieve content preservation and text style transfer. [16] used a generator model to learn the latent priors instead, and minimized the Wasserstein distance between the latent embedding and the learned prior with an adversarial regularization loss. [28] replaced binary classifier discriminators with a target-domain language model to stabilize training signals from transferred sentences.

Our work differs from these antecedents by explicitly imposing linguistic constraints on the latent space for better content and logic preservation. It is worth noting that our proposed GTAE is highly extendable with the above modeling strategies by simply replacing their encoder-decoder with our SGT-CGT modules.
Another line of work resorts to back-translation methods that remove style attributes in the representation space of another language [17], [18], [19]. However, these methods still lack effective constraints on the latent space. [29] argues that text styles are often embodied in certain phrases and proposes a direct delete-and-retrieve approach. Though their method strictly preserves the linguistic structure, the computational load of retrieval prohibits its practical use, and the hard-manipulation strategy can suffer when style information cannot be simply disentangled from certain phrases. In comparison, GTAE provides a more flexible and computationally efficient solution that allows operations in the latent space and evolves the corresponding style-related parts. Recently [30] also investigated the incorporation of linguistic constraints into the modeling process. Their work relies on noun consistency for content preservation while ours resorts to dependency parse trees instead; the two are complementary to each other.

c) Graph Neural Network:
Graphs provide a flexible data representation for many tasks, especially for natural language, which contains highly structured syntactic information in spite of its sequential format. Though the concept of a graph neural network was proposed much earlier in [31], graph-level transformations with cutting-edge deep learning techniques remained under-explored until recent years. Early works attempt to incorporate the success of convolutional networks into this field by defining convolutions on groups of node neighbors [32], [33], [34], [35]. More recently, research on graph modeling has put more focus on attention mechanisms [36] to enhance the model capacity. [37] proposed the graph attention network to evolve node embeddings by attending to their spatial neighbors. The transformer-based models proposed in [38] and [39] are most similar to ours, while our proposed GTAE allows information flow both within the graph itself and across different graphs. Besides, we further design a learnable global style node in our self-graph transformer to fit the goal of non-parallel text style transfer.

III. GRAPH-TRANSFORMER BASED AUTO-ENCODER
A. Formulation
Non-parallel text style transfer targets generating a new sentence y with a specified style attribute s_y from the conditional distribution p(y|x, s_y) while preserving the same content as the original sentence x. The difficulty of this task lies in the lack of parallel training corpora due to the high labor cost of transferring sentences manually. Suppose we observe two datasets X = {x^(i) | i ∈ [1, N_x]} and Y = {y^(j) | j ∈ [1, N_y]} with different style attributes s_x and s_y respectively. Without paired information to provide a training loss, we want to estimate the two distributions p(y|x, s_y) and p(x|y, s_x). Previous attempts include using auto-encoder based models to reconstruct y by manipulating style embeddings in the latent space. However, these models either make overly simplified assumptions on the latent embeddings or fail to constrain the language space effectively, which could be a reason for their unsatisfactory content preservation.

Considering that language by its nature is restricted to a limited feature space due to grammar rules and human logic, we propose a novel framework named Graph-Transformer based Auto-Encoder (GTAE) to effectively leverage this intrinsic linguistic constraint. We start by giving an overview of this framework, and then go into the details of each component in the next section. First of all, we observe that linguistic structures, either extracted by dependency parsers or predicted by information extraction methods, can be sufficiently represented as a graph G_x = (V, E), where the node set V of the linguistic graph contains entities extracted from the sentence, and E is the edge set indicating the relationship between each node pair.
Adding linguistic constraints on the latent space thus can be formulated as transforming sentences into linguistic graphs first, and then performing feature extraction, style transfer, and reconstruction at the graph level:

p(y|x, s_y) = ∫_{G_x} p(y|G_x, s_y) p(G_x|x) dG_x    (1)

            = ∫_{G'_y} ∫_{G_x} p(y|G'_y, G_x) p(G'_y|G_x, s_y) p(G_x|x) dG_x dG'_y.    (2)

Equation 2 implies a graph-level encoder-decoder framework with an extra linguistic graph constructor (LGC), as depicted in Fig. 2. In this work, we simplify the integral operations by modeling the three components as deterministic functions. Specifically, we use a publicly available dependency parser for p(G_x|x) to extract the linguistic graph of each sentence. A self-graph transformer (SGT) is proposed for the graph encoder p(G'_y|G_x, s_y) to extract and transfer graph-level latent embeddings given the extracted linguistic graph and the desired style attribute. And we further design a cross-graph transformer (CGT) with a simple rephraser to model the graph decoder p(y|G'_y, G_x), which reconstructs the sentence from the transferred graph embeddings G'_y while conditioning on the original graph G_x.
Fig. 2. Implementation of the proposed Graph-Transformer based Auto-Encoder (GTAE). The intrinsic linguistic graph G_x is extracted by a linguistic graph constructor (LGC), followed by a self-graph transformer (SGT) with a certain style attribute s_x/s_y to obtain the latent graph embeddings G'_x/G'_y. A cross-graph transformer (CGT) that takes the transferred graph G'_x/G'_y and the original graph G_x as input is then used for reconstruction with a rephraser. Two classifiers D_g and D_s are used at the graph and sentence levels respectively to back-propagate transfer signals.

B. Module Details

a) Linguistic Graph Constructor:
Due to the poor performance of current triplet-based information extraction systems for texts, we adopt a widely-used linguistic dependency parser [20] instead to construct the latent graph that represents the intrinsic content and logic of source sentences. For a sentence x = {t_i | i ∈ [1, k]} with k tokens, its corresponding linguistic graph G_x can be represented as the set of token-level nodes V ∈ R^{d_n × k} together with its adjacency matrix E ∈ R^{d_e × k × k} built from the grammar rules, where d_n and d_e are the feature sizes of node and edge embeddings respectively. An illustration of the constructed linguistic graph is provided in Fig. 1. Since the goal of this work is to explore the research value of imposing such graph constraints rather than fine-tuning algorithmic details, we currently use a binarized version of the edges with d_e = 1 for simplicity. Mapping the edges to a higher-level feature space and treating them differently in the following self/cross-graph transformers remains a future direction of this work.

b) Self-Graph Transformer: [29] pointed out that the style of a sentence is usually embodied in certain words or phrases. The procedure in their paper highly matches the graph structure proposed in our work, deleting and retrieving certain nodes using a rule-based n-gram style classifier. In contrast to this hard-manipulation approach, our method lets the graph nodes evolve progressively by attending to their neighbors and a global style node embedding, which offers more flexibility and better computational efficiency. In particular, inspired by the recent success of attention-based graph neural networks, we propose a self-graph transformer (SGT) to encode G_x = {V_x, E_x} given the specified style attribute, as depicted in Fig. 3a. We basically follow the architecture of the transformer proposed in [36]. To incorporate the linguistic graph information, we restrict the attention mechanism to be effective only within the sets of connected graph nodes.
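As a toy illustration (not the authors' code; the dependency edges are hard-coded here, standing in for the parser output), the binarized adjacency matrix and the neighbor sets that the attention is restricted to might be built as follows:

```python
import numpy as np

# Toy sketch: build the binarized adjacency matrix E (d_e = 1) for the
# sentence "I really like chili" from Fig. 1. The edges would normally
# come from the StanfordCoreNLP dependency parser; here they are
# hard-coded for clarity.
tokens = ["I", "really", "like", "chili"]
dep_edges = [(2, 0), (2, 1), (2, 3)]   # "like" governs "I", "really", "chili"

k = len(tokens)
E = np.zeros((k, k), dtype=np.int64)
for head, dep in dep_edges:
    E[head, dep] = 1
    E[dep, head] = 1                   # treat the linguistic graph as undirected

# Each token may attend only to its linguistic neighbors.
neighbors = {i: [j for j in range(k) if E[i, j] > 0] for i in range(k)}
```

In this example the root "like" is connected to all other tokens, while "really" can attend only to "like", mirroring the adjacency shown in Fig. 1.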
Besides, a learnable global-level style node v' ∈ {s_x, s_y}, which is connected to all nodes, is applied to provide style preservation or transfer signals during training and inference. Specifically, we use an MLP layer to transform the discrete style labels into a d_n-dimensional latent feature and set the edges between this style feature and the tokens to 1. Suppose V_x = {v_x^(i) | i ∈ [1, k]} and E_x = {e_ij | i, j ∈ [1, k]}. At each layer, our graph attention with a residual function for each node reads:

v_x^(i) = v_x^(i) + H{σ[(W_q v_x^(i))^T (W_k V_x^(i)) / √d_k] (W_v V_x^(i))},    (3)

V_x^(i) = {v', v_x^(j) | e_ij > 0},    (4)

where W_q, W_k ∈ R^{d_k × d_n} and W_v ∈ R^{d_v × d_n} are the transformation matrices that map node embeddings into query, key and value embeddings respectively. V_x^(i) ∈ R^{d_n × |V_x^(i)|} denotes the neighbor set of v_x^(i). σ is a softmax function to normalize the attention weights and H is the multi-head attention function that learns and concatenates node representations in multiple subspaces. The proposed graph attention block is then followed by two layer-norm operations with a residual position-wise feed-forward network in between to extract higher-level embeddings of each node.

A graph-level style classifier D_g is adopted to transfer the node embeddings given the desired style at the graph level. We do not use the adversarial strategy due to its instability in the training phase. Instead, we train the classifier itself and its style transfer ability directly with the following two training objectives:

L^c_clas,g = E_{G'_x ∼ p(G'_x|G_x, s_x)}[−log D_g(s_x|G'_x)]    (5)
           + E_{G'_y ∼ p(G'_y|G_y, s_y)}[−log D_g(s_y|G'_y)],    (6)

L^t_clas,g = E_{G'_y ∼ p(G'_y|G_x, s_y)}[−log D_g(s_y|G'_y)]    (7)
           + E_{G'_x ∼ p(G'_x|G_y, s_x)}[−log D_g(s_x|G'_x)].    (8)
c) Cross-Graph Transformer: The graph decoder p(y|G'_y, G_x) derived in Equation 2 takes two graphs as input. An ideal y' generated from this distribution should borrow both the style information from G'_y and the original content from G_x. To this end, we design a cross-graph transformer (CGT) to update the node embeddings of G'_y by attending to G_x under the linguistic constraints, followed by a simple RNN rephraser to reconstruct the final transferred sentence, as illustrated in Fig. 3b.

The key difference that distinguishes our proposed CGT from other counterparts is that a cross-graph attention mechanism is applied to allow information flow from the source graph for better semantic and logic reconstruction. Assuming G'_y and G_x share the same linguistic graph structure, we have G'_y = {V'_y, E_x} with V'_y = {v'_y^(i) | i ∈ [1, k]}. Then the cross-graph attention reads:

v'_y^(i) = v'_y^(i) + H{σ[(W'_q v'_y^(i))^T (W'_k V_x^(i)) / √d_k] (W'_v V_x^(i))},  V_x^(i) = {v_x^(j) | e_ij > 0}.    (9)

The reconstruction loss for the proposed graph-level encoder-decoder is:

L_rec = E_{G'_x ∼ p(G'_x|G_x, s_x)}[−log p(x|G'_x, G_x)]    (10)
      + E_{G'_y ∼ p(G'_y|G_y, s_y)}[−log p(y|G'_y, G_y)].    (11)
TABLE I
STATISTICS OF THE THREE DATASETS

Dataset            Style        Size      Vocab
Yelp Sentiment     positive     382,917   9,357
                   negative     256,026
Political Slant    democratic   170,423   11,441
                   republican   176,011
Paper-News Title   papers       95,905    19,072
                   news         104,633
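For illustration, the linguistically masked graph attention at the core of both the SGT (Eqs. 3-4) and the CGT (Eq. 9) can be sketched in a single-head NumPy form. This is not the authors' implementation: the multi-head function H, layer normalization and the feed-forward network are omitted, and the weights are random.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def graph_attention(Vq, Vkv, E, Wq, Wk, Wv, style=None):
    """Single-head sketch of the masked graph attention.
    Self-graph (Eqs. 3-4): Vq = Vkv = V_x, with a global style node.
    Cross-graph (Eq. 9):   Vq = V'_y, Vkv = V_x, without a style node.
    Vq, Vkv: (d_n, k) node embeddings; E: (k, k) binarized adjacency."""
    d_k, k = Wq.shape[0], Vq.shape[1]
    out = Vq.copy()
    for i in range(k):
        nbrs = [j for j in range(k) if E[i, j] > 0]      # linguistic constraint
        cols = ([style] if style is not None else []) + [Vkv[:, j] for j in nbrs]
        if not cols:
            continue                                     # isolated node: no update
        Vi = np.column_stack(cols)                       # neighbor set V_x^(i)
        attn = softmax((Wq @ Vq[:, i]) @ (Wk @ Vi) / np.sqrt(d_k))
        out[:, i] = Vq[:, i] + (Wv @ Vi) @ attn          # residual update, Eq. (3)/(9)
    return out

rng = np.random.default_rng(0)
d_n, k = 8, 4
V = rng.standard_normal((d_n, k))
E = np.ones((k, k)) - np.eye(k)                          # toy fully connected graph
style = rng.standard_normal(d_n)                         # global style node embedding
W = [rng.standard_normal((d_n, d_n)) for _ in range(3)]
V_sgt = graph_attention(V, V, E, *W, style=style)        # SGT-style update
V_cgt = graph_attention(V_sgt, V, E, *W)                 # CGT-style update
```

The only structural difference between the two calls mirrors the paper: the SGT prepends the style node to every neighbor set, while the CGT draws its keys and values from the source graph G_x.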
To ensure the style transfer intensity after reconstruction, we further adopt a sentence-level style classifier D_s, trained in the same way as D_g with the following objectives:

L^c_clas,s = E_{x ∼ X}[−log D_s(s_x|x)]    (12)
           + E_{y ∼ Y}[−log D_s(s_y|y)],    (13)

L^t_clas,s = E_{y' ∼ p(y|G'_y, G_x)}[−log D_s(s_y|y')]    (14)
           + E_{x' ∼ p(x|G'_x, G_y)}[−log D_s(s_x|x')].    (15)

d) Training Scheme: Our proposed GTAE is trained in an end-to-end manner. A Gumbel-softmax continuous approximation is adopted in the RNN rephraser to overcome the discreteness problem of text outputs. We first pretrain the model for a few warm-up epochs by iteratively backpropagating the reconstruction loss L_rec and the joint classification loss L^c_clas,g + L^c_clas,s. Then in the training phase, we instead iteratively backpropagate the transfer loss L_rec + λ_g L^t_clas,g + λ_s L^t_clas,s for training the SGT, CGT and rephraser modules, and the classification loss L^c_clas,g + L^c_clas,s for further training the classifiers. λ_g and λ_s are hyperparameters that control the overall transfer strength. It is worth noting that though we consider binary-style transfer in this paper, our method can be directly extended to multi-style transfer, where we can add corresponding classification and reconstruction losses to L^c_clas,g, L^c_clas,s, and L_rec. For the transfer losses, we can either add combinatorial transfer losses to L^t_clas,g and L^t_clas,s, or iteratively select two styles and conduct training in the same manner as the binary case.

IV. EXPERIMENTS
We verify the effectiveness of our proposed method on three non-parallel text style transfer datasets: Yelp restaurant reviews, political slant comments and paper-news titles. Following a recent work on text style transfer evaluation [40], we compare our approach with three representative baselines [11], [16], [29], i.e. the cross-aligned auto-encoder (CAAE), the adversarially regularized autoencoder (ARAE) and the delete-retrieve-generate model (DAR), across all three datasets. For the most representative Yelp dataset w.r.t. sentiment transfer, we further conducted a more thorough comparison with a list of state-of-the-art methods. Our proposed GTAE achieves the best content preservation ability as expected, while maintaining highly comparable performance on naturalness and style transfer intensity with the state of the art. The model is implemented using the Texar [41] toolkit for text generation based on the Tensorflow backend [42].
Fig. 3. Model architectures of the proposed (a) self-graph transformer and (b) cross-graph transformer. Arrows between the attention module and the graph structure represent the information flow within/between the graphs constrained by the binarized linguistic adjacency matrix. LN and FFN denote layer normalization and feed-forward network respectively, and N_T is the layer number.

A. Datasets
For the sentiment transfer task, we use the Yelp restaurant reviews dataset released in [11]. The political slant dataset is retrieved from top-level comments on Facebook posts from members of the United States Senate and House [18]. We also report results on the dataset of paper-news title types released in [12]. Following [11], we filter out sentences with length larger than 15 and replace words that occur less than 5 times with an <unk> token.
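This preprocessing might look as follows (an illustrative sketch with a toy corpus and hypothetical variable names; the authors' exact pipeline may differ):

```python
from collections import Counter

# Illustrative preprocessing sketch following the setup of [11]:
# drop sentences longer than 15 tokens and replace words occurring
# fewer than 5 times with an <unk> token. The toy corpus below is
# invented for demonstration.
MAX_LEN, MIN_COUNT = 15, 5

corpus = [["i", "really", "like", "chili"]] * 5 \
       + [["the", "chili", "was", "terrible"]]
counts = Counter(tok for sent in corpus for tok in sent)

def preprocess(sentences):
    kept = [s for s in sentences if len(s) <= MAX_LEN]       # length filter
    return [[t if counts[t] >= MIN_COUNT else "<unk>" for t in s]
            for s in kept]

processed = preprocess(corpus)
# rare words ("the", "was", "terrible") become <unk>; frequent ones survive
```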
B. Evaluation Metrics
A successful text style transfer model normally requires capacities in three aspects: content preservation, style transfer intensity and naturalness. Content preservation aims at minimizing the semantic information loss of style-irrelevant components of original sentences, while style transfer intensity measures the correctness of transferred sentences in terms of the given styles. To ensure that the generated sentences are human-readable and do not look machine-generated, the evaluation of sentence naturalness is also required. Since human evaluation is quite labor-intensive and time-consuming, automatic evaluation of the above aspects is desired for comparing methods on this task.

However, to the best of our knowledge, there is no consensus on automatic evaluation metrics for text style transfer. The main difficulty lies in the lack of parallel corpora. Besides, even with labelled data, direct comparison between generated results and human-transferred sentences is still an ill-posed problem, since there usually exist multiple ways to express a sentence with a specific style. Recently, [40] systematically studied this problem by evaluating possible metrics for text style transfer based on their correlation coefficients with human ratings. They proposed to use EMD [43], [44], WMD [45] and adversarial naturalness for the three aspects mentioned above. In this work, we basically follow the practice in [40] and report these metrics in the results.

a) Style Transfer Intensity:
To determine the degree to which a sentence is successfully transferred, previous works relied on a pre-trained classifier to calculate the empirical transfer accuracy. However, the binarized classifier used in this strategy can miss the subtle differences between transferred sentences when they exhibit varied transfer intensities but are labelled with the same value without discrimination. This problem is overcome in EMD by measuring the style distribution difference instead, which also leads to WMD for content preservation.

b) Content Preservation:
BLEU [46] is probably the most popular metric for content preservation evaluation, measuring the n-gram similarity w.r.t. original sentences. [29] proposed to calculate BLEU w.r.t. human-transferred sentences instead and released small test datasets with human labels for that purpose. Nevertheless, the fact that possible transferred sentences are usually not unique and the high cost of human labor limit the usage of their method. WMD provides a more dedicated distribution-based metric by calculating the minimum distance between word embeddings. To remove style-relevant parts of the sentences to compare, we build a style lexicon by training a regression model and mask style-related words in the WMD calculation as suggested by [40]. In addition, we report a recently published metric, BERTSCORE [47], which claims higher-level semantic alignment, as an alternative measurement for comparison.

c) Naturalness:
Perplexity calculated from pre-trained language models provides a solution for naturalness measurement, but it is not necessarily a gold standard. The adversarial naturalness proposed in [40] first trains a classifier to distinguish human-generated texts from machine-generated texts. Naturalness then reduces to the ability to fool this classifier. [40] showed that adversarial naturalness serves as a better measurement in terms of its correlation coefficient w.r.t. human evaluation. In our experiments, we use the three pre-trained classifiers released in [40] and report the corresponding adversarial naturalness scores respectively.
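The style-masking step applied before the WMD computation described above can be illustrated as follows (the style lexicon is hard-coded here for clarity; in the paper it is learned with a regression model following [40]):

```python
# Illustrative sketch of the style-masking step (not the authors' code):
# style-related words are replaced by a placeholder so that the masked
# WMD compares only style-irrelevant content. The lexicon below is a
# hard-coded stand-in for the learned one.
style_lexicon = {"like", "dislike", "love", "hate", "great", "terrible"}

def mask_style(tokens, lexicon=style_lexicon):
    """Replace style-bearing tokens with <mask> before computing WMD."""
    return [t if t.lower() not in lexicon else "<mask>" for t in tokens]

src = "I really like chili".split()
out = "I really dislike chili".split()
# After masking, the two sentences are identical, so the masked WMD
# between them is zero, i.e. perfect content preservation.
assert mask_style(src) == mask_style(out)
```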
C. Experimental Settings
We use the StanfordCoreNLP toolbox [20] to extract the linguistic structure of each sentence. For the SGT and CGT modules, the numbers of heads and layers are 8 and 2 respectively. We choose 512 as the size of all hidden embeddings. The rephraser is implemented as a single-layer GRU encoder-decoder for final sentence generation. The TextCNN model [48] is used for both classifiers. We pretrain the model for 10 epochs, followed by two or three extra training epochs, depending on validation errors. We choose λ_g and λ_s empirically. We set the temperature anneal rate for the Gumbel-softmax approximation to 0.5. The model is trained using the Adam optimizer [49] with an initial learning rate of 5e-4. Batch size is empirically set to 128 for all three datasets. We report results on 1000 randomly selected test samples due to the heavy computational cost of the retrieval process in DAR. All the training and evaluation source code as well as the results are released on this page.

D. Results
This section presents the experimental results with respect to the evaluation metrics. Specifically, in Table II, WMD is the masked Word Mover's Distance with style-related tokens removed, which serves as the major metric for content preservation evaluation. B-P, B-R and B-F1 denote the precision, recall and F1 score of BERTSCORE, which are rescaled as suggested in [47] for better illustration. We also report the BLEU scores between the original and transferred sentences. A smaller WMD means the transferred sentence is closer to the original one, while higher BLEU and BERTSCORE metrics indicate better content preservation ability. N-A, N-C and N-D are the adversarial naturalness scores calculated from the three released classifiers trained on ARAE, CAAE and DAR respectively, where a higher score means a higher probability of being "natural" [40]. ACCU and EMD denote the empirical transfer accuracy and the Earth Mover's Distance based on a pretrained CNN classifier. Higher ACCU and EMD scores indicate a stronger transfer intensity. For the Yelp dataset, we evaluated the results released in [24] for DualRL [24], UnsuperMT [23], UnpairedRL [22], and the DAR-related methods [29]; the results released in [26] for IMaT [26], CAAE [11], Multi_Decoder [12], and Style_Emb [12]; and the results of the StyleTrans [27] and FineGrained [25] methods that are released on their own websites. Note that for the FineGrained results, the authors only released a subset containing 500 rather than 1000 sentences. We also trained ARAE [16] under its default settings on the Yelp dataset. For the political and title datasets, we trained CAAE, ARAE and the DAR-related methods using their released source code under the default settings as well.

a) Yelp Sentiment Dataset:
Table II shows the evaluation results on the Yelp sentiment dataset, where the methods are ranked by the masked WMD metric (the lower the better). As expected, our model achieves the best content preservation performance in terms of the masked WMD, BLEU, and BERT-Recall scores, while maintaining reasonable transfer intensity and naturalness scores (our source code and results are available at https://github.com/SenZHANG-GitHub/GTAE). It is perhaps not surprising that the retrieval-based DAR achieves the best naturalness scores and also a high transfer intensity, since real human-written sentences are retrieved as results. However, the retrieval-based strategies sacrifice a lot w.r.t. content preservation ability and take around five seconds for the extra retrieval process during inference, which is practically unacceptable for real-world applications. UnsuperMT achieves the best transfer intensity scores, but also at the price of content preservation ability and language naturalness. In terms of content preservation, the style transformer with a multi-class discriminator is the method closest to ours, while our model outperforms it w.r.t. naturalness and transfer intensity. It can also be seen that the difference in EMD can be smaller than that in transfer accuracy, supporting that EMD serves as a more dedicated metric by leveraging the style distribution information. Fig. 4 gives an example of sentiment transfer results and illustrates how our model works by leveraging the linguistic graph.

b) Political Slant Dataset: The evaluation results on the political slant dataset are similar to the sentiment transfer results, as shown in Table III. Our proposed GTAE still performs significantly better than others in preserving the style-irrelevant contents while maintaining competitive results on naturalness and transfer intensity. Besides, the unsatisfactory content preservation capability of ARAE might result from the unstable adversarial training process.
In comparison, our method avoids the adversarial strategy by directly back-propagating style transfer signals at both the graph and sentence levels.

c) Paper-News Title Dataset:
This task is more challenging due to the larger semantic domain misalignment between paper and news titles, which explains the worse content preservation capabilities of all models, as shown in Table IV. Moreover, since the released adversarial naturalness classifiers are trained on Yelp review data, the domain misalignment between online reviews and formal titles also leads to a reduction of the naturalness scores. Our model still reaches the best content preservation performance over the three baselines, with comparable naturalness and transfer intensity.
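The EMD scores reported in Tables II-IV compare the style classifier's predicted distribution against the target style, following the Earth Mover's Distance of [43], [44]. The paper does not spell out the exact formula, so the sketch below is one standard formulation under the assumption that, for one-dimensional discrete distributions on a shared ordered support, the EMD reduces to the L1 distance between the cumulative distribution functions; the function names are illustrative, not from the released code.

```python
from itertools import accumulate

def emd_1d(p, q):
    # EMD between two discrete distributions on the same ordered support;
    # for 1-D histograms it equals the L1 distance between the CDFs.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def transfer_intensity(style_probs, target_label, n_classes=2):
    # Hypothetical EMD-based transfer intensity: average closeness of the
    # classifier's predicted style distributions to the target one-hot.
    target = [1.0 if c == target_label else 0.0 for c in range(n_classes)]
    return 1.0 - sum(emd_1d(p, target) for p in style_probs) / len(style_probs)
```

For binary sentiment this degenerates gracefully: a sentence predicted as [0.1, 0.9] against target label 1 contributes an EMD of 0.1, so near-certain transfers push the intensity score toward 1.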
E. Ablation Analysis
We further examine the effect of each component of the proposed model on the yelp sentiment dataset.

a) Effect of the Linguistic Graph Constraint:
We first examine the effect of the linguistic graph constraint by setting the adjacency matrix used in SGT and CGT to the identity matrix, leading to the models SGT-I and CGT-I respectively. The model that sets both adjacency matrices in SGT and CGT to the identity matrix is also evaluated (SGT-CGT-I). We report the results in Table V. Without the linguistic graph constraint, both the content preservation and transfer intensity abilities degrade. The linguistic constraints in SGT and CGT are more crucial for content preservation and style transfer, respectively.

b) Effect of the Graph- and Sentence-level Losses:
We next study the effect of the graph- and sentence-level losses. We remove L_{c-clas,s} and L_{c-clas,g} when training the classifiers, leading to the two models c-clas-g-only and c-clas-s-only respectively.

TABLE II
SENTIMENT TRANSFER RESULTS ON THE TEST SET.
Columns: content preservation (mWMD, BLEU, B-P, B-R, B-F1), naturalness (N-A, N-C, N-D), transfer intensity (ACCU, EMD).

Model | mWMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
GTAE (ours) | 0.1027 | 64.83 | | | | | | | |
UnpairedRL [22] | 0.3122 | 46.09 | 0.4504 | 0.4709 | 0.4612 | 0.7136 | 0.9035 | 0.6493 | 0.5340 | 0.4989
DAR_Template [29] | 0.4156 | 57.10 | 0.4970 | 0.5406 | 0.5185 | 0.6370 | 0.8984 | 0.6299 | 0.8410 | 0.7948
DAR_DeleteOnly [29] | 0.4538 | 34.53 | 0.4158 | 0.4823 | 0.4490 | 0.6345 | 0.9072 | 0.5511 | 0.8750 | 0.8297
DAR_DeleteRetrieve [29] | 0.4605 | 36.72 | 0.4268 | 0.4799 | 0.4534 | 0.6564 | 0.9359 | 0.5620 | 0.9010 | 0.8550
CAAE [11] | 0.5130 | 20.74 | 0.3585 | 0.3825 | 0.3710 | 0.4139 | 0.7006 | 0.5999 | 0.7490 | 0.7029
IMaT [26] | 0.5571 | 16.92 | 0.4750 | 0.4292 | 0.4501 | 0.4878 | 0.8407 | 0.6691 | 0.8710 | 0.8198
Multi_Decoder [12] | 0.5799 | 24.91 | 0.3117 | 0.3315 | 0.3223 | 0.4829 | 0.8394 | 0.6365 | 0.6810 | 0.6340
FineGrained-0.7 [25] | 0.6239 | 11.36 | 0.4023 | 0.3404 | 0.3717 | 0.3665 | 0.7125 | 0.5332 | 0.3960 | 0.3621
FineGrained-0.9 [25] | 0.6251 | 11.07 | 0.4030 | 0.3389 | 0.3713 | 0.3668 | 0.7148 | 0.5231 | 0.4180 | 0.3926
FineGrained-0.5 [25] | 0.6252 | 11.72 | 0.3994 | 0.3436 | 0.3718 | 0.3606 | 0.7254 | 0.5395 | 0.3280 | 0.2895
BackTranslation [18] | 0.7566 | 2.81 | 0.2405 | 0.2024 | 0.2220 | 0.3686 | 0.5392 | 0.4754 | 0.9500 | 0.9117
Style_Emb [12] | 0.8796 | 3.24 | 0.0166 | 0.0673 | 0.0429 | 0.5788 | 0.9075 | 0.6450 | 0.4490 | 0.4119
DAR_RetrieveOnly [29] | 0.8990 | 2.62 | 0.1368 | 0.1818 | 0.1598 | | | | |
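The masked WMD that ranks Table II removes style-related tokens before computing the Word Mover's Distance. As a minimal illustrative sketch (not the paper's implementation): exact WMD solves an optimal-transport problem, so this sketch substitutes the cheap relaxed nearest-neighbor lower bound of [45], and the embedding table and token lists here are toy assumptions.

```python
def masked_relaxed_wmd(doc1, doc2, emb, style_tokens=()):
    # Drop style-related tokens, then score the content distance between
    # the two token lists with the relaxed WMD lower bound: every word
    # simply travels to its nearest counterpart in the other document.
    d1 = [w for w in doc1 if w not in style_tokens]
    d2 = [w for w in doc2 if w not in style_tokens]
    if not d1 or not d2:
        return 0.0

    def dist(a, b):  # Euclidean distance between word embeddings
        return sum((x - y) ** 2 for x, y in zip(emb[a], emb[b])) ** 0.5

    def one_way(src, dst):
        return sum(min(dist(w, v) for v in dst) for w in src) / len(src)

    return max(one_way(d1, d2), one_way(d2, d1))
```

With a style lexicon such as {"great", "awful"}, a pair like "the food was great" / "the food was awful" scores 0 under this masked distance, which is exactly the behavior wanted from a content-preservation metric that ignores the transferred style words.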
Input : she was absolutely fantastic and i love what she did !
ARAE : she was absolutely awful and i did n't like it .
CAAE : she was just so glad and i did not know what she !
DAR : she was clearly did not seem to do anything to me there !
GTAE : she was absolutely rude and i shame what she did !

Fig. 4. An example of sentiment transfer results. The style is supposed to be transferred from positive to negative. By leveraging the established linguistic graph, our model identifies and transfers the style-related graph nodes while preserving the overall graph structure.

TABLE III
POLITICAL SLANT TRANSFER RESULTS ON THE TEST SET.
Columns: content preservation (WMD, BLEU, B-P, B-R, B-F1), naturalness (N-A, N-C, N-D), transfer intensity (ACCU, EMD).

Model | WMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
GTAE (ours) | 0.1506 | 65.61 | 0.6577 | 0.6706 | 0.664 | 0.3310 | | | |
ARAE [16] | 1.0347 | 2.95 | 0.0203 | 0.0117 | 0.0158 | 0.3092 | 0.7763 | 0.7333 | 0.944 | 0.9412
TABLE IV
PAPER-NEWS TITLE TRANSFER RESULTS ON THE TEST SET.
Columns: content preservation (WMD, BLEU, B-P, B-R, B-F1), naturalness (N-A, N-C, N-D), transfer intensity (ACCU, EMD).

Model | WMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
GTAE (ours) | 0.5161 | 19.67 | 0.1923 | 0.2134 | 0.2034 | | | | |
ARAE [16] | 1.0253 | 0.00 | -0.0447 | -0.0539 | -0.0486 | 0.2318 | 0.6061 | | |
Besides, we also remove L_{t-clas,s} and L_{t-clas,g} when training the style transfer ability, leading to the models t-clas-g-only and t-clas-s-only, respectively. Results are given in Table VI, which shows that a properly trained sentence-level classifier is crucial for the style transfer ability. Moreover, the transfer signals from both the graph and sentence levels are important for achieving a good style transfer performance.

TABLE V
ABLATION RESULTS OF THE LINGUISTIC GRAPH ON THE YELP DATASET.

Model | WMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
GTAE-full | 0.1027 | 64.83 | 0.6991 | 0.7303 | 0.7149 | 0.6178 | 0.9272 | 0.6644 | 0.887 | 0.8505
SGT-I | 0.1394 | 61.10 | 0.6591 | 0.6735 | 0.6667 | 0.6692 | 0.9313 | 0.7141 | 0.881 | 0.8437
CGT-I | 0.1211 | 62.71 | 0.6772 | 0.6982 | 0.6878 | 0.6347 | 0.9297 | 0.6164 | 0.871 | 0.8368
SGT-CGT-I | 0.1369 | 64.77 | 0.6809 | 0.7055 | 0.6935 | 0.7509 | 0.9681 | 0.7193 | 0.843 | 0.8038

c) Effect of the Warm-up Phase:
The warm-up phase prepares the pretrained graph- and sentence-level classifiers and a basic reconstruction ability for the following training phase. We study the effect of the number of pretrain epochs in this section and summarize the results in Table VII. While the content preservation ability is relatively robust to the number of pretrain epochs, a certain level of pretraining is essential to achieve a satisfactory transfer intensity performance.

V. CONCLUSION
In this work, we propose to impose intrinsic linguistic constraints on the model space of natural language to achieve a better content and logic preservation ability for non-parallel text style transfer. We formulate this procedure as a graph-level encoder-decoder process and propose a Graph-Transformer based Auto-Encoder (GTAE) with two novel modules, namely the Self-Graph Transformer (SGT) and the Cross-Graph Transformer (CGT), for better manipulation of the latent graph embeddings. Experiments on three non-parallel text style transfer datasets demonstrate that our model achieves the best content preservation performance, while maintaining state-of-the-art transfer intensity and naturalness. Besides, our proposed GTAE is highly compatible with other methods. Potential future works include incorporating language models as discriminators to improve naturalness and developing methods that allow varied edge types to better leverage the linguistic structures. Moreover, extending this idea to more generic text generation tasks is also a promising and exciting research direction.

TABLE VI
ABLATION RESULTS OF THE GRAPH- AND SENTENCE-LEVEL LOSSES ON THE YELP DATASET.

Model | WMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
GTAE-full | 0.1027 | 64.83 | 0.6991 | 0.7303 | 0.7149 | 0.6178 | 0.9272 | 0.6644 | 0.887 | 0.8505
c-clas-g-only | 0.0017 | 99.64 | 0.9970 | 0.9969 | 0.9970 | 0.6738 | 0.9357 | 0.6841 | 0.027 | -0.0003
c-clas-s-only | 0.1424 | 61.64 | 0.6704 | 0.6927 | 0.6818 | 0.6586 | 0.9217 | 0.7069 | 0.854 | 0.8159
t-clas-g-only | 0.0016 | 99.63 | 0.9963 | 0.9963 | 0.9963 | 0.6742 | 0.9358 | 0.6845 | 0.028 | 0.0008
t-clas-s-only | 0.0106 | 92.83 | 0.9480 | 0.9503 | 0.9492 | 0.6317 | 0.9278 | 0.6933 | 0.203 | 0.1725

TABLE VII
ABLATION RESULTS OF THE PRETRAIN EPOCHS ON THE YELP DATASET.

Model | WMD | BLEU | B-P | B-R | B-F1 | N-A | N-C | N-D | ACCU | EMD
pretrain-0 | 0.0999 | 67.43 | 0.7252 | 0.7305 | 0.7282 | 0.6288 | 0.9261 | 0.7282 | 0.798 | 0.7618
pretrain-2 | 0.0943 | 66.32 | 0.7178 | 0.7275 | 0.7229 | 0.6312 | 0.9341 | 0.6927 | 0.849 | 0.8139
pretrain-4 | 0.1598 | 60.20 | 0.6439 | 0.6749 | 0.6596 | 0.6840 | 0.9259 | 0.6424 | 0.878 | 0.8377
pretrain-6 | 0.1093 | 64.87 | 0.6957 | 0.7150 | 0.7057 | 0.6397 | 0.9580 | 0.6987 | 0.855 | 0.8160
pretrain-8 | 0.1152 | 67.00 | 0.7017 | 0.7225 | 0.7124 | 0.6645 | 0.9529 | 0.6821 | 0.836 | 0.7980
pretrain-10 | 0.1027 | 64.83 | 0.6991 | 0.7303 | 0.7149 | 0.6178 | 0.9272 | 0.6644 | 0.887 | 0.8505

REFERENCES

[1] C. Zhang and H. Wang, "Resumevis: A visual analytics system to discover semantic information in semi-structured resume data," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 1, pp. 1-25, 2019.
[2] Z. Yin, L. Cao, Q. Gu, and J. Han, "Latent community topic analysis: Integration of community discovery with topic modeling," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 4, p. 63, 2012.
[3] P. Wang, L. Ji, J. Yan, D. Dou, N. De Silva, Y. Zhang, and L. Jin, "Concept and attention-based cnn for question retrieval in multi-view learning," ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 4, p. 41, 2018.
[4] D. Thukral, A. Pandey, R. Gupta, V. Goyal, and T. Chakraborty, "Diffque: Estimating relative difficulty of questions in community question answering services," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 4, p. 42, 2019.
[5] W. Cui, J. Du, D. Wang, X. Yuan, F. Kou, L. Zhou, and N. Zhou, "Short text analysis based on dual semantic extension and deep hashing in microblog," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 4, pp. 1-24, 2019.
[6] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
[7] M.-Y. Liu and O. Tuzel, "Coupled generative adversarial networks," Neural Information Processing Systems, pp. 469-477, 2016.
[8] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694-711.
[9] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in , 2017, pp. 2242-2251.
[10] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, "Toward controlled generation of text," International Conference on Machine Learning, pp. 1587-1596, 2017.
[11] T. Shen, T. Lei, R. Barzilay, and T. S. Jaakkola, "Style transfer from non-parallel text by cross-alignment," Neural Information Processing Systems, pp. 6830-6841, 2017.
[12] Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan, "Style transfer in text: Exploration and evaluation," National Conference on Artificial Intelligence, pp. 663-670, 2018.
[13] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," International Conference on Learning Representations, 2015.
[14] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gasic, L. M. R. Barahona, P.-H. Su, S. Ultes, and S. Young, "A network-based end-to-end trainable task-oriented dialogue system," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, 2017, pp. 438-449.
[15] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," International Conference on Learning Representations, 2014.
[16] J. J. Zhao, Y. Kim, K. Zhang, A. M. Rush, and Y. LeCun, "Adversarially regularized autoencoders," in ICML 2018: Thirty-fifth International Conference on Machine Learning, 2018, pp. 9405-9420.
[17] L. Logeswaran, H. Lee, and S. Bengio, "Content preserving text generation with attribute controls," Neural Information Processing Systems, pp. 5103-5113, 2018.
[18] S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black, "Style transfer through back-translation," Meeting of the Association for Computational Linguistics, vol. 1, pp. 866-876, 2018.
[19] G. Lample, S. Subramanian, E. Smith, L. Denoyer, M. Ranzato, and Y.-L. Boureau, "Multiple-attribute text rewriting," in ICLR 2019: 7th International Conference on Learning Representations, 2019.
[20] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The stanford corenlp natural language processing toolkit," in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55-60.
[21] Z. Yi, H. Zhang, P. Tan, and M. Gong, "Dualgan: Unsupervised dual learning for image-to-image translation," in , 2017, pp. 2868-2876.
[22] J. Xu, X. Sun, Q. Zeng, X. Ren, X. Zhang, H. Wang, and W. Li, "Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach," arXiv preprint arXiv:1805.05181, 2018.
[23] Z. Zhang, S. Ren, S. Liu, J. Wang, P. Chen, M. Li, M. Zhou, and E. Chen, "Style transfer as unsupervised machine translation," arXiv preprint arXiv:1808.07894, 2018.
[24] F. Luo, P. Li, J. Zhou, P. Yang, B. Chang, Z. Sui, and X. Sun, "A dual reinforcement learning framework for unsupervised text style transfer," arXiv preprint arXiv:1905.10060, 2019.
[25] F. Luo, P. Li, P. Yang, J. Zhou, Y. Tan, B. Chang, Z. Sui, and X. Sun, "Towards fine-grained text sentiment transfer," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2013-2022.
[26] Z. Jin, D. Jin, J. Mueller, N. Matthews, and E. Santus, "Imat: Unsupervised text attribute transfer via iterative matching and translation," arXiv preprint arXiv:1901.11333, 2019.
[27] N. Dai, J. Liang, X. Qiu, and X. Huang, "Style transformer: Unpaired text style transfer without disentangled latent representation," arXiv preprint arXiv:1905.05621, 2019.
[28] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick, "Unsupervised text style transfer using language models as discriminators," Neural Information Processing Systems, pp. 7287-7298, 2018.
[29] J. Li, R. Jia, H. He, and P. Liang, "Delete, retrieve, generate: A simple approach to sentiment and style transfer," in NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2018, pp. 1865-1874.
[30] Y. Tian, Z. Hu, and Z. Yu, "Structured content preservation for unsupervised text style transfer," arXiv preprint arXiv:1810.06526, 2018.
[31] M. Gori, G. Monfardini, and F. Scarselli, "A new model for learning in graph domains," in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, vol. 2, 2005, pp. 729-734.
[32] D. K. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," Neural Information Processing Systems, pp. 2224-2232, 2015.
[33] R. A. Rossi, N. K. Ahmed, R. Zhou, and H. Eldardiry, "Interactive visual graph mining and learning," ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 5, p. 59, 2018.
[34] Z. Li, C. Lang, J. Feng, Y. Li, T. Wang, and S. Feng, "Co-saliency detection with graph matching," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 3, p. 22, 2019.
[35] J. Tang, R. Hong, S. Yan, T. Chua, G. Qi, and R. Jain, "Image annotation by knn-sparse graph-based label propagation over noisily tagged web images," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 2, p. 14, 2011.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Neural Information Processing Systems, pp. 5998-6008, 2017.
[37] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," International Conference on Learning Representations, 2018.
[38] Y. Li, X. Liang, Z. Hu, and E. Xing, "Knowledge-driven encode, retrieve, paraphrase for medical image report generation," National Conference on Artificial Intelligence, 2019.
[39] R. Koncel-Kedziorski, D. Bekal, Y. Luan, M. Lapata, and H. Hajishirzi, "Text generation from knowledge graphs with graph transformers," arXiv preprint arXiv:1904.02342, 2019.
[40] R. Mir, B. Felbo, N. Obradovich, and I. Rahwan, "Evaluating style transfer for text," arXiv preprint arXiv:1904.02295, 2019.
[41] Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, X. Yu, L. Qin, D. Wang, X. Ma, H. Liu, X. Liang, W. Zhu, D. S. Sachan, and E. P. Xing, "Texar: A modularized, versatile, and extensible toolbox for text generation," arXiv preprint arXiv:1809.00794, pp. 13-22, 2018.
[42] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: A system for large-scale machine learning," Operating Systems Design and Implementation, pp. 265-283, 2016.
[43] Y. Rubner, C. Tomasi, and L. J. Guibas, "A metric for distributions with applications to image databases," in Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), 1998, pp. 59-66.
[44] O. Pele and M. Werman, "Fast and robust earth mover's distances," in , 2009, pp. 460-467.
[45] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, "From word embeddings to document distances," in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 957-966.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: A method for automatic evaluation of machine translation," in Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[47] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "Bertscore: Evaluating text generation with bert," arXiv preprint arXiv:1904.09675, 2019.
[48] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746-1751.
[49] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization,"