Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs
Kuan-Hao Huang
University of California, Los Angeles [email protected]
Kai-Wei Chang
University of California, Los Angeles [email protected]
Abstract
Paraphrase generation plays an essential role in natural language processing (NLP), and it has many downstream applications. However, training supervised paraphrase models requires many annotated paraphrase pairs, which are usually costly to obtain. On the other hand, the paraphrases generated by existing unsupervised approaches are usually syntactically similar to the source sentences and are limited in diversity. In this paper, we demonstrate that it is possible to generate syntactically various paraphrases without the need for annotated paraphrase pairs. We propose Syntactically controlled Paraphrase Generator (SynPG), an encoder-decoder based model that learns to disentangle the semantics and the syntax of a sentence from a collection of unannotated texts. The disentanglement enables SynPG to control the syntax of output paraphrases by manipulating the embedding in the syntactic space. Extensive experiments using automatic metrics and human evaluation show that SynPG performs better syntactic control than unsupervised baselines, while the quality of the generated paraphrases is competitive. We also demonstrate that the performance of SynPG is competitive with or even better than supervised models when the unannotated data is large. Finally, we show that the syntactically controlled paraphrases generated by SynPG can be utilized for data augmentation to improve the robustness of NLP models.
1 Introduction

Paraphrase generation (McKeown, 1983) is a long-lasting task in natural language processing (NLP) and has been greatly improved by recently developed machine learning approaches and large data collections. Paraphrase generation demonstrates the potential of machines in semantic abstraction and sentence reorganization and has already been applied to many NLP downstream applications, such as question answering (Yu et al., 2018), chatbot engines (Yan et al., 2016), and sentence simplification (Zhao et al., 2018).
Figure 1: Paraphrase generation with syntactic control. Given a source sentence and a target syntactic specification (either a full parse tree or the top levels of a parse tree), the model is expected to generate a paraphrase with the syntax following the given specification.

In recent years, various approaches have been proposed to train sequence-to-sequence (seq2seq) models on a large number of annotated paraphrase pairs (Prakash et al., 2016; Mallinson et al., 2017; Cao et al., 2017; Egonmwan and Chali, 2019). Some of them control the syntax of output sentences to improve the diversity of paraphrase generation (Iyyer et al., 2018; Goyal and Durrett, 2020; Kumar et al., 2020). However, collecting annotated pairs is expensive and induces challenges for some languages and domains. On the contrary, unsupervised approaches build paraphrase models without using parallel corpora (Li et al., 2018; Roy and Grangier, 2019; Zhang et al., 2019). Most of them are based on the variational autoencoder (Bowman et al., 2016) or back-translation (Mallinson et al., 2017; Wieting and Gimpel, 2018; Hu et al., 2019). Nevertheless, without the consideration of controlling syntax, their generated paraphrases are often similar to the source sentences and are not diverse in syntax.

This paper presents a pioneering study on syntactically controlled paraphrase generation based on disentangling semantics and syntax. We aim to disentangle one sentence into two parts: 1) the semantic part and 2) the syntactic part. The semantic part focuses on the meaning of the sentence, while the syntactic part represents the grammatical structure. When two sentences are paraphrases, their semantic parts are supposed to be similar, while their syntactic parts should be different. To generate a syntactically different paraphrase of one sentence, we can keep its semantic part unchanged and modify its syntactic part.

Based on this idea, we propose Syntactically Controlled Paraphrase Generator (SynPG), a Transformer-based model (Vaswani et al., 2017) that can generate syntactically different paraphrases of one source sentence based on some target syntactic parses. SynPG consists of a semantic encoder, a syntactic encoder, and a decoder. The semantic encoder considers the source sentence as a bag of words without ordering and learns a contextualized embedding containing only the semantic information. The syntactic encoder embeds the target parse into a contextualized embedding including only the syntactic information. Then, the decoder combines the two representations and generates a paraphrase sentence. The design of disentangling semantics and syntax enables SynPG to learn the association between words and parses and to be trained by reconstructing the source sentence given its unordered words and its parse. Therefore, we do not require any annotated paraphrase pairs but only unannotated texts to train SynPG.

We verify SynPG on four paraphrase datasets: ParaNMT-50M (Wieting and Gimpel, 2018), Quora (Iyer et al., 2017), PAN (Madnani et al., 2012), and MRPC (Dolan et al., 2004).
The experimental results reveal that when being provided with the syntactic structures of the target sentences, SynPG can generate paraphrases with the syntax more similar to the ground truth than the unsupervised baselines. The human evaluation results indicate that SynPG achieves competitive paraphrase quality to other baselines while its generated paraphrases are more accurate in following the syntactic specifications. In addition, we show that when the training data is large enough, the performance of SynPG is competitive with or even better than supervised approaches. Finally, we demonstrate that the syntactically controlled paraphrases generated by SynPG can be used for data augmentation to defend against syntactic adversarial attacks (Iyyer et al., 2018) and improve the robustness of NLP models. Our code and the pretrained models are available at https://github.com/uclanlp/synpg.
We aim to train a paraphrase model without using annotated paraphrase pairs. Given a source sentence $x = (x_1, x_2, \ldots, x_n)$, our goal is to generate a paraphrase sentence $y = (y_1, y_2, \ldots, y_m)$ that is expected to maintain the same meaning as $x$ but has a different syntactic structure from $x$.
Syntactic control. Motivated by previous work (Iyyer et al., 2018; Zhang et al., 2019; Kumar et al., 2020), we allow our model to access additional syntactic specifications as the control signals to guide the paraphrase generation. More specifically, in addition to the source sentence $x$, we give the model a target constituency parse $p$ as another input. Given the input $(x, p)$, the model is expected to generate a paraphrase $y$ that is semantically similar to the source sentence $x$ and syntactically follows the target parse $p$. In the following discussions, we assume the target parse $p$ to be a full constituency parse tree. Later on, in Section 2.3, we will relax the syntax guidance to be a template, which is defined as the top two levels of a full parse tree. We expect that a successful model can control the syntax of output sentences and generate syntactically different paraphrases based on different target parses, as illustrated in Figure 1.

Similar to previous work (Iyyer et al., 2018; Zhang et al., 2019), we linearize the constituency parse tree to a sequence. For example, the linearized parse of the sentence "He eats apples." is (S(NP(PRP))(VP(VBZ)(NP(NNS)))(.)). Accordingly, a parse tree can be considered as a sequence $p = (p_1, p_2, \ldots, p_k)$, where the tokens in $p$ are non-terminal symbols and parentheses (a minimal code sketch of this linearization appears below).

Our main idea is to disentangle a sentence into the semantic part and the syntactic part. Once the model learns the disentanglement, it can generate a syntactically different paraphrase of one given sentence by keeping its semantic part unchanged and modifying only the syntactic part. Figure 2 illustrates the proposed paraphrase model, called SynPG, a seq2seq model consisting of a semantic encoder, a syntactic encoder, and a decoder. The semantic encoder captures only the semantic information of the source sentence $x$, while the syntactic encoder extracts only the syntactic information from the target parse $p$. The decoder then combines the encoded semantic and syntactic information and generates a paraphrase $y$. We discuss the details of SynPG in the following.

Figure 2: SynPG embeds the source sentence and the target parse into a semantic embedding and a syntactic embedding, respectively. Then, SynPG generates a paraphrase sentence based on the two embeddings.
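As a concrete illustration of the parse linearization described above, the following is a minimal sketch (our own illustration, not the released code) that walks an NLTK constituency tree, keeping non-terminal labels and parentheses while dropping the words:

```python
from nltk import Tree

def linearize(tree):
    """Linearize a constituency parse: keep non-terminals and parentheses,
    drop the terminal words."""
    if isinstance(tree, str):  # a leaf (word) contributes nothing
        return ""
    children = "".join(linearize(child) for child in tree)
    return "(" + tree.label() + children + ")"

parse = Tree.fromstring("(S (NP (PRP He)) (VP (VBZ eats) (NP (NNS apples))) (. .))")
print(linearize(parse))  # (S(NP(PRP))(VP(VBZ)(NP(NNS)))(.))
```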
Semantic encoder. The semantic encoder embeds a source sentence $x$ into a contextualized semantic embedding $z_{sem}$. In other words, $z_{sem} = (z_1, z_2, \ldots, z_n) = \mathrm{Enc}_{sem}((x_1, x_2, \ldots, x_n))$. The semantic embedding $z_{sem}$ is supposed to contain only the semantic information of the source sentence $x$. To separate the semantic information from the syntactic information, we use a Transformer (Vaswani et al., 2017) without the positional encoding as the semantic encoder. We posit that by removing position information from the source sentence $x$, the semantic embedding $z_{sem}$ would encode less syntactic information.

We assume that words without ordering capture most of the semantics of one sentence. Admittedly, semantics is also related to word order; for example, exchanging the subject and the object of a sentence changes its meaning. However, the decoder trained on a large corpus also captures selectional preferences (Katz and Fodor, 1963; Wilks, 1975) in generation, which enables the decoder to infer the proper order of words. In addition, we observe that when two sentences are paraphrases, they usually share similar words, especially those words related to the semantics.
For example, "What is the best way to improve writing skills?" and "How can I improve my writing skills?" are paraphrases, and the shared words (improve, writing, and skills) are strongly related to the semantics. In Section 4, we show that our designed semantic embedding captures enough semantic information to generate paraphrases.
Syntactic encoder. The syntactic encoder embeds the target parse $p = (p_1, p_2, \ldots, p_k)$ into a contextualized syntactic embedding $z_{syn}$. That is, $z_{syn} = (z_1, z_2, \ldots, z_k) = \mathrm{Enc}_{syn}((p_1, p_2, \ldots, p_k))$. Since the target parse $p$ contains no semantic information but only syntactic information, we use a Transformer with the positional encoding as the syntactic encoder.
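To make the contrast between the two encoders concrete, here is a minimal PyTorch sketch (our own illustration under stated assumptions, not the released implementation): the only difference between the two encoders is whether a positional encoding is added before the Transformer layers.

```python
import math
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Transformer encoder. The semantic encoder sets positional=False so its
    output is insensitive to word order; the syntactic encoder sets it True."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6,
                 positional=True, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.positional = positional
        if positional:
            # Standard sinusoidal positional encoding.
            pos = torch.arange(max_len).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):
        h = self.embed(token_ids)
        if self.positional:
            h = h + self.pe[: token_ids.size(1)]
        return self.layers(h)

semantic_encoder = Encoder(vocab_size=50000, positional=False)  # bag of words
syntactic_encoder = Encoder(vocab_size=200, positional=True)    # parse tokens
```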
Decoder. Finally, we design a decoder that takes the semantic embedding $z_{sem}$ and the syntactic embedding $z_{syn}$ as the input and generates a paraphrase $y$. In other words, $y = (y_1, y_2, \ldots, y_m) = \mathrm{Dec}(z_{sem}, z_{syn})$. We choose a Transformer as the decoder to generate $y$ autoregressively. Notice that the semantic embedding $z_{sem}$ does not encode position information and the syntactic embedding $z_{syn}$ does not contain semantics. This forces the decoder to extract the semantics from $z_{sem}$ and retrieve the syntactic structure from $z_{syn}$. The attention weights attaching to $z_{sem}$ and $z_{syn}$ make the decoder learn the association between the semantics and the syntax as well as the relation between the word order and the parse structures. Therefore, SynPG is able to reorganize the source sentence and use the given syntactic structure to rephrase the source sentence.

Our design of the disentanglement makes it possible to train SynPG without using annotated pairs. We train SynPG with the objective to reconstruct the source sentences. More specifically, when training on a sentence $x$, we first separate $x$ into two parts: 1) an unordered word list $\bar{x}$ and 2) its linearized parse $p_x$ (which can be obtained by a pretrained parser). Then, SynPG is trained to reconstruct $x$ from $(\bar{x}, p_x)$ with the reconstruction loss

$\mathcal{L} = -\sum_{i=1}^{n} \log P(y_i = x_i \mid \bar{x}, p_x, y_1, \ldots, y_{i-1}).$

Notice that if we do not disentangle the semantics and the syntax, and directly use a seq2seq model to reconstruct $x$ from $(x, p_x)$, it is likely that the seq2seq model only learns to copy $x$ and ignores $p_x$, since $x$ contains all the necessary information for the reconstruction. Consequently, at inference time, no matter what target parse $p$ is given, the seq2seq model always copies the whole source sentence $x$ as the output (more discussion in Section 4). On the contrary, SynPG learns the disentangled embeddings $z_{sem}$ and $z_{syn}$. This makes SynPG capture the relation between the semantics and the syntax to reconstruct the source sentence $x$. Therefore, at test time, given the source sentence $x$ and a new target parse $p$, SynPG is able to apply the learned relation to rephrase the source sentence $x$ according to the target parse $p$.
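The construction of a training example can be sketched as follows; this is our own minimal illustration of the reconstruction setup, where shuffling is one way to realize the "unordered word list" (the position-free semantic encoder ignores order anyway):

```python
import random

def make_training_example(tokens, linearized_parse):
    """Build one SynPG training triple: the unordered word list (input to the
    semantic encoder), the linearized parse (input to the syntactic encoder),
    and the original sentence as the reconstruction target."""
    bag_of_words = tokens[:]
    random.shuffle(bag_of_words)  # discard word order
    return bag_of_words, linearized_parse, tokens
```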
Word dropout. We observe that the ground-truth paraphrase may contain some words not appearing in the source sentence; however, the paraphrases generated by the vanilla SynPG tend to include only words appearing in the source sentence, due to the reconstruction training objective. To encourage SynPG to improve the diversity of the word choices in the generated paraphrases, we randomly discard some words from the source sentence during training. More precisely, each word has a probability to be dropped out in each training iteration. Accordingly, SynPG has to predict the missing words during the reconstruction, and this enables SynPG to select words different from those in the source sentence when generating paraphrases. More details are discussed in Section 4.5.
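A minimal sketch of this word dropout (our illustration; the dropout rate of 0.4 comes from the analysis in Section 4.5):

```python
import random

def word_dropout(tokens, p=0.4):
    """Randomly drop each token with probability p."""
    kept = [t for t in tokens if random.random() >= p]
    return kept if kept else [random.choice(tokens)]  # safeguard (our choice): keep at least one token
```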
In the previous discussion, we assume that a full target constituency parse tree is provided as the input to SynPG. However, the full parse tree of the target paraphrase sentence is unlikely to be available at inference time. Therefore, following the setting in Iyyer et al. (2018), we consider generating the paraphrase based on the template, which is defined as the top two levels of the full constituency parse tree. For example, the template of (S(NP(PRP))(VP(VBZ)(NP(NNS)))(.)) is (S(NP)(VP)(.)), as illustrated by the sketch below.

Motivated by Iyyer et al. (2018), we train a parse generator to generate full parses from templates. The proposed parse generator has the same architecture as SynPG, but the input and the output are different. The parse generator takes two inputs: a tag sequence $tag_x$ and a target template $t$. The tag sequence $tag_x$ contains all the POS tags of the source sentence $x$. For example, the tag sequence of the sentence "He eats apples." is "PRP VBZ NNS .".
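Extracting a template is a depth-limited version of the linearization sketched earlier (again, our own illustration rather than the released code):

```python
from nltk import Tree

def template(tree, depth=2):
    """Keep only the top `depth` levels of non-terminals."""
    if isinstance(tree, str) or depth == 0:
        return ""
    children = "".join(template(child, depth - 1) for child in tree)
    return "(" + tree.label() + children + ")"

parse = Tree.fromstring("(S (NP (PRP He)) (VP (VBZ eats) (NP (NNS apples))) (. .))")
print(template(parse))  # (S(NP)(VP)(.))
```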
The parse generator allows us to generate paraphrases by providing target templates instead of target parses. The steps to generate a paraphrase given a source sentence $x$ and a target template $t$ are as follows (see the sketch after this list):
1. Get the tag sequence $tag_x$ of the source sentence $x$.
2. Use the parse generator to generate a full parse $\tilde{p}$ with input $(tag_x, t)$.
3. Use SynPG to generate a paraphrase $y$ with input $(x, \tilde{p})$.
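Put together, the template-driven pipeline looks roughly like this; `pos_tagger`, `parse_generator`, and `synpg` are hypothetical callables standing in for the components above, not the released API:

```python
def paraphrase_with_template(x, t, pos_tagger, parse_generator, synpg):
    tags = pos_tagger(x)                   # step 1: POS tags of the source sentence
    full_parse = parse_generator(tags, t)  # step 2: expand the template into a full parse
    return synpg(x, full_parse)            # step 3: syntactically controlled paraphrase
```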
Post-processing. We notice that certain templates are not suitable for some source sentences, in which case the generated paraphrases are nonsensical. We follow Iyyer et al. (2018) and use n-gram overlap and paraphrastic similarity, computed by the model from Wieting and Gimpel (2018), to remove nonsensical paraphrases; we set the minimum n-gram overlap to 0.3 and the minimum paraphrastic similarity to 0.7.

3 Experiments

We conduct extensive experiments to demonstrate that SynPG performs better syntactic control than other unsupervised paraphrase models, while the quality of the paraphrases generated by SynPG is comparable to others. In addition, we show that the performance of SynPG is competitive with or even better than supervised models when the training data is large enough.
3.1 Datasets

For the training data, we consider ParaNMT-50M (Wieting and Gimpel, 2018), a paraphrase dataset containing over 50 million pairs of reference sentences and the corresponding paraphrases as well as quality scores (available at https://github.com/jwieting/para-nmt-50m). We select about 21 million pairs with higher quality scores as our training examples. Notice that we use only the reference sentences to train SynPG and the unsupervised paraphrase models, since we do not require paraphrase pairs.

We sample 6,400 pairs from ParaNMT-50M as the testing data. To evaluate the transferability of SynPG, we also consider three other datasets: 1) Quora (Iyer et al., 2017) contains over 400,000 paraphrase pairs, and we sample 6,400 pairs from them. 2) PAN (Madnani et al., 2012) contains 5,000 paraphrase pairs. 3) MRPC (Dolan et al., 2004) contains 2,753 paraphrase pairs.
We use paraphrase pairs to evaluate all the models. For each test paraphrase pair $(x_1, x_2)$, we consider $x_1$ as the source sentence and treat $x_2$ as the target sentence (ground truth). Let $p_2$ be the parse of $x_2$; given $(x_1, p_2)$, the model is expected to generate a paraphrase $y$ that is similar to the target sentence $x_2$.

We use the BLEU score (Papineni et al., 2002) and human evaluation to measure the similarity between $x_2$ and $y$. Moreover, to evaluate how well the generated paraphrase $y$ follows the target parse $p_2$, we define the template matching accuracy (TMA) as follows. For each ground-truth sentence $x_2$ and the corresponding generated paraphrase $y$, we get their parses ($p_2$ and $p_y$) and templates ($t_2$ and $t_y$). Then, we calculate the percentage of pairs whose $t_y$ exactly matches $t_2$ as the template matching accuracy.
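With the `template` helper sketched in Section 2.3, TMA reduces to an exact-match rate (our illustration):

```python
def template_matching_accuracy(gold_templates, generated_templates):
    """Percentage of pairs whose generated template exactly matches the
    template of the ground-truth sentence."""
    matches = sum(g == t for g, t in zip(gold_templates, generated_templates))
    return 100.0 * matches / len(gold_templates)
```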
We consider the following unsupervised paraphrase models: 1) CopyInput: a naïve baseline which directly copies the source sentence as the output without paraphrasing.
2) BackTrans: back-translation has been proposed to generate paraphrases (Mallinson et al., 2017; Wieting and Gimpel, 2018; Hu et al., 2019). In our experiments, we use the pretrained EN-DE and DE-EN translation models proposed by Ng et al. (2019) (https://github.com/pytorch/fairseq/tree/master/examples/wmt19) to conduct back-translation. Notice that training translation models requires additional translation pairs; therefore, BackTrans needs more resources than our approach, and the translation data may not be available for some low-resource languages.
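For reference, round-trip translation with the Ng et al. (2019) models can be run through fairseq's torch.hub interface; a rough sketch following the fairseq examples (treat the model identifiers and arguments as assumptions rather than a verified recipe):

```python
import torch

# Pretrained WMT19 single models from the fairseq hub.
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")

def back_translate(sentence):
    """EN -> DE -> EN round trip as an unsupervised paraphraser."""
    return de2en.translate(en2de.translate(sentence))

print(back_translate("These children are gonna die if we don't act now."))
```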
3) VAE: we consider a vanilla variational autoencoder (Bowman et al., 2016) as a simple baseline.
4) SIVAE: the syntax-infused variational autoencoder (Zhang et al., 2019) utilizes additional syntax information to improve the quality of sentence generation and paraphrase generation. Unlike SynPG, SIVAE does not disentangle the semantics and the syntax.
5) Seq2seq-Syn: we train a seq2seq model with the Transformer architecture to reconstruct $x$ from $(x, p_x)$ without the disentanglement. We use this model to study the influence of the disentanglement. 6) SynPG: our proposed model, which learns disentangled embeddings.

We also compare SynPG with the following supervised approaches:
1) Seq2seq-Sup: a seq2seq model with the Transformer architecture trained on all ParaNMT-50M pairs.
2) SCPN: the syntactically controlled paraphrase network (Iyyer et al., 2018) is a supervised paraphrase model with syntactic control trained on ParaNMT-50M pairs. We use their pretrained model (https://github.com/miyyer/scpn).

We use byte pair encoding (Sennrich et al., 2016) for tokenization and the Stanford CoreNLP parser (Manning et al., 2014) to obtain constituency parses. We set the maximum length of sentences to 40 and the maximum length of linearized parses to 160 for all the models. For the encoders and the decoder of SynPG, we use the standard Transformer (Vaswani et al., 2017) with default parameters. The word embeddings are initialized with GloVe (Pennington et al., 2014). We use the Adam optimizer with the learning rate being − and the weight decay being −. We set the word dropout probability to 0.4 (more discussion in Section 4.5). The number of training epochs is set to 5. Seq2seq-Syn and Seq2seq-Sup are trained with similar settings. We reimplement VAE and SIVAE, and all the parameters are set to the default values in the original papers.

We first discuss whether the syntactic specification enables SynPG to control the output syntax better.

Model                            ParaNMT       Quora         PAN           MRPC
                                 TMA   BLEU    TMA   BLEU    TMA   BLEU    TMA   BLEU
No paraphrasing
  CopyInput                      33.6  16.4    55.0  20.0    37.3  26.8    47.9  30.7
Unsupervised models
  BackTrans                      29.0  16.3    53.0  16.4    27.9  16.2    47.2  21.6
  VAE                            26.3   9.6    44.0   8.1    19.4   5.2    20.8   1.2
With syntactic specifications
  SIVAE                          30.0  12.8    48.3  13.1    26.6  11.8    21.5   5.1
  Seq2seq-Syn                    33.5  16.3    54.9  19.8    37.1   –       –     –
  SynPG                           –     –       –     –       –     –       –     –
Table 1: Paraphrase results on four datasets. TMA denotes the template matching accuracy, which evaluates how often the generated paraphrases follow the target parses. With the syntactic control, SynPG obtains a higher BLEU score and a higher template matching accuracy. This implies that the paraphrases generated by SynPG are more similar to the ground truths and follow the target parses more accurately.
Model         Example 1 (ParaNMT)                                 Example 2 (Quora)
Source Sent.  these children are gonna die if we don’t act now.  what are the best ways to improve writing skills?
Ground Truth  if we don’t act quickly, the children will die.    how could i improve my writing skill?
BackTrans     these children will die if we do not act now.      what are the best ways to improve your writing skills?
VAE           these children are gonna die if we don’t act now.  what are the best ways to improve writing skills?
SIVAE         these children are gonna die if we don’t act now.  what are the best ways to improve writing skills?
Seq2seq-Syn   these children are gonna die if we don’t act now.  what are the best ways to improve writing skills?
SynPG         if we don’t act now, these children will die.      how can i improve my writing skills?
Table 2: Paraphrases generated by each model. SynPG can generate paraphrases with syntax more similar to the ground truth than the other baselines.
Table 1 shows the template matching accuracy and BLEU score for SynPG and the unsupervised baselines. Notice that here we use the full parse trees as the syntactic specifications; we will discuss the influence of using templates as the syntactic specifications in Section 4.3.

Although we train SynPG on the reference sentences of ParaNMT-50M, we observe that SynPG performs well on Quora, PAN, and MRPC as well. This validates that SynPG indeed learns the syntactic rules and can transfer the learned knowledge to other datasets. CopyInput gets high BLEU scores; however, due to the lack of paraphrasing, it obtains low template matching scores. Compared to the unsupervised baselines, SynPG achieves higher template matching accuracy and higher BLEU scores on all datasets. This verifies that the syntactic specification is indeed helpful for syntactic control.

Next, we compare SynPG with Seq2seq-Syn and SIVAE. All models are given syntactic specifications; however, without the disentanglement, Seq2seq-Syn and SIVAE tend to copy the source sentence as the output and therefore get low template matching scores.

Table 2 lists some paraphrase examples generated by all models. Again, we observe that without syntactic specifications, the paraphrases generated by the unsupervised baselines are similar to the source sentences. Without the disentanglement, Seq2seq-Syn and SIVAE always copy the source sentences. SynPG is the only model that can generate paraphrases syntactically similar to the ground truths.
We perform human evaluation using Amazon Mechanical Turk to evaluate the quality of generated paraphrases. We follow the setting of previous work (Kok and Brockett, 2010; Iyyer et al., 2018; Goyal and Durrett, 2020). For each model, we randomly select 100 pairs of source sentence $x$ and the corresponding generated paraphrase $y$ from the ParaNMT-50M test set (after being post-processed as mentioned in Section 2.3) and have three Turkers annotate each pair. The annotations are on a three-point scale: 0 means $y$ is not a paraphrase of $x$; 1 means $x$ is paraphrased into $y$ but $y$ contains some grammatical errors; 2 means $x$ is paraphrased into $y$, which is grammatically correct.

The results of the human evaluation are reported in Table 3. If paraphrases rated 1 or 2 are considered meaningful, we notice that SynPG generates meaningful paraphrases at a similar frequency to that of SIVAE. However, SynPG tends to generate more ungrammatical paraphrases (those rated 1). We think the reason is that most of the paraphrases generated by SIVAE are very similar to the source sentences, which are usually grammatically correct. On the other hand, SynPG is encouraged to use syntactic structures different from the source sentences to generate paraphrases, which may lead to some grammatical errors.

Model       Rated 2  Rated 1  Rated 0  Rated 1 or 2  Hit Rate
BackTrans   63.6     22.4     14.0     86.0          11.0
SIVAE       57.6     20.3     22.0     78.0           6.5
SynPG       44.3     32.0     23.7     76.3          28.9
Table 3: Human evaluation on a three-point scale (0 = not a paraphrase, 1 = ungrammatical paraphrase, 2 = grammatical paraphrase). SynPG performs better on hit rate (defined as the percentage of generated paraphrases rated 2 and matching the target parse at the same time) than the other unsupervised models.

Furthermore, we calculate the hit rate, the percentage of generated paraphrases rated 2 and matching the target parse at the same time. The hit rate measures how often the generated paraphrases follow the target parses and preserve the semantics (verified by human evaluation) simultaneously. The results show that SynPG gets a higher hit rate than the other models.

Next, we discuss the influence of generating paraphrases by using templates instead of full parse trees. For each paraphrase pair $(x_1, x_2)$ in the test data, we consider two ways to generate the paraphrase. 1) Generating the paraphrase with the target parse: we use SynPG to generate a paraphrase directly from $(x_1, p_2)$. 2) Generating the paraphrase with the target template: we first use the parse generator to generate a parse $\tilde{p}$ from $(tag_1, t_2)$, where $tag_1$ is the tag sequence of $x_1$ and $t_2$ is the template of $p_2$; then we use SynPG to generate a paraphrase from $(x_1, \tilde{p})$. We calculate the template matching accuracy to compare these two ways to generate paraphrases, as shown in Table 4. We also report the template matching accuracy of the generated parses $\tilde{p}$.

We find that most of the generated parses $\tilde{p}$ indeed follow the target templates, which means that the parse generator usually generates good parses. Next, we observe that generating paraphrases with target parses usually performs better than with target templates. The results show a trade-off: using templates requires less effort during generation but may compromise the ability to control syntax, while using target parses requires providing more detailed parses but lets the model control the syntax better.

                                                      Template Matching Accuracy
Model                                                 ParaNMT  Quora  PAN  MRPC
Paraphrases generated by target parses                   –       –     –     –
Paraphrases generated by target templates                –       –     –     –
Parses $\tilde{p}$ generated by the parse generator      –       –     –     –

Table 4: Influence of using templates. Using templates requires less effort during generation but may compromise the ability to control syntax.

Another benefit of generating paraphrases with target templates is that we can easily generate many syntactically different paraphrases by feeding the model different templates. Table 5 lists some paraphrases generated by SynPG with different templates. We can see that most generated paraphrases are grammatically correct and have meanings similar to the original sentence.
Finally, we demonstrate that the performance of SynPG can be further improved and can even be competitive with supervised models on some datasets if we consider more training data. The advantage of unsupervised paraphrase models is that we do not require parallel pairs for training; therefore, we can easily boost the performance of SynPG by bringing more unannotated text into training.

We consider SynPG-Large, the SynPG model trained on the reference sentences of ParaNMT-50M as well as the One Billion Word Benchmark (Chelba et al., 2014), a large corpus for training language models. We sample about 24 million sentences from One Billion Word and add them to the training set. In addition, we fine-tune SynPG-Large on only the reference sentences of the testing paraphrase pairs, and call the result SynPG-FT.

From Table 6, we observe that enlarging the training set indeed improves the performance. Also, with fine-tuning, the performance of SynPG improves substantially and is even better than the performance of supervised models on some datasets. The results demonstrate the potential of unsupervised paraphrase generation with syntactic control.
The word dropout rate plays an important role for SynPG since it controls the ability of SynPG to generate new words in paraphrases. We test different word dropout rates and report the BLEU scores and the template matching accuracy in Figure 3.

Template                          Generated Paraphrase
Original                          can you adjust the cameras?
(S(NP)(VP)(.))                    you can adjust the cameras.
(SBARQ(ADVP)(,)(S)(,)(SQ)(.))     well, adjust the cameras, can you?
(S(PP)(,)(NP)(VP)(.))             on the cameras, you can adjust them?
Original                          she doesn’t keep pictures from her childhood.
(SBARQ(WHADVP)(SQ)(.))            why doesn’t she keep her pictures from childhood.
(S(‘‘)(NP)(VP)(’’)(NP)(VP)(.))    “she doesn’t keep pictures from her childhood” she said.
(S(ADVP)(NP)(VP)(.))              perhaps she doesn’t keep pictures from her childhood.
Table 5: Paraphrases generated by SynPG with different templates.
Model           ParaNMT       Quora         PAN           MRPC
                TMA   BLEU    TMA   BLEU    TMA   BLEU    TMA   BLEU
Ours
  SynPG         71.0  32.2    82.6  33.2    66.3  26.4    74.0  26.2
  SynPG-Large   70.3  31.8    83.8  34.7    66.6  27.1    79.3  36.2
  SynPG-FT       –     –      86.3   –       –     –       –     –
Supervised models
  Seq2seq-Sup   40.2  19.6    54.0  11.3    29.2  13.1    44.3  16.3
  SCPN           –     –       –     –       –     –       –     –
Table 6: Training on a larger dataset improves the performance of SynPG. Since training SynPG does not require annotated paraphrase pairs, it is possible to fine-tune SynPG on texts in the target domain. With fine-tuning, SynPG can achieve competitive or even better performance than supervised approaches.
Figure 3: Influence of the word dropout rate ((a) BLEU score; (b) template matching accuracy). Setting the word dropout rate to 0.4 achieves the best BLEU score, while a higher word dropout rate leads to better template matching accuracy.

From Figure 3a, we can observe that setting the word dropout rate to 0.4 achieves the best BLEU score on most of the datasets. The only exception is ParaNMT, which is the dataset used for training. On the other hand, Figure 3b shows that a higher word dropout rate leads to better template matching accuracy. The reason is that a higher word dropout rate gives SynPG more flexibility to generate paraphrases; therefore, the generated paraphrases can match the target syntactic specifications better. However, a higher word dropout rate also weakens SynPG's ability to preserve the meaning of source sentences. Considering all the factors above, we recommend setting the word dropout rate to 0.4 for SynPG.
Recently, much work has shown that NLP models can be fooled by different types of adversarial attacks (Alzantot et al., 2018; Ebrahimi et al., 2018; Iyyer et al., 2018; Tan et al., 2020; Jin et al., 2020). These attacks generate adversarial examples by slightly modifying the original sentences without changing their meanings, and the NLP models change their predictions on those examples, whereas a robust model is expected to output the same labels. Therefore, making NLP models unaffected by adversarial examples becomes an important task.

Since SynPG is able to generate syntactically different paraphrases, we can improve the robustness of NLP models by data augmentation. The models trained with data augmentation are thus more robust to syntactically adversarial examples (Iyyer et al., 2018), which are adversarial sentences that are paraphrases of the original sentences but with syntactic differences.

We conduct experiments on three classification tasks covered by the GLUE benchmark (Wang et al., 2019): SST-2, MRPC, and RTE. For each training example, we use SynPG to generate four syntactically different paraphrases and add them to the training set, as sketched below. We follow the setting of Iyyer et al. (2018) and generate syntactically adversarial examples with SCPN. For each testing example, we generate five candidate adversarial examples. If the classifier gives at least one wrong prediction on the candidates, we treat the attack as successful.

Model   SST-2         MRPC          RTE
        Acc.  Brok.   Acc.  Brok.   Acc.  Brok.
Base    91.9  46.7    84.1  52.8    63.2  58.3
SynPG   88.9   –       –     –       –     –

Table 7: Data augmentation improves the robustness of models. SynPG denotes the base classifier trained on data augmented by SynPG. Acc. denotes the accuracy on the original dataset (higher is better). Brok. denotes the percentage of examples whose predictions change after the attack (lower is better).

We compare the model without data augmentation (Base) and with data augmentation (SynPG) in Table 7. We observe that with the data augmentation, the accuracy before the attack is slightly worse than Base. However, after the attack, the percentage of examples changing predictions is much lower than Base, which implies that data augmentation indeed improves the robustness of models.
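A rough sketch of the augmentation loop described above; `synpg_paraphrase` and the template list are hypothetical stand-ins for the actual components and templates used in the experiments:

```python
# Hypothetical interface: synpg_paraphrase(text, template) -> paraphrase string.
TEMPLATES = ["(S(NP)(VP)(.))", "(SBARQ(WHADVP)(SQ)(.))",
             "(S(PP)(,)(NP)(VP)(.))", "(S(ADVP)(NP)(VP)(.))"]

def augment(dataset, synpg_paraphrase):
    """Add four syntactically different paraphrases per training example."""
    augmented = list(dataset)
    for text, label in dataset:
        for template in TEMPLATES:
            augmented.append((synpg_paraphrase(text, template), label))
    return augmented
```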
Related Work

Paraphrase generation.
Traditional approaches usually require hand-crafted rules, such as rule-based methods (McKeown, 1983), thesaurus-based methods (Bolshakov and Gelbukh, 2004; Kauchak and Barzilay, 2006), and lattice matching methods (Barzilay and Lee, 2003). However, the diversity of their generated paraphrases is usually limited. Recently, neural models have achieved success in paraphrase generation (Prakash et al., 2016; Mallinson et al., 2017; Cao et al., 2017; Egonmwan and Chali, 2019; Li et al., 2019; Gupta et al., 2018). These approaches treat paraphrase generation as a translation task and design seq2seq models based on a large amount of parallel data. To reduce the effort of collecting parallel data, unsupervised paraphrase generation has attracted attention in recent years. Wieting et al. (2017) and Wieting and Gimpel (2018) use translation models to generate paraphrases via back-translation. Zhang et al. (2019) and Roy and Grangier (2019) generate paraphrases based on variational autoencoders. Reinforcement learning techniques have also been considered for paraphrase generation (Li et al., 2018).
Controlled generation.
Recent work on controlled generation can be grouped into two families. The first family performs end-to-end training with an additional trigger to control attributes such as sentiment (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018; Peng et al., 2018; Dai et al., 2019), tense (Logeswaran et al., 2018), plots (Ammanabrolu et al., 2020; Fan et al., 2019; Tambwekar et al., 2019; Yao et al., 2019; Goldfarb-Tarrant et al., 2019, 2020), societal bias (Wallace et al., 2019; Sheng et al., 2020b,a), and syntax (Iyyer et al., 2018; Goyal and Durrett, 2020; Kumar et al., 2020). The second family controls attributes by learning disentangled representations. For example, Romanov et al. (2019) disentangle the meaning and the form of a sentence, while Chen et al. (2019b,a) and Bao et al. (2019) disentangle the semantics and the syntax of a sentence.
Conclusion

We present the Syntactically Controlled Paraphrase Generator (SynPG), a paraphrase model that can control the syntax of generated paraphrases based on given syntactic specifications. SynPG is designed to disentangle the semantics and the syntax of sentences. The disentanglement enables SynPG to be trained without the need for annotated paraphrase pairs. Extensive experiments show that SynPG performs better syntactic control than unsupervised baselines, while the quality of the generated paraphrases is competitive with supervised approaches. Finally, we demonstrate that SynPG can improve the robustness of NLP models by generating additional training examples. SynPG is especially helpful for domains where annotated paraphrases are hard to obtain but a large amount of unannotated text is available. One limitation of SynPG is the need to manually provide target syntactic templates at inference time; we leave automatic template generation for future work.
Acknowledgments
We thank the anonymous reviewers for their helpful feedback. We thank Kashif Shah and the UCLA-NLP group for the valuable discussions and comments. This work is supported in part by an Amazon Research Award.

References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In EMNLP.
Prithviraj Ammanabrolu, Ethan Tien, W. Cheung, Z. Luo, W. Ma, Lara J. Martin, and Mark O. Riedl. 2020. Story realization: Expanding plot events into sentences. In AAAI.
Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xin-Yu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. In ACL.
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In NAACL.
Igor A. Bolshakov and Alexander F. Gelbukh. 2004. Synonymous paraphrasing using WordNet and Internet. In NLDB.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.
Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint copying and restricted generation for paraphrase. In AAAI.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.
Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019a. Controllable paraphrase generation with a syntactic exemplar. In ACL.
Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019b. A multi-task approach for disentangling syntax and semantics in sentence representations. In NAACL.
Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent representation. In ACL.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In ACL.
Elozino Egonmwan and Yllias Chali. 2019. Transformer and seq2seq model for paraphrase generation. In EMNLP.
Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In ACL.
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI.
Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. Content planning for neural story generation with aristotelian rescoring. In EMNLP.
Seraphina Goldfarb-Tarrant, Haining Feng, and Nanyun Peng. 2019. Plan, write, and revise: An interactive system for open-domain story generation. In NAACL system demonstration.
Tanya Goyal and Greg Durrett. 2020. Neural syntactic preordering for controlled paraphrase generation. In ACL.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In AAAI.
J. Edward Hu, Rachel Rudinger, Matt Post, and Benjamin Van Durme. 2019. PARABANK: Monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In AAAI.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora dataset release: Question pairs. data.quora.com.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL.
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI.
Jerrold J. Katz and Jerry A. Fodor. 1963. The structure of a semantic theory. Language, 39(2):170–210.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In NAACL.
Stanley Kok and Chris Brockett. 2010. Hitting the right paraphrases in good time. In NAACL.
Ashutosh Kumar, Kabir Ahuja, Raghuram Vadapalli, and Partha P. Talukdar. 2020. Syntax-guided controlled generation of paraphrases. TACL, 8:330–345.
Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In EMNLP.
Zichao Li, Xin Jiang, Lifeng Shang, and Qun Liu. 2019. Decomposable neural paraphrase generation. In ACL.
Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In NeurIPS.
Nitin Madnani, Joel R. Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In NAACL.
Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In EACL.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.
Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Am. J. Comput. Linguistics, 9(1):1–10.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In WMT.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. 2018. Towards controllable story generation. In NAACL Workshop.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek V. Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual LSTM networks. In COLING.
Alexey Romanov, Anna Rumshisky, Anna Rogers, and David Donahue. 2019. Adversarial decomposition of text representation. In NAACL.
Aurko Roy and David Grangier. 2019. Unsupervised paraphrasing without translation. In ACL.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NeurIPS.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020a. "Nice try, kiddo": Ad hominems in dialogue systems. arXiv preprint arXiv:2010.12820.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020b. Towards controllable biases in language generation. In EMNLP-Findings.
Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, B. Harrison, and Mark O. Riedl. 2019. Controllable neural story plot generation via reward shaping. In IJCAI.
Samson Tan, Shafiq R. Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In ACL.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.
John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL.
John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. In EMNLP.
Yorick Wilks. 1975. An intelligent analyzer and understander of English. Commun. ACM, 18(5):264–274.
Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou. 2016. DocChat: An information retrieval approach for chatbot engines using unstructured documents. In ACL.
Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In AAAI.
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, and Lawrence Carin. 2019. Syntax-infused variational autoencoder for text generation. In ACL.
Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, and Bambang Parmanto. 2018. Integrating transformer and paraphrase rules for sentence simplification. In EMNLP.