Non-Autoregressive Text Generation with Pre-trained Language Models
Yixuan Su♦  Deng Cai♥  Yan Wang♠  David Vandyke♣  Simon Baker♦  Piji Li♠  Nigel Collier♦
♦Language Technology Lab, University of Cambridge  ♥The Chinese University of Hong Kong  ♠Tencent AI Lab  ♣Apple
{ys484,sb895,nhc30}@cam.ac.uk, [email protected], [email protected], {brandenwang,pijili}@tencent.com

Abstract
Non-autoregressive generation (NAG) has recently attracted great attention due to its fast inference speed. However, the generation quality of existing NAG models still lags behind their autoregressive counterparts. In this work, we show that BERT can be employed as the backbone of a NAG model to greatly improve performance. Additionally, we devise mechanisms to alleviate the two common problems of vanilla NAG models: the inflexibility of prefixed output length and the conditional independence of individual token predictions. Lastly, to further increase the speed advantage of the proposed model, we propose a new decoding strategy, ratio-first, for applications where the output lengths can be approximately estimated beforehand. For a comprehensive evaluation, we test the proposed model on three text generation tasks, including text summarization, sentence compression and machine translation. Experimental results show that our model significantly outperforms existing non-autoregressive baselines and achieves competitive performance with many strong autoregressive models. In addition, we conduct extensive analysis experiments to reveal the effect of each proposed component. All related code, data, and models can be found at https://github.com/yxuansu/NAG-BERT.

Introduction

Autoregressive generation (AG) models achieve state-of-the-art performance on a wide range of text generation tasks, such as machine translation (Vaswani et al., 2017) and text summarization (Rush et al., 2015). Such models generate a token sequence in a left-to-right, token-by-token fashion, where the prediction for the next token is conditioned on all previously generated tokens. This characteristic makes it impossible to parallelize the computation of token predictions at different positions, which leads to a relatively high inference latency. On the other hand, non-autoregressive generation (NAG) models (Gu et al., 2018) have emerged as a promising alternative due to their fast inference speed. NAG models omit the sequential dependencies within the output-side sequence and predict tokens at all positions simultaneously once the output length has been determined. While NAG models enjoy full parallelism and faster inference, their generation quality often lags behind that of their autoregressive counterparts.

In this work, we explore the potential of large-scale pre-trained language models for improving the performance of non-autoregressive generation. Specifically, we utilize BERT (Devlin et al., 2019) as the backbone for NAG modelling and extend the architecture of BERT with a CRF output layer (Lafferty et al., 2001; Sun et al., 2019) to better capture output-side dependencies.

In addition, we analyze two significant limitations that NAG models currently suffer from: (1) the inflexibility of prefixed output length, and (2) the conditional independence of individual token predictions. Accordingly, we devise solutions to these two problems.

First, prior NAG models require the output length to be determined before token generation, so an extra module for output length prediction is always required. Nevertheless, the most likely length from the prediction module is not necessarily the best-suited one for the token generation model. To this end, previous works (Gu et al., 2018; Ma et al., 2019) usually rely on length-parallel decoding (LPD) (Wei et al., 2019) for performance enhancement; that is, generating and re-ranking the results from different output length candidates.
In this work, we propose a simple and elegant decoding mechanism that lets the model determine the output length on-the-fly. Specifically, our model dynamically adjusts the output sequence length by emitting an [eos] token at any output position to indicate the end of the generated sequence. Therefore, we can avoid the additional effort of output length prediction and result re-ranking.

Second, most existing NAG models assume that the token predictions at different positions are conditionally independent. As a consequence, they often tend to generate results that are ungrammatical and repetitive (Wang et al., 2019b). To alleviate this problem, we propose a context-aware learning objective which impels the model to output different tokens at adjacent positions, thereby reducing the possibility of repetitive generation.

Furthermore, for tasks like text summarization, the output sequence (summary) is known to be shorter than the source sequence (article). In such cases, to further improve the model's inference efficiency, we introduce a new ratio-first decoding strategy. Specifically, instead of performing inference over all source-side hidden states, ratio-first generates the result based only on a subset of the source hidden states. The subset size is jointly determined by the source length T and a predefined ratio α that is set based on our prior knowledge of the data statistics. In the experiments, we show that ratio-first can significantly improve the inference speed while maintaining the generation quality.

We evaluate the proposed model on three typical text generation tasks, including text summarization, sentence compression and machine translation. Experimental results show that our model significantly outperforms many strong non-autoregressive baselines, and even performs competitively with several strong autoregressive models. In addition, we conduct extensive analysis experiments to study the effect of individual proposed components.

In summary, our contributions are: (1) we propose a novel framework that utilizes BERT for text generation under the non-autoregressive generation paradigm; (2) we propose a decoding mechanism that allows the model to dynamically determine the output length, together with a new context-aware learning objective that reduces errors stemming from the output-side conditional independence assumption; and (3) we introduce a ratio-first decoding strategy that further improves the model's inference efficiency.

Background

Autoregressive generation (AG) models generate sequences based on a left-to-right factorization. As shown in Figure 1, given the source sequence X, the target sequence Y with length T′ is generated via a chain of conditional probabilities that follow the left-to-right sequential dependencies:

p(Y|X) = \prod_{i=1}^{T'} p(y_i | y_{<i}, X),   (1)

where y_{<i} denotes the tokens generated before position i. In contrast, non-autoregressive generation (NAG) models drop these sequential dependencies and predict the tokens at all positions simultaneously:

p(Y|X) = \prod_{i=1}^{T'} p(y_i | X).   (2)

Figure 1: (a) Autoregressive; (b) Non-Autoregressive
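To make the contrast between the two factorizations in Equations (1) and (2) concrete, the toy sketch below mimics the two decoding loops. Here ag_step and nag_step stand in for arbitrary scoring functions; they are not part of the proposed model.

```python
# Toy contrast between autoregressive and non-autoregressive decoding (illustrative only).
def autoregressive_decode(source, ag_step, max_len):
    out = []
    for _ in range(max_len):
        # Each prediction conditions on every previously generated token.
        out.append(ag_step(source, out))
    return out

def non_autoregressive_decode(source, nag_step, length):
    # Every position is predicted independently, so this loop can run fully in parallel.
    return [nag_step(source, i) for i in range(length)]

# Dummy usage with stand-in scoring functions.
print(autoregressive_decode("src", lambda s, prefix: len(prefix), max_len=4))
print(non_autoregressive_decode("src", lambda s, i: i, length=4))
```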
The architecture of the proposed model is presented in Figure 2, in which the embedding layer and the stack of transformer layers are initialized with BERT (Devlin et al., 2019).
Input Representation
Following the setup of BERT, we first append a [cls] and a [sep] token on the two sides of the source sequence. Then we attach a number of [pad] tokens to the end of the source sequence to make its length equal to a predefined maximum size (e.g., 256). This ensures that the source length is longer than or equal to the output length. As a special case, for tasks like text summarization where the source is known to be longer than the target, we do not attach the [pad] tokens when constructing the input.
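A minimal sketch of this input construction, assuming the Huggingface bert-base-uncased tokenizer mentioned in the experiments; max_len and the function name are ours:

```python
# Sketch of the input construction described above (not the authors' exact code).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_source_ids(source_text: str, max_len: int = 256, pad_to_max: bool = True):
    # [cls] and [sep] are added around the source sequence.
    ids = tokenizer.encode(source_text, add_special_tokens=True)
    ids = ids[:max_len]
    if pad_to_max:
        # Attach [pad] tokens so that the source is at least as long as any target.
        ids = ids + [tokenizer.pad_token_id] * (max_len - len(ids))
    return ids

# For summarization-style tasks (source already longer than target), use pad_to_max=False.
print(len(build_source_ids("nag models predict all tokens in parallel .")))
```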
Transformer Layers
Given the source sequence X, it is processed by a stack of N transformer (Vaswani et al., 2017) layers. Formally, the multi-head attention is defined as MultiHead(Q, K, V), where Q, K and V denote the query, key and value respectively. The computation of the first transformer layer is then defined as:

V^{(1)} = MultiHead(E(X), E(X), E(X)),   (3)
O^{(1)} = FFN(V^{(1)}),   (4)
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2,   (5)

where E(X) = TE(X) + PE(X), in which TE(·) denotes the token embedding and PE(·) denotes the position embedding. For the other layers:

V^{(n)} = MultiHead(O^{(n-1)}, O^{(n-1)}, O^{(n-1)}),   (6)
O^{(n)} = FFN(V^{(n)}),   (7)

where n = 2, ..., N and N is the total number of transformer layers. The final sequence representation H ∈ R^{T × d_model} is the output states of BERT from the last layer, where T is the source sequence length and d_model is the model size.
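For illustration, a short sketch of obtaining the sequence representation H from a pre-trained BERT encoder, using the Huggingface implementation cited in the experiments; the variable names are ours and this assumes a recent transformers version:

```python
# Sketch: obtain H in R^{T x d_model} from BERT's last layer (illustrative, not the training code).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a short source sentence .", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
H = outputs.last_hidden_state  # shape (1, T, d_model), with d_model = 768 for bert-base
print(H.shape)
```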
CRF Layer

Then, H is passed through a linear-chain CRF (Lafferty et al., 2001). Under the CRF framework, the likelihood of the target sequence Y with length T′ is modelled as:

P_{CRF}(Y|X) = \frac{e^{S(X, Y)}}{\sum_{Y'} e^{S(X, Y')}} = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{T'} \Phi_{y_i}(h_i) + \sum_{i=2}^{T'} t(y_{i-1}, y_i) \Big),   (8)

where Z(X) is the normalizing factor and Φ_{y_i}(h_i) denotes the label score of y_i at position i. In practice, Φ is parameterized by a neural network that maps the BERT output state h_i into the label (vocabulary) space. The term t(y_{i-1}, y_i) = T_{y_{i-1}, y_i} denotes the transition score from label y_{i-1} to y_i, where T ∈ R^{|V|×|V|} is the transition matrix.
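To make Equation (8) concrete, here is a small sketch of the unnormalized sequence score S(X, Y) (label scores plus transition scores); the tensor and function names are ours, and the normalizer Z(X) is omitted since it is handled by the approximation described next.

```python
# Sketch of the CRF sequence score S(X, Y) = sum_i Phi_{y_i}(h_i) + sum_i t(y_{i-1}, y_i).
# Illustrative only; label_logits would come from a linear layer over the BERT states H.
import torch

def sequence_score(label_logits: torch.Tensor,   # (T', |V|): Phi scores per position
                   transition: torch.Tensor,     # (|V|, |V|): transition matrix T
                   y: torch.Tensor) -> torch.Tensor:  # (T',): target label ids
    label_score = label_logits[torch.arange(y.size(0)), y].sum()
    trans_score = transition[y[:-1], y[1:]].sum()
    return label_score + trans_score

# Toy usage with a 5-token vocabulary and a 4-token target.
logits = torch.randn(4, 5)
trans = torch.randn(5, 5)
print(sequence_score(logits, trans, torch.tensor([1, 3, 3, 0])))
```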
Approximation

In the context of text generation, the size of the label space (vocabulary size) |V| is typically large (tens of thousands of tokens). Therefore, it is intractable to directly model the transition matrix T and the normalizing factor Z(X). To this end, we adopt the techniques proposed by Sun et al. (2019) to approximate these two terms. Specifically, the full transition matrix is approximated by the product of two low-rank matrices, T = E_1 E_2^T, where E_1, E_2 ∈ R^{|V|×d} and d is much smaller than |V|. To compute the normalizing factor Z(X), at each time step, instead of searching through all possible paths, the number of candidates is heuristically truncated to a predefined beam size k. We refer readers to the original paper for further details.

In this section, we describe how to let the model determine the output sequence length by itself. Our basic idea is that we want the model to dynamically stop generation by emitting a special [eos] token. To achieve this, during training, we manually append two consecutive [eos] tokens to the end of the target sequence, as shown in the top left part of Figure 2. In this way, the model can learn a deterministic transition behaviour between two [eos] states, meaning that t([eos], [eos]) = max_{v∈V} t([eos], v). This is because, during training, the model never sees a transition ([eos], v) where v ≠ [eos].

During inference, the result Ỹ is acquired as Ỹ = argmax_{Y'} S(X, Y'), where the CRF scoring function S(X, Y') in Equation (8) can be decomposed as:

S(X, Y') = \sum_{i=1}^{T} \Phi_{y'_i}(h_i) + \sum_{i=2}^{T} t(y'_{i-1}, y'_i)
         = \Phi_{y'_1}(h_1) + \sum_{i=2}^{T} \big\{ \Phi_{y'_i}(h_i) + t(y'_{i-1}, y'_i) \big\},   (9)

where the first term is the initial state score and each summand combines a label score with a transition score (a state transition). Once the decoded trajectory enters the [eos] state, the state transition term in S(X, Y') will be dominated by the transition score t([eos], [eos]). As a result, the model will keep transitioning to [eos] in the remaining steps. An example is provided in the right part of Figure 2, from which we can see that, once the decoded trajectory enters the [eos] state, it remains there for the rest of the generation process. In this way, our model can dynamically control the length of the output sequence by entering the [eos] state during generation. After the entire generation process is completed, the final output sequence is obtained by removing all generated [eos] tokens.

We note that the outputs of BERT can be divided into two subsets. The first subset ranges from the beginning to the position where the first [eos] is emitted, and the second subset is the rest. For example, in Figure 2, the first subset corresponds to the output sequence "y(1) y(2) y(3) y(4) [eos]". The second part has little effect on the final output, and removing it should not change the result. This indicates that it suffices to consider only the beginning part of the BERT outputs, which improves the inference speed. In particular, for tasks like summarization where the target is known to be shorter than the source sequence, we can safely use only the first [α·T] outputs of BERT to perform inference. Here T denotes the source length, α ∈ (0, 1] is set based on the data statistics, and [·] is the integer rounding operation.
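The following toy sketch illustrates this decoding behaviour with a greedy approximation (ignoring the CRF transition scores and the beam search of Sun et al. (2019)): only the first rounded α·T BERT outputs are considered, and the output is cut at the first [eos], so the model itself decides the final length. All names are ours.

```python
# Greedy toy approximation of decoding with dynamic length and ratio-first truncation.
# The real model decodes with the approximate CRF (label + transition scores); here we
# follow label scores only, to keep the sketch short.
import torch

def greedy_ratio_first_decode(label_logits: torch.Tensor,  # (T, |V|): scores over BERT positions
                              eos_id: int,
                              alpha: float = 1.0) -> list:
    max_steps = max(1, int(round(alpha * label_logits.size(0))))
    pred = label_logits[:max_steps].argmax(dim=-1).tolist()
    # Keep everything before the first [eos]; the model decides where that position is.
    return pred[:pred.index(eos_id)] if eos_id in pred else pred

logits = torch.randn(10, 8)
print(greedy_ratio_first_decode(logits, eos_id=7, alpha=0.5))
```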
Formally, given the source sequence X, ratio-first decoding is defined as:

Ỹ = argmax_{Y'} F(X, Y', α) = argmax_{Y'} \Big\{ \sum_{i=1}^{[α·T]} \Phi_{y'_i}(h_i) + \sum_{i=2}^{[α·T]} t(y'_{i-1}, y'_i) \Big\}.   (10)

When α = 1.0, ratio-first degenerates to the standard decoding strategy of CRF-based models. It should be noted that [α·T] only constrains the maximum length of the generated result; the actual output length (after removing the generated [eos] tokens) is still decided by the model itself. In the experiment section, we demonstrate that ratio-first can notably improve the inference speed whilst maintaining the generation quality.

Due to the conditional independence approximation over output tokens, NAG models often tend to generate repeated tokens (Wang et al., 2019b). One way to alleviate this problem is to introduce implicit dependencies on the output side. In this work, we propose to use the unlikelihood formulation of Welleck et al. (2020) in the context of NAG, where we define the set of negative candidates as the surrounding tokens within a predefined context window c. Formally, given the source sequence X and the target sequence Y with length T′, the proposed context-aware objective is defined as:

L_{CA}(Y|X) = -\sum_{i=1}^{T'} \big\{ \log p_θ(y_i | h_i; X) + l_{CA}(i) \big\},
l_{CA}(i) = \sum_{j=i-c, y_j ≠ y_i}^{j=i+c} \log(1.0 - p_θ(y_j | h_i; X)),   (11)

where h_i is the model output state at position i. At position i, the proposed objective maximizes the probability of the token y_i while minimizing the probabilities of the surrounding tokens. In this way, it discourages the model from generating repetitive tokens at different time steps. The overall learning objective is then defined as:

L_{CRF} = -\log P_{CRF}(Y|X),   L = L_{CRF} + λ · L_{CA},   (12)

where λ controls the importance of the different loss terms and P_{CRF}(Y|X) is described in Equation (8).

Related Work

Non-autoregressive generation was first introduced by Gu et al. (2018) to reduce the inference latency in machine translation. Recent works in this area have investigated ways to mitigate the trade-off between decoding speed and generation quality. Gu et al. (2018) utilized fertility as latent variables for better translation performance. Wang et al. (2019b) proposed two auxiliary objectives for better modelling the output states and solving the under-translation problem. To better model the intermediate alignments between the source and target sides, Ma et al. (2019) proposed a model based on the generative flow framework. Ghazvininejad et al. (2019) proposed to use a masked language objective to train the NAG model; during inference, starting from a fully masked sequence, the output is generated in an iterative refinement manner. Recently, Sun et al. (2019) proposed to incorporate a conditional random field into the decoder of a NAG model for better modelling the output-side dependencies. Our work differs from prior works in two aspects: (1) we directly utilize a pre-trained language model (BERT) to perform non-autoregressive generation; (2) our model can dynamically generate the output sequence without the need for a prespecified output length.
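Before turning to the experiments, the toy sketch below spells out the context-aware term of Equation (11). The function and variable names are ours, and the small clamping constant is added only to keep this toy example numerically stable; it is not part of the objective.

```python
# Toy sketch of the context-aware (unlikelihood-style) term in Equation (11).
import torch
import torch.nn.functional as F

def context_aware_loss(logits: torch.Tensor,   # (T', |V|): per-position scores over the vocabulary
                       targets: torch.Tensor,  # (T',): gold token ids y_1 .. y_T'
                       c: int = 1) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    T = targets.size(0)
    total = 0.0
    for i in range(T):
        # Likelihood of the gold token at position i.
        pos_term = log_probs[i, targets[i]]
        # Unlikelihood of the surrounding gold tokens within the context window.
        neg_term = 0.0
        for j in range(max(0, i - c), min(T, i + c + 1)):
            if j != i and targets[j] != targets[i]:
                neg_term = neg_term + torch.log1p(-probs[i, targets[j]] + 1e-6)
        total = total + pos_term + neg_term
    return -total

print(context_aware_loss(torch.randn(5, 10), torch.tensor([1, 2, 2, 3, 4]), c=2))
```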
Experiments

We evaluate the proposed model on three typical text generation tasks: (1) text summarization; (2) sentence compression; and (3) machine translation.
We implement the proposed model with PyTorch (Paszke et al., 2017). The BERT model we use is the Huggingface implementation (Wolf et al., 2019) of bert-base-uncased. To approximate the transition matrix in the CRF layer, we set the dimension d of the matrices E_1 and E_2 to 32. For the normalizing factor Z(X), we set the predefined beam size k to 256. For the overall learning objective, the context window size c and the loss weight λ are set to fixed values. In training, we use the Adam optimizer (Kingma and Ba, 2015). To measure the relative speedup, we follow the standard setup which runs inference for each individual example separately. The model's inference speed is computed by averaging over the test cases. For a fair comparison, we measure the inference speed of all models on the same platform.

Text summarization aims to automatically generate a compact summary that retains the most important content of the original text document (Nenkova and McKeown, 2012). In this experiment, we use the Gigawords dataset (Rush et al., 2015) as our benchmark. For evaluation, the standard metrics ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L) (Lin, 2004) are reported.

We compare our model with several representative and recent NAG models, including NAG-NMT (Gu et al., 2018), NAR-REG (Wang et al., 2019b) and NAG-CRF (Sun et al., 2019). Following previous works, during training we train a length predictor to predict the output length. During inference, for each NAG baseline, we adopt the length-parallel decoding strategy (LPD-k) (Wei et al., 2019), that is, generating k results using the top-k possible output length predictions from the length predictor. The results are then re-ranked by a transformer model to get the final output. In the experiment, we report the results of the different NAG baselines using LPD-9 decoding. In addition, to better examine the effect of using BERT in NAG models, we add a B-NAG-CRF baseline, which adopts the same structure as the NAG-CRF model but uses BERT as the encoder. We also compare our model with several strong autoregressive models, namely Luong-NMT (Luong et al., 2015), Pointer-Generator (See et al., 2017), DRGD (Li et al., 2017) and Concept Pointer (Wang et al., 2019a). To measure the relative inference speedup, we include the transformer as a baseline model.

Models               R-1    R-2    R-L    Speedup
Autoregressive
Luong-NMT            33.10  14.45  30.71  -
Pointer-Generator    35.98  15.99  33.33  -
DRGD                 36.25  17.61  33.55  -
Concept Pointer      36.62  16.40  33.98  -
Transformer (b = 4)  35.74  16.97  33.43  1.00×
Non-Autoregressive
NAG-NMT              27.20  8.96   25.58
+LPD-9               29.76  10.03  28.04  5.28×
NAR-REG              28.56  9.79   26.83  8.64×
+LPD-9               31.23  11.14  29.55  4.74×
NAG-CRF              30.29  12.61  28.71  8.07×
+LPD-9               32.91  14.31  31.03  4.32×
B-NAG-CRF            32.63  14.32  30.82  6.13×
+LPD-9               34.56  16.10  32.76  3.21×
Ours (ratio-first)   34.67  16.13  32.81
Ours (α = 1.0)
Table 1: Results on the Gigawords dataset, where b in the transformer baseline stands for the beam search size.

The results are shown in Table 1. By using length-parallel decoding, the performance of all NAG baselines can be notably improved; however, this procedure significantly increases the inference latency. In contrast, our model can self-determine the output length without any re-ranking process. As shown in the results, our model outperforms the best NAG baseline (with LPD) and achieves performances that are comparable with several strong AG models.

Comparing the results of B-NAG-CRF and NAG-CRF, we can see that incorporating BERT as the encoder helps to improve the model performance. Nonetheless, our model still outperforms B-NAG-CRF with LPD-9 decoding. This is because the dynamic length decoding mechanism allows our model to generate results with optimal length, leading to stronger model performance.

Finally, we analyze the proposed ratio-first decoding. From the results, we observe a moderate performance drop when using ratio-first with α < 1.0. This comes from the fact that, for some input documents with length T, the reference summary is longer than [α·T]. In such cases, ratio-first fails to generate the complete reference summary, leading to the drop in performance. On the other hand, ratio-first notably improves the inference speedup: with the smaller α, our model achieves the highest inference speedup while still outperforming all compared NAG models.

Sentence compression aims at compressing a long sentence into a short one by deleting redundant words. In this experiment, we use the Google sentence compression dataset (Filippova and Altun, 2013) as our benchmark.
For evaluation, we use the standard token-kept F1 score, and we also report the standard metrics ROUGE-1, ROUGE-2 and ROUGE-L. We compare the proposed model with the same NAG baselines as in the previous experiment, as well as with several strong autoregressive models, including Bi-LSTM-Dep (Filippova et al., 2015), Tagger and Tagger+ILP (Wang et al., 2017), and HiSAN-Dep and HiSAN (Kamigaito et al., 2018). To measure the inference speedup, we include the transformer as a baseline model.

Models               F1    R-1   R-2   R-L   Speedup
Autoregressive
Bi-LSTM-Dep          82.3  81.5  74.1  81.3  -
Tagger               82.8  81.1  72.4  80.9  -
Tagger+ILP           79.0  76.1  64.6  75.8  -
HiSAN-Dep            82.7  82.1  74.9  81.9  -
HiSAN                83.2  82.9  75.8  82.7  -
Transformer (b = 4)  82.4  82.0  74.6  81.8  1.00×
Non-Autoregressive
NAG-NMT              72.5  72.1  59.9  71.8
+LPD-9               73.8  73.6  61.0  73.1  6.09×
NAG-REG              73.7  73.1  61.5  73.0  10.00×
+LPD-9               75.6  75.1  63.4  74.9  5.49×
NAG-CRF              75.1  74.4  66.8  74.2  9.41×
+LPD-9               77.3  76.5  69.0  76.3  5.04×
B-NAG-CRF            77.1  76.2  68.9  76.0  7.21×
+LPD-9               79.3  78.5  71.7  78.2  3.91×
Ours (ratio-first)   79.5  79.0  72.1  78.7  10.00×
Ours (α = 1.0)
Table 2: Results on the sentence compression task.

The results are presented in Table 2, from which we see that our model outperforms the best reported NAG baseline (with LPD) in terms of both generation quality and inference speed. Compared with the strong autoregressive models, our model achieves competitive performance with a considerable inference speedup. We also report the results of our model using the ratio-first decoding strategy: with a smaller α, it achieves a further inference speedup while still outperforming the other compared NAG baselines.

Machine translation aims at translating text from the source language to the target language. In this task, we use the IWSLT14 German-to-English (DE-EN) dataset as our benchmark. Following previous works, we use sequence-level knowledge distillation (Gu et al., 2018) during training. For evaluation, we report results in BLEU scores (Papineni et al., 2002). In this experiment, we use the BERT model for the German language.

Models               BLEU           Speedup (×)
Autoregressive
LSTM-based           28.53          -
CNN-based            32.84          -
Transformer (b = 4)  33.31          1.00
Non-Autoregressive
ENAG-E               24.13 (27.30)  15.08 (7.39)
ENAG-P               25.09 (28.60)  14.48 (7.24)
NAG-REG              23.89 (28.04)  16.45 (9.05)
NAG-NMT              23.04 (26.79)  13.92 (7.24)
NAG-CRF              26.39 (29.21)  11.74 (6.03)
B-NAG-CRF            26.73 (29.67)  9.42 (5.01)
Ours (ratio-first)   29.71          13.92
Ours (α = 1.0)
Table 3: Results on the IWSLT14 DE-EN dataset. The numbers in parentheses are results using length-parallel decoding.
BERT  CRF   R-1   R-2   R-L
✓     ✓
✗     ✓
✓     ✗
✗     ✗
Table 4: Ablation study on the Gigawords dataset.
We compare our model with a range of strong NAG models, including NAG-NMT (Gu et al., 2018), ENAG-E and ENAG-P (Guo et al., 2019), NAG-REG (Wang et al., 2019b), NAG-CRF (Sun et al., 2019) and B-NAG-CRF. For each NAG baseline, we also report the results using length-parallel decoding. In addition, we compare our model with several strong autoregressive models, including an LSTM-based model (Wu et al., 2016), a CNN-based model (Gehring et al., 2017) and the transformer model.

The results are shown in Table 3, from which we see that our model outperforms the best NAG baseline (with LPD) in terms of both generation quality and inference speedup. Additionally, we report the results using ratio-first decoding: with a smaller α, the inference speedup can be further boosted while the generation quality remains higher than that of the best NAG baseline.

In this section, we present further discussions and empirical analysis of the proposed model.
BERT & CRF
To quantify the importance of each component (BERT and CRF) of our model, we evaluate the performance on the Gigawords dataset by removing each component in turn. The results are shown in Table 4, from which we can see that removing any of these components degrades the overall performance. By removing BERT from the model, we observe a notable drop across all metrics, which shows that the knowledge in BERT is an important factor behind the model's strong performance. Compared with the results in Table 1, this variant still outperforms the vanilla NAG-CRF and performs comparably with NAG-CRF using LPD decoding, which demonstrates the merit of the proposed dynamic length decoding mechanism. Another interesting finding is that, by removing only the CRF layer, the most notable drop is observed on the bigram-level metric (ROUGE-2). This shows that the bigram-level dependencies on the output side are mainly captured by the CRF module. In addition, by removing both BERT and CRF, all metrics decrease further. This confirms that each of these two components contributes positively to the model's overall performance.

Models       rep-1  rep-2  rep-3  rep-4  R-L
w/o CA       6.897  2.640  0.741  0.295  32.89
Ours         5.786  1.978  0.427  0.106  33.28
Transformer  4.329  1.348  0.267  0.089  33.43
Table 5: Evaluation results on n-gram repetitions.

Context-Aware Objective
In this part, we study the effect of the context-aware objective. As described in Equation (11), it aims at alleviating the problem of repetitive generation. To give a quantitative analysis, we use the sentence-level repetition measurement (Welleck et al., 2020) to compute the ratio of duplicate n-grams (rep-n) in the generated results. This metric is defined as:

rep-n(Y) = 100 × \Big( 1.0 - \frac{|\text{unique } n\text{-grams}(Y)|}{|n\text{-grams}(Y)|} \Big).   (13)

For each generated result, rep-n is 0 when it contains no repeating n-grams. The final score is computed by averaging over the entire evaluation set.

We conduct experiments on the Gigawords dataset to evaluate the n-gram repetitions ranging from uni-grams to 4-grams. The results are shown in Table 5, where w/o CA means the model is trained without the context-aware objective and R-L denotes the model's ROUGE-L score. Additionally, we show the results of the transformer model for a direct comparison. Comparing the two variants of our model, we see that training with the context-aware objective leads to a 42% drop on the rep-3 metric (0.427 vs. 0.741) and a 64% drop on the rep-4 metric (0.106 vs. 0.295). The ROUGE-L results also indicate that the reduction in token repetition can effectively improve the generation quality.
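For completeness, the rep-n measurement of Equation (13) can be computed with a few lines of code; this is our own sketch rather than the evaluation script used for the reported numbers.

```python
# Sketch of the sentence-level repetition metric rep-n from Equation (13).
def rep_n(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

# Repeated bigram "the cat" -> 100 * (1 - 4/5) = 20.0
print(rep_n("the cat sat on the cat".split(), 2))
```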
Dynamic Length Determination
Next, we examine the importance of the model's ability to dynamically determine the length of the generated output. To this end, we train another model variant in which the two [eos] tokens are not appended to the target sequence, so the model cannot self-determine the output length during generation. To perform inference, we use length-parallel decoding (LPD) with different numbers of length candidates. Formally, for each length candidate l, the model generates the result Ỹ as:

Ỹ = argmax_{Y'} \Big\{ \sum_{i=1}^{l} \Phi_{y'_i}(h_i) + \sum_{i=2}^{l} t(y'_{i-1}, y'_i) \Big\}.   (14)

The final result is acquired by re-ranking the generated candidates with a transformer model. We conduct experiments on the IWSLT14 DE-EN dataset, in which we try different numbers of length candidates (top-k predictions from the length predictor, for three values of k).

Models       Ours (α = 1.0)  LPD (smallest k)  LPD (middle k)  LPD (largest k)
BLEU
Speedup (×)  11.31           11.84             8.92            6.01
Table 6: Results comparison on the IWSLT14 dataset.

The results are shown in Table 6, from which we can see that, as the number of length candidates increases, the model performance increases as well. The reason is that a larger candidate set is more likely to contain the best-suited length for the generation model, leading to better performance. However, such a decoding procedure inevitably increases the required computation overhead: with the largest number of length candidates, the inference speedup drops markedly. In contrast, our proposed model is able to determine the optimal output length by itself. Without any re-ranking process, it outperforms the model variant that uses LPD with the largest candidate set, while achieving an inference speedup comparable to the variant using the smallest candidate set.

Ratio-First Decoding
We are also interested in the effect of the ratio-first decoding strategy. To provide a quantitative analysis, we perform inference on the Gigawords dataset using ratio-first with different values of α; the results are presented in Figure 3. It can be observed that, once α grows past a relatively small value, the model approximately reaches its optimal performance, while a notable improvement in the inference speedup is obtained at the same time.

Figure 3: Experimental results on the Gigawords dataset using ratio-first decoding with different α.
Figure 4: The distribution of the target/source length ratio over the training and test sets of the Gigawords dataset.

We now illustrate why near-optimal performance can be achieved with a small α. In Figure 4, we present the distribution of the target/source length ratio of the data instances in the Gigawords dataset. We can see that, for most cases, the ratio between the target length T′ and the source length T is small. Recall from the definition of ratio-first decoding in Equation (10) that [α·T] constrains the maximum length of the generated result. Therefore, once we have prior knowledge of the data statistics, we can easily choose a proper α that improves the inference speed whilst maintaining the generation quality, as demonstrated by the results in Figures 3 and 4. By setting different α, ratio-first provides an explicit way to control the balance between inference speed and generation quality. This property of ratio-first is especially favorable in real-life scenarios where inference speed is the highest concern.

Conclusion
In this work, we explored the potential of BERT for various text generation tasks under the NAG framework. To remove the need for a prefixed output length in NAG models, we devised a decoding mechanism that enables the model to determine the output length dynamically. To reduce errors stemming from the assumption of conditional independence between output tokens, we proposed a context-aware learning objective as well as a CRF output layer for decoding. Furthermore, to maximize the inference speed advantage of our model, we introduced a ratio-first decoding strategy. We evaluated our model on three benchmark datasets, and the results show that it significantly outperforms many strong NAG baselines and performs comparably to many strong AG models.
Acknowledgments
The authors wish to thank Jialu Xu, Guanlin Li and Xing Wang for their insightful discussions and support. Many thanks to our anonymous reviewers for their suggestions and comments.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of EMNLP 2015, pages 360–368.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of EMNLP 2013, pages 1481–1491.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML 2017, pages 1243–1252.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of EMNLP-IJCNLP 2019, pages 6111–6120.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In Proceedings of ICLR 2018.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of AAAI 2019, pages 3723–3730.

Hidetaka Kamigaito, Katsuhiko Hayashi, Tsutomu Hirao, and Masaaki Nagata. 2018. Higher-order syntactic attention network for longer sentence compression. In Proceedings of NAACL-HLT 2018, pages 1716–1726.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR 2015.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001, pages 282–289.

Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017. Deep recurrent generative decoder for abstractive text summarization. In Proceedings of EMNLP 2017, pages 2091–2100.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out, page 10.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015, pages 1412–1421.

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of EMNLP-IJCNLP 2019, pages 4281–4291.

Ani Nenkova and Kathleen R. McKeown. 2012. A survey of text summarization techniques. In Mining Text Data, pages 43–76.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP 2015, pages 379–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL 2017, pages 1073–1083.

Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhi-Hong Deng. 2019. Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 3011–3020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 5998–6008.

Liangguo Wang, Jing Jiang, Hai Leong Chieu, Chen Hui Ong, Dandan Song, and Lejian Liao. 2017. Can syntax help? Improving an LSTM-based sentence compression model for new domains. In Proceedings of ACL 2017, pages 1385–1393.

Wenbo Wang, Yang Gao, Heyan Huang, and Yuxiang Zhou. 2019a. Concept pointer network for abstractive summarization. In Proceedings of EMNLP-IJCNLP 2019, pages 3074–3083.

Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019b. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of AAAI 2019, pages 5377–5384.

Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. 2019. Imitation learning for non-autoregressive neural machine translation. In Proceedings of ACL 2019, pages 1304–1312.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In Proceedings of ICLR 2020.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144.