Faster Re-translation Using Non-Autoregressive Model For Simultaneous Neural Machine Translation
Hyojung Han*, Sathish Indurthi*, Mohd Abbas Zaidi, Nikhil Kumar Lakumarapu, Beomseok Lee, Sangha Kim, Chanwoo Kim, Inchul Hwang
Samsung Research, Seoul, South Korea
{h.j.han, s.indurthi, abbas.zaidi, n07.kumar, bsgunn.lee, sangha01.kim, chanw.com, inc.hwang}@samsung.com
* Equal contribution

Abstract
Recently, simultaneous translation has gathered a lot of attention since it enables compelling applications such as subtitle translation for a live event or real-time video-call translation. Some of these translation applications allow editing of the partial translation, giving rise to re-translation approaches. The current re-translation approaches are based on autoregressive sequence generation models (ReTA), which generate target tokens in the (partial) translation sequentially. The multiple re-translations with sequential generation in ReTA models lead to an increased inference time gap between the incoming source input and the corresponding target output as the source input grows. Besides, due to the large number of inference operations involved, the ReTA models are not favorable for resource-constrained devices. In this work, we propose a faster re-translation system based on a non-autoregressive sequence generation model (FReTNA) to overcome the aforementioned limitations. We evaluate the proposed model on multiple translation tasks; it reduces the inference time by several orders of magnitude and achieves a competitive BLEU score compared to the ReTA and streaming (Wait-k) models. The proposed model reduces the average computation time by a factor of 20 compared to the ReTA model while incurring a small drop in translation quality. It also outperforms the streaming-based Wait-k model both in terms of computation time (1.5 times lower) and translation quality.
Simultaneous Neural Machine Translation (SNMT) addresses the problem of real-time interpretation in machine translation. In order to achieve live translation, an SNMT model alternates between reading the source sequence and writing the target sequence using either a fixed or an adaptive policy. Streaming SNMT models can only append tokens to a partial translation as more source tokens become available, with no possibility of revising the existing partial translation. A typical application is conversational speech translation, where target tokens must be appended to the existing output.

For certain applications, such as live captioning on videos, not revising the existing translation is overly restrictive. Given that we can revise the previous (partial) translation, simply re-translating each successive source prefix becomes a viable strategy. A re-translation based strategy is not restricted to preserving the previous translation, leading to high translation quality. The current re-translation approaches are based on autoregressive sequence generation models (ReTA), which generate target tokens in the (partial) translation sequentially. As the source sequence grows, the multiple re-translations with sequential generation in ReTA models lead to an increased inference time gap, causing the translation to fall out of sync with the input stream. Besides, due to the large number of inference operations involved, the ReTA models are not favorable for resource-constrained devices.

In this work, we build a re-translation based simultaneous translation system using non-autoregressive sequence generation models to reduce the computation cost during inference. The proposed system generates the target tokens in parallel whenever new source information arrives; hence, it reduces the number of inference operations required to generate the final translation. To compare the effectiveness of the proposed approach, we implement the re-translation based autoregressive (ReTA) (Arivazhagan et al. 2019a) and Wait-k (Ma et al. 2019) models along with our proposed system. Our experimental results reveal that the proposed model achieves significant gains over the ReTA and Wait-k models in terms of computation time while maintaining the superior translation quality of re-translation over the streaming based approaches.

Revising the existing output can cause textual instability in re-translation based approaches. Previous approaches proposed a stability metric, Normalized Erasure (NE) (Arivazhagan et al. 2019a), to capture this instability. However, NE only considers the first point of difference between a pair of translations and fails to quantify the textual instability experienced by the user. In this work, we propose a new stability metric, Normalized Click-N-Edit (NCNE), which better quantifies textual instabilities by considering the number of insertions/deletions/replacements between a pair of translations.

The main contributions of our work are as follows:
• We propose a re-translation based simultaneous translation system to reduce the high inference time of current re-translation approaches.
• We propose a new stability metric, Normalized Click-N-Edit, which is more sensitive to flickers in the translation than the existing stability metric, Normalized Erasure.
• We conduct several experiments on simultaneous text-to-text translation tasks and establish the efficacy of the proposed approach.

Figure 1: Overview of the proposed FReTNA, illustrated with a German-to-English example.
We briefly describe the simultaneous translation setting to define the problem and set up the notation. The source and the target sequences are represented as $x = \{x_1, x_2, \cdots, x_S\}$ and $y = \{y_1, y_2, \cdots, y_T\}$, with $S$ and $T$ being the lengths of the source and the target sequences. Unlike offline neural machine translation (NMT) models, simultaneous neural machine translation (SNMT) models produce the target sequence concurrently with the growing source sequence. In other words, the probability of predicting the target token at time $t$ depends only on the partial source sequence $(x_1, \cdots, x_{g(t)})$. The probability of predicting the entire target sequence $y$ is given by:

$$p_g(y \mid x) = \prod_{t=1}^{T} p\big(y_t \mid \mathcal{E}(x_{\leq g(t)}), \mathcal{D}(y_{<t})\big), \qquad (1)$$

where $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder of the translation model and $g(t)$ is the number of source tokens read before predicting $y_t$.

The proposed model, FReTNA, is based on the Levenshtein Transformer (LevT) (Gu, Wang, and Zhao 2019) and is illustrated with a German-English example in Figure 1. We describe the main components of LevT and the proposed changes that enable smoother re-translation in the following paragraphs.

The LevT model generates all the tokens in the translation in parallel and iteratively modifies the translation using insertion/deletion operations. These operations are realized by the Placeholder classifier, Token classifier, and Deletion classifier components in the Transformer decoder. The insertion operations are carried out by the placeholder and token classifiers: the placeholder classifier finds the positions at which to insert new tokens, and the token classifier fills these positions with actual tokens from the vocabulary $\nu$. The deletion operations are performed by the deletion classifier. The inputs to these classifiers come from the Transformer encoder ($T_E$) and decoder ($T_D$) blocks and are computed as:

$$e^l_0, \cdots, e^l_{g(t)} = \begin{cases} E_{x_0} + P_0, \cdots, E_{x_{g(t)}} + P_{g(t)}, & l = 0 \\ T_E(e^{l-1}_0, \cdots, e^{l-1}_{g(t)}), & l = 1, \cdots, L \end{cases} \qquad (3)$$

$$h^l_0, \cdots, h^l_t = \begin{cases} E_{y_0} + P_0, \cdots, E_{y_t} + P_t, & l = 0 \\ T_D(\{h^{l-1}_0, e^{l-1}_{\leq g(t)}\}, \cdots, \{h^{l-1}_t, e^{l-1}_{\leq g(t)}\}), & l = 1, \cdots, L \end{cases} \qquad (4)$$

where $E \in \mathbb{R}^{|\nu| \times d}$ and $P \in \mathbb{R}^{N_{max} \times d}$ are the word and position embeddings of a token. The decoder outputs from the last layer ($h^L_t$) are passed to the three classifiers to edit the previous translation by performing insertion/deletion operations. These operations are repeated whenever new source information arrives.

Placeholder classifier: It predicts the number of tokens to be inserted between every two consecutive tokens in the current partial translation. Compared to LevT's placeholder classifier, we incorporate a positional bias, given by the second term in Eq. 5. As the predicted sequence length grows, the bias becomes stronger and the model inserts fewer tokens at the start, reducing flicker. The placeholder classifier with positional bias is given by:

$$\pi^{pc}_\theta(p \mid i, x_{\leq g(t)}, y_i) = \mathrm{softmax}\big(\alpha \cdot h B^T + (1 - \alpha)\, q\big), \quad q_k = \gamma_i^{\,k+1}, \quad i \in \{0, \cdots, t-1\}, \; k \in \{0, \cdots, K-1\}, \qquad (5)$$

where $h = [h^L_i : h^L_{i+1}]$, $B \in \mathbb{R}^{K \times d}$, and $\gamma_i = \frac{t-i}{t}$. Based on the number (0 to $K-1$) of tokens predicted by Eq. 5, we insert that many placeholders (<plh>) at the current position $i$; this is computed for all positions in the (partial) translation of length $t$. Here, $K$ represents the maximum number of insertions between two tokens, and $\alpha$ is a learnable parameter which balances the predictions based on the hidden states and the (partial) translation length.
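To make the positional-bias computation in Eq. 5 concrete, the sketch below shows one possible PyTorch-style implementation of the placeholder classifier head. The module and tensor names (PlaceholderClassifier, hidden sizes, max_insertions) are illustrative choices of ours, not the authors' released code; in particular, the projection B is taken here as acting on the concatenated pair of adjacent states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaceholderClassifier(nn.Module):
    """Predicts how many placeholders (0..K-1) to insert between adjacent
    decoder states, with the positional bias of Eq. 5 (a sketch, not the
    authors' implementation)."""

    def __init__(self, d_model: int, max_insertions: int):
        super().__init__()
        self.K = max_insertions
        # B in Eq. 5; assumed K x 2d here so it can act on [h_i : h_{i+1}].
        self.B = nn.Linear(2 * d_model, max_insertions, bias=False)
        # alpha balances hidden-state evidence against the length-based bias.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (t, d_model) decoder states of the current partial translation.
        t = h.size(0)
        pairs = torch.cat([h[:-1], h[1:]], dim=-1)            # (t-1, 2*d_model)
        logits = self.B(pairs)                                 # (t-1, K)
        # gamma_i = (t - i) / t, one value per adjacent pair i = 0..t-2.
        gamma = (t - torch.arange(t - 1, dtype=h.dtype)) / t   # (t-1,)
        k = torch.arange(self.K, dtype=h.dtype)                # (K,)
        bias = gamma.unsqueeze(1) ** (k + 1)                   # q_k = gamma_i^(k+1)
        scores = self.alpha * logits + (1.0 - self.alpha) * bias
        return F.softmax(scores, dim=-1)                       # insertion distribution


# Usage: pick the most likely number of placeholders per position.
clf = PlaceholderClassifier(d_model=512, max_insertions=4)
probs = clf(torch.randn(6, 512))
num_insertions = probs.argmax(dim=-1)   # one count per adjacent token pair
```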
Token classifier: The token classifier is similar to LevT's token classifier; it fills in tokens for all the placeholders inserted by the placeholder classifier:

$$\pi^{tc}_\theta(v \mid i, x_{\leq g(t)}, y_i) = \mathrm{softmax}(h^L_i C^T), \quad \forall y_i = \phi, \qquad (6)$$

where $C \in \mathbb{R}^{|\nu| \times d}$ and $\phi$ is the placeholder token.

Deletion classifier: It scans over the hidden states ($h^L_1, \cdots, h^L_t$), except for the start and end tokens, and predicts whether to keep (1) or delete (0) each token in the (partial) translation. Similar to the placeholder classifier, we add a positional bias to the deletion classifier to discourage the deletion of the initial tokens of the translation as the source sequence grows. The deletion classifier with positional bias is given by:

$$\pi^{dc}_\theta(d \mid i, x_{\leq g(t)}, y_i) = \mathrm{softmax}\big(\beta \cdot h^L_i A^T + (1 - \beta)\, \gamma_i \cdot l\big), \quad l = [0, 1], \; i \in \{1, \cdots, t\}, \qquad (7)$$

where $A \in \mathbb{R}^{2 \times d}$, $\beta$ is a learnable parameter, and we always keep the boundary tokens (<s>, </s>).

The model with these modified placeholder and deletion classifiers focuses more on appending to the partial translation whenever new source information comes in, which results in smoother translations with lower textual instability.

The insertion and deletion operations are complementary; hence, we combine them in an alternating fashion. In each iteration, we first call the placeholder classifier, followed by the token classifier and the deletion classifier. We repeat this process until a stopping condition is met, i.e., the generated translation is the same in consecutive iterations or MAX iterations are reached. In our experiments, we found that two iterations of insertion-deletion operations are sufficient when generating the partial translation for newly arrived source information. To produce the partial translation, the model incurs a cost of $2 \cdot Z$, where $Z$ is the cost of one round of insertion-deletion operations and equals $C + \epsilon$, since we use a similar decoding layer. The overall time complexity of our model (FReTNA) is $O(\frac{S}{s} \times Z)$, where $\frac{S}{s}$ is the number of re-translation steps, since all the target tokens are generated in parallel. The FReTNA computational cost is $\sim T$ times less than that of the ReTA model.

We use imitation learning to train FReTNA, similar to the Levenshtein Transformer. Unlike Arivazhagan et al. (2020a), which is trained on prefix sequences along with the full-sentence corpus, we train the model on the full-sequence corpus only. The expert policy used for imitation learning is derived from a sequence-level knowledge distillation process (Kim and Rush 2016). More precisely, we first train an autoregressive model on the same datasets and then replace the original target sequences with the beam-search outputs of this model. Please refer to Gu, Wang, and Zhao (2019) for more details on imitation learning for the LevT model.

At inference time, we greedily (beam size = 1) apply the trained model over the streaming input sequence. For every set of new source tokens, we apply the insertion and deletion policies and pick the actions with the highest probabilities in Eq. 5, 6, and 7.
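The following sketch illustrates the alternating insert-delete refinement loop described above. The helper methods insert_placeholders, fill_tokens, and delete_tokens are hypothetical stand-ins for the three classifier heads; the names and stopping logic mirror the description rather than any actual codebase.

```python
MAX_ITERS = 2  # two insertion-deletion iterations were found to be sufficient

def refine_translation(prev_tokens, src_prefix, model):
    """Revise the previous partial translation for the current source prefix
    by alternating insertion and deletion (a sketch of the described loop)."""
    hyp = list(prev_tokens)
    for _ in range(MAX_ITERS):
        before = list(hyp)
        # 1) Placeholder classifier: decide how many <plh> to insert per gap (Eq. 5).
        hyp = model.insert_placeholders(hyp, src_prefix)
        # 2) Token classifier: replace every <plh> with a vocabulary token (Eq. 6).
        hyp = model.fill_tokens(hyp, src_prefix)
        # 3) Deletion classifier: drop tokens marked for deletion (Eq. 7),
        #    always keeping the boundary tokens <s> and </s>.
        hyp = model.delete_tokens(hyp, src_prefix)
        if hyp == before:  # converged: no change in consecutive iterations
            break
    return hyp
```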
During re-translation based simultaneous translation, the partial translations are inherently revised when a new set of input tokens arrives; hence, we apply only two iterations of the insertion-deletion sequence on the current partial translation. We also impose a penalty on the current partial translation to match the prefix of the previous translation, by subtracting a penalty $\eta$ from the logits in Eq. 5 and Eq. 7. (The time complexities provided for the ReTA and FReTNA models are for comparison and do not represent the actual computational costs.)

Figure 2: Quality v/s latency plots of the FReTNA, ReTA, and Wait-k models for the DeEn and EnDe language pairs with different stability constraints.

One important property of re-translation based models is that they should produce the translation output with as few textual instabilities, or flickers, as possible; otherwise, the frequent changes in the output can be distracting to users. The ReTA model (Arivazhagan et al. 2020a) uses Normalized Erasure (NE) as a stability measure, following Niehues et al. (2018a) and Arivazhagan et al. (2020); it measures the length of the suffix that has to be deleted from the previous partial translation to produce the current translation. However, this metric does not account for the actual number of insertions/deletions/replacements, which provides a much better measure of the visual instability. In Table 1, NE assigns the same penalty to both current translations; however, current translation 2 would clearly cause more visual instability than current translation 1. In order to have a metric that better represents the flickers during re-translation, we propose a new stability measure, called Normalized Click-N-Edit (NCNE). NCNE is computed (Eq. 8) using the Levenshtein distance (Levenshtein 1966), which counts the number of insertions/deletions/replacements to be performed on the current translation to match the previous translation. As shown in Table 1, NCNE gives a higher penalty to current translation 2, since it has a larger visual difference than current translation 1. The NCNE measure therefore aligns better with the textual-stability goal of re-translation based SNMT models. The metric is given as

$$\mathrm{NCNE} = \frac{1}{T} \sum_{i=2}^{S} \mathrm{levenshtein\_distance}(o_i, o_{i-1}), \qquad (8)$$

where $o_i$ and $o_{i-1}$ represent the current and previous translations.

Table 1: Example computations of Normalized Erasure and the proposed Normalized Click-N-Edit stability measures on the previous and current (partial) translations.

Label | (Partial) translations | NE | NCNE
prev | I live South Korea and | - | -
current 1 | I live in South Korea, and I am | 3 | 1
current 2 | I live in North Carolina and I am | 3 | 3
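As an illustration of Eq. 8, here is a small, self-contained Python sketch of the NCNE computation. The function names, the token-level distance, and the choice to normalize by the final translation length are our reading of the description rather than the authors' exact implementation.

```python
def levenshtein_distance(a, b):
    """Token-level edit distance (insertions, deletions, replacements)."""
    dp = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, tok_b in enumerate(b, start=1):
            cur = min(dp[j] + 1,                # deletion
                      dp[j - 1] + 1,            # insertion
                      prev + (tok_a != tok_b))  # replacement (0 if equal)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def ncne(partial_translations):
    """Normalized Click-N-Edit over a sequence of partial translations (Eq. 8).

    Assumes normalization by the length T of the final translation."""
    outputs = [o.split() for o in partial_translations]
    total_edits = sum(levenshtein_distance(cur, prev)
                      for prev, cur in zip(outputs, outputs[1:]))
    return total_edits / max(len(outputs[-1]), 1)

# Example usage on two successive partial translations:
print(ncne(["I live South Korea and",
            "I live in South Korea, and I am"]))
```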
We use three diverse MT language pairs to evaluate the proposed model: WMT'15 German-English (DeEn), IWSLT 2020 English-German (EnDe), and WMT'14 English-French (EnFr).

DeEn translation task: We use WMT15 German-to-English (4.5 million examples) as the training set and newstest2013 as the dev set. All results are reported on newstest2015.

EnDe translation task: For this task, we use the dataset composition given in IWSLT 2020. The training corpus consists of MuST-C, OpenSubtitles2018, and WMT19, with a total of 61 million examples. We choose the best system based on the MuST-C dev set and report results on the MuST-C tst-COMMON test set. The WMT19 dataset further consists of Europarl v9, ParaCrawl v3, Common Crawl, News Commentary v14, Wiki Titles v1, and Document-split Rapid for the German-English language pair. Due to the presence of noise in OpenSubtitles2018 and ParaCrawl, we only use 10 million randomly sampled examples from these corpora.

EnFr translation task: We use WMT14 EnFr (36.3 million examples) as the training set, newstest2012 + newstest2013 as the dev set, and newstest2014 as the test set. More details about the data statistics can be found in the Appendix.

Figure 3: Quality v/s latency plot of the FReTNA, ReTA, and Wait-k models for the EnFr language pair with different stability constraints.

We adopt an evaluation framework similar to Arivazhagan et al. (2019a), which includes metrics for quality and latency. Translation quality is measured by calculating the de-tokenized BLEU score using the sacrebleu script (Post 2018). Most latency metrics for simultaneous translation are based on a delay vector $g$, which measures how many source tokens were read before outputting the $t$-th target token. To address the re-translation scenario, where target content can change, we use content delay, similar to Arivazhagan et al. (2019a). The content delay measures the delay with respect to when the token at a particular position is finalized. For example, in Figure 1, the 3rd token appears as "may" at step 4; however, it is finalized at step 5 as "could". The delay vector $g$ is modified based on this content delay and used in Average Lagging (AL) (Ma et al. 2019) to compute the latency.
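A minimal sketch of how content delay and Average Lagging could be computed from a log of partial translations is given below. The exact conventions used here (a token position is finalized at the last step its value changes, and AL follows the definition of Ma et al. 2019) are our assumptions for illustration, not the authors' evaluation scripts.

```python
def content_delay(partial_outputs, src_read_counts):
    """For each position of the final translation, return how many source
    tokens had been read when that position last changed (was finalized)."""
    final = partial_outputs[-1].split()
    delays = []
    for pos in range(len(final)):
        finalized_step = 0
        for step, out in enumerate(partial_outputs):
            toks = out.split()
            if pos >= len(toks) or toks[pos] != final[pos]:
                finalized_step = step + 1  # still changing after this step
        delays.append(src_read_counts[min(finalized_step, len(src_read_counts) - 1)])
    return delays

def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (Ma et al. 2019) over a (content-)delay vector g."""
    rate = tgt_len / src_len
    # tau: first target position whose delay reaches the full source length.
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len), len(delays))
    return sum(delays[t - 1] - (t - 1) / rate for t in range(1, tau + 1)) / tau
```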
Table 2: Performance of offline models on the test sets of the DeEn, EnDe, and EnFr pairs, with greedy decoding and the maximum number of iterations set to nine for FReTNA.

Pair | ReTA | FReTNA
DeEn | 31.7 | 31.0
EnDe | 31.8 | 32.2
EnFr | 41.2 | 38.2

The proposed FReTNA, ReTA (Arivazhagan et al. 2020b), and Wait-k (Ma et al. 2019) models are implemented using the Fairseq framework (Ott et al. 2019). All the models use the Transformer as the base architecture, with settings similar to Arivazhagan et al. (2020a). The text sequences are processed using a word-piece vocabulary (Sennrich, Haddow, and Birch 2016). All the models are trained on 4 NVIDIA P40 GPUs for 300K steps with a batch size of 4096 tokens. ReTA is trained using prefix-augmented training data, and FReTNA uses a distilled training dataset as described in Kim and Rush (2016). The hyperparameter η described in Section 2.4 is set to a constant multiple of k, where k = 1, ..., K. The Appendix contains more details about the implementation and hyperparameter settings.

In this section, we report the results of our experiments conducted on the DeEn, EnDe, and EnFr language pairs. In order to evaluate our FReTNA system, we compare it with recent approaches in re-translation (ReTA, Arivazhagan et al. (2019a)) and streaming based systems (Wait-k, Ma et al. (2019)). Unlike traditional translation systems, where the aim is to achieve a higher BLEU score, simultaneous translation focuses on balancing the quality-latency and the time-latency trade-offs. Thus, we compare all three approaches on these two trade-offs: (1) quality v/s latency and (2) inference time v/s latency. The latency is measured by AL. The inference time is the amount of time taken to compute the output (normalized per sentence).

Quality v/s Latency: Figures 2 and 3 show the quality v/s latency trade-off for the DeEn, EnDe, and EnFr language pairs. We report results under both the NE and NCNE stability metrics (Section 2.5). The re-translation models have similar results under the NCNE and the looser NE constraints; however, under the stricter NE constraint the models show slightly inferior results, since it imposes a tighter stability requirement. The proposed FReTNA model performs slightly worse than the Wait-k model in the low latency range and better in the medium to high latency range for the DeEn and EnFr language pairs. For EnDe, our model performs better than the Wait-k model in all latency ranges. The slightly inferior performance of FReTNA relative to ReTA is attributed to the complexity of anticipating multiple target tokens simultaneously with limited source context. However, FReTNA slightly outperforms ReTA in the medium to high latency ranges for the EnDe language pair.

Inference Time v/s Latency: Figure 4 shows the inference time v/s latency plots for the DeEn, EnDe, and EnFr language pairs. Since our model generates all target tokens simultaneously, it has a much lower inference time than the ReTA and Wait-k models. Generally, streaming based simultaneous translation models such as Wait-k have lower inference time than re-translation based approaches such as ReTA, since the former append to the (partial) translation, whereas the latter sequentially regenerate the (partial) translation from scratch for every newly arrived piece of source information. Even though our FReTNA model is based on re-translation, it has a lower inference time than both the Wait-k and ReTA models, since we adopt a non-autoregressive model that generates all the tokens in the (partial) translation in parallel.

Figure 4: Inference time v/s latency plots of the FReTNA, ReTA, and Wait-k models for the DeEn, EnDe, and EnFr language pairs over different latency ranges.

For comparison, we also trained offline ReTA and FReTNA models for the three language pairs; the results are reported in Table 2. The BLEU scores of the SNMT and offline models of ReTA and FReTNA are comparable. Thus, we can conclude that our proposed FReTNA approach is better than ReTA and Wait-k in terms of inference time in all latency ranges, while maintaining the superior translation quality of re-translation over the streaming based approaches.

Impact of positional bias: We evaluate the FReTNA model with and without the positional bias introduced in Eq. 5 and 7 (FReTNA_pos vs. FReTNA_non_pos) to see whether the positional bias helps the model generate smoother translations. As shown in Figure 5, FReTNA_non_pos has more flickers than the FReTNA_pos model, since it is not able to cross the NCNE cutoff of 0.2 in the low latency range. The lower performance of FReTNA_non_pos in the low latency range is due to predicting more tokens (insertion policy) than required with less source information.
Later, when more source information is available, some of these tokens have to be deleted (deletion policy), causing more flickers in the final translation output. From Figure 5, we can see that the positional bias reduces flickers in the translation and is very useful in the low latency range.

Figure 5: Positional bias versus non-positional bias.

Sample translation process: In Table 3, we compare the process of generating the target sequence using the ReTA and FReTNA models. The examples are collected by running inference with the two models on the DeEn test set. ReTA generates the target tokens from scratch at every step in an autoregressive manner, which leads to a high inference time. On the other hand, our FReTNA model generates the target sequence in parallel by inserting/deleting multiple tokens at each step. We include only one example here due to space constraints; more examples can be found in the Appendix.

Table 3: Sample translation process of the ReTA and FReTNA models on the DeEn pair.
Input sequence: Berichten zufolge hofft Indien darüber hinaus auf einen Vertrag zur Verteidigungszusammenarbeit zwischen den beiden Nationen.

Step | ReTA Model | FReTNA Model
1 | India | Reports
2 | India is | India reportedly hopes
3 | India is also | India is hopes reportedly to hoping for one .
4 | India is also re | India is hopes reportedly to hoping for one defense treaty
5 | India is also reporte | India is reportedly reportedly hoping for one defense treaty between the two nations .
6 | India is also reportedly | India is also reportedly hoping for a defense treaty between the two nations .
7 | India is also reportedly hoping | -
8 | India is also reportedly hoping for | -
9 | India is also reportedly hoping for a | -
10 | India is also reportedly hoping for a treaty | -
11 | India is also reportedly hoping for a treaty on | -
12 | India is also reportedly hoping for a treaty on defense | -
13 | India is also reportedly hoping for a treaty on defense cooperation | -
14 | India is also reportedly hoping for a treaty on defense cooperation between the two nations. | -

Simultaneous Translation: Earlier works in streaming simultaneous translation, such as Cho and Esipova (2016), Gu et al. (2016), and Press and Smith (2018), lack the ability to anticipate words with missing source context. The Wait-k model introduced by Ma et al. (2019) brought many improvements by introducing a simultaneous translation module which can be easily integrated into most sequence-to-sequence models. Arivazhagan et al. (2019b) introduced MILk, which learns an adaptive schedule by using hierarchical attention and hence performs better on the latency-quality trade-off. Wait-k and MILk are both capable of anticipating words and achieving specified latency requirements.

Re-translation: Re-translation is a simultaneous translation task in which revisions to the partial translation beyond strictly appending tokens are permitted. Re-translation was originally investigated by Niehues et al. (2018b). More recently, Arivazhagan et al. (2020b) extended the re-translation strategy with prefix-augmented training and proposed a suitable evaluation framework to assess the performance of re-translation models. They establish re-translation to be as good as or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions.

Non-Autoregressive Models: Breaking the autoregressive constraint and the monotonic (left-to-right) decoding order of classic neural sequence generation systems has been widely investigated. Stern, Shazeer, and Uszkoreit (2018) and Wang, Zhang, and Chen (2018) design partially parallel decoding schemes which output multiple tokens at each step. Gu et al. (2017) propose a non-autoregressive framework which uses discrete latent variables, later adopted in Lee, Mansimov, and Cho (2018) as an iterative refinement process. Ghazvininejad et al. (2019) introduce the masked language modelling objective from BERT (Devlin et al. 2018) to non-autoregressively predict and refine translations. Welleck et al. (2019), Stern et al. (2019), and Gu, Liu, and Cho (2019) generate translations non-monotonically by adding words to the left or right of previous ones or by inserting words in arbitrary order to form a sequence. Gu, Wang, and Zhao (2019) propose a non-autoregressive Transformer model based on the Levenshtein distance to support insertions and deletions; it achieves better performance and decoding efficiency than previous non-autoregressive models by iteratively performing simultaneous insertion and deletion of multiple tokens.

We leverage these non-autoregressive language generation principles to build efficient re-translation systems with low inference time.
The existing re-translation models achieve better or comparable performance to streaming simultaneous translation models; however, high inference time remains a challenge. In this work, we propose a new approach for re-translation based simultaneous translation by leveraging non-autoregressive language generation. Specifically, we adopt the Levenshtein Transformer, since it is inherently trained to find corrections to an existing (partial) translation. We also propose a new stability metric which is more sensitive to flickers in the output stream. As observed from the experimental results, the proposed approach achieves comparable translation quality with significantly less computation time compared to previous autoregressive re-translation approaches.

References

Arivazhagan, N.; Cherry, C.; I, T.; Macherey, W.; Baljekar, P.; and Foster, G. 2019a. Re-Translation Strategies for Long Form, Simultaneous, Spoken Language Translation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7919–7923.

Arivazhagan, N.; Cherry, C.; Macherey, W.; Chiu, C.-C.; Yavuz, S.; Pang, R.; Li, W.; and Raffel, C. 2019b. Monotonic Infinite Lookback Attention for Simultaneous Machine Translation.

Arivazhagan, N.; Cherry, C.; Macherey, W.; and Foster, G. 2020a. Re-translation versus Streaming for Simultaneous Translation. In Proceedings of the 17th International Conference on Spoken Language Translation.

Cho, K.; and Esipova, M. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. arXiv preprint arXiv:1904.09324.

Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2017. Non-Autoregressive Neural Machine Translation. arXiv preprint arXiv:1711.02281.

Gu, J.; Liu, Q.; and Cho, K. 2019. Insertion-based Decoding with Automatically Inferred Generation Order. Transactions of the Association for Computational Linguistics 7: 661–676.

Gu, J.; Neubig, G.; Cho, K.; and Li, V. O. 2016. Learning to Translate in Real-time with Neural Machine Translation. arXiv preprint arXiv:1610.00388.

Gu, J.; Wang, C.; and Zhao, J. 2019. Levenshtein Transformer. In Advances in Neural Information Processing Systems, 11181–11191.

Kim, Y.; and Rush, A. M. 2016. Sequence-Level Knowledge Distillation. arXiv preprint arXiv:1606.07947.

Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv preprint arXiv:1802.06901.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady.

Ma, M.; Huang, L.; Xiong, H.; Zheng, R.; Liu, K.; Zheng, B.; Zhang, C.; He, Z.; Liu, H.; Li, X.; Wu, H.; and Wang, H. 2019. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Niehues, J.; Pham, N.-Q.; Ha, T.-L.; Sperber, M.; and Waibel, A. 2018a. Low-Latency Neural Speech Translation. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 1293–1297. ISCA. doi:10.21437/Interspeech.2018-1055. URL https://doi.org/10.21437/Interspeech.2018-1055.

Niehues, J.; Pham, N.-Q.; Ha, T.-L.; Sperber, M.; and Waibel, A. 2018b. Low-Latency Neural Speech Translation. In Proc. Interspeech 2018, 1293–1297. doi:10.21437/Interspeech.2018-1055. URL http://dx.doi.org/10.21437/Interspeech.2018-1055.

Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Post, M. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers.

Press, O.; and Smith, N. A. 2018. You May Not Need Attention. arXiv preprint arXiv:1810.13409.

Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Stern, M.; Chan, W.; Kiros, J.; and Uszkoreit, J. 2019. Insertion Transformer: Flexible Sequence Generation via Insertion Operations. arXiv preprint arXiv:1902.03249.

Stern, M.; Shazeer, N.; and Uszkoreit, J. 2018. Blockwise Parallel Decoding for Deep Autoregressive Models. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 10086–10095. Curran Associates, Inc. URL http://papers.nips.cc/paper/8212-blockwise-parallel-decoding-for-deep-autoregressive-models.pdf.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, C.; Zhang, J.; and Chen, H. 2018. Semi-Autoregressive Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Welleck, S.; Brantley, K.; Daumé III, H.; and Cho, K. 2019. Non-Monotonic Sequential Text Generation. arXiv preprint arXiv:1902.02192.