Improving Fluency of Non-Autoregressive Machine Translation
Zdeněk Kasner, Jindřich Libovický, and Jindřich Helcl

Charles University, Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics,
Malostranské náměstí 25, 118 00 Prague, Czech Republic
{kasner, libovicky, helcl}@ufal.mff.cuni.cz

Abstract
Non-autoregressive (nAR) models for machine translation (MT) manifest superior decoding speed when compared to autoregressive (AR) models, at the expense of impaired fluency of their outputs. We improve the fluency of an nAR model with connectionist temporal classification (CTC) by employing additional features in the scoring model used during beam search decoding. Since beam search decoding in our model only requires running the network in a single forward pass, the decoding speed is still notably higher than in standard AR models. We train models for three language pairs: German, Czech, and Romanian from and into English. The results show that our proposed models can be more efficient in terms of decoding speed and still achieve a competitive BLEU score relative to AR models.
One of the challenges that the research community faces today is improving the latency of neural machine translation (NMT) models. The decoders in modern NMT models operate autoregressively, which means that the target sentence is generated in steps from left to right (Bahdanau et al., 2015; Vaswani et al., 2017). In each step, a token is generated and supplied as the input for the next step.

Recently, nAR models for NMT tackled this issue by reformulating translation as sequence labeling. As long as the model and the data fit in GPU memory, all computation steps can be done in parallel (Gu et al., 2018; Lee et al., 2018; Libovický and Helcl, 2018; Ghazvininejad et al., 2019). However, such models suffer from less fluent outputs.

In phrase-based statistical machine translation (SMT; Koehn, 2009), translation fluency is handled by a language model component, which is responsible for arranging the phrases selected by the decoder into a coherent sentence. In AR NMT, there is no external language model. The decoder part of the neural model plays the role of a conditional language model, which estimates the probability of the translation given the source sentence signal as processed by the encoder part.

In automatic speech recognition (ASR), Graves and Jaitly (2014) proposed a beam search algorithm which combines an n-gram language model with scores from a model trained using CTC (Graves et al., 2006).

In this paper, we adopt and generalize this approach for nAR NMT by extending a CTC-based model by Libovický and Helcl (2018). We experiment with these models on six language pairs and find that the generalized decoding algorithm helps narrow the performance gap between the CTC-based and the standard AR models.

Non-autoregressive models for MT formulate the translation problem as sequence labeling: the states of the final decoder layer are independently labeled with target sentence tokens. These models can parallelize all steps of the computation and thus reduce the decoding time substantially. The nAR models were enabled by the invention of the self-attentive Transformer model (Vaswani et al., 2017), which allows arbitrary reordering of the states in each layer. Most nAR models need a prior estimate of the sentence length, either explicitly (Lee et al., 2018) or via a specialized fertility model (Gu et al., 2018), and rely on the attention mechanism for re-ordering.

We base our work on an alternative approach that does not depend on target length estimation. Instead, it constrains the upper bound of the target sentence length to the source sentence length multiplied by a fixed number k and uses CTC to compute the training loss (Libovický and Helcl, 2018).
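To make the latency difference concrete, the following is a minimal sketch contrasting the two decoding regimes; the `decoder_step` and `label_all_positions` functions are hypothetical stand-ins for the respective model calls, not the actual implementations used in this paper.

```python
# Hypothetical interfaces illustrating AR vs. nAR decoding; not the actual models.

def autoregressive_decode(decoder_step, bos_id, eos_id, max_len):
    """Generate tokens one at a time; each step depends on the previous output."""
    output = [bos_id]
    for _ in range(max_len):
        next_token = decoder_step(output)   # one full decoder pass per token
        output.append(next_token)
        if next_token == eos_id:
            break
    return output[1:]

def non_autoregressive_decode(label_all_positions, encoder_states):
    """Label every output position independently in a single forward pass."""
    return label_all_positions(encoder_states)  # all positions computed in parallel
```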
Algorithm 1: Beam Search Algorithm with CTC
 1: B ← {∅}                          ⊲ beam
 2: for step i = 1 . . . k · Tx do
 3:     H ← ∅                        ⊲ hypothesis → CTC score
 4:     W ← n-best tokens in step i
 5:     for hypothesis h ∈ B do
 6:         for token w ∈ W do
 7:             s ← Pi(w) · P(h)     ⊲ derivation score
 8:             H[h + w] ← H[h + w] + s
 9:     B ← select_nbest(H, n)
10: return B

The architecture consists of three components: encoder, state splitter, and decoder. The encoder is the same as in the Transformer model. The state splitter takes each state from the final encoder layer and projects it into k states of the original dimension, making the sequence k times longer. The decoder consists of additional Transformer layers which attend to both the encoder and the state splitter outputs.

CTC enables the model to generate variable-length sequences using a special blank symbol that is included in the vocabulary. The resulting training loss is a sum of the cross-entropy of all possible interleavings of the reference sequence with the blank symbols. Even though enumerating all the combinations is intractable, the cross-entropy sum can be efficiently computed using a dynamic-programming forward-backward algorithm.

Since each token can be decoded independently of the other tokens at inference time, the model reaches a significant speedup over the AR models. However, this speedup is achieved at the expense of translation quality, which manifests mostly in reduced fluency.
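As an illustration of the state splitter described above, here is a minimal NumPy sketch; the projection matrix `W_split` and the dimensions are illustrative assumptions, not the trained parameters of the model (in practice, the projection is a learned layer and the CTC loss is computed with a framework routine such as tf.nn.ctc_loss).

```python
import numpy as np

def split_states(encoder_states: np.ndarray, W_split: np.ndarray, k: int) -> np.ndarray:
    """Project each encoder state of dimension d into k states of dimension d.

    encoder_states: shape (T_x, d), the final encoder layer outputs
    W_split:        shape (d, k * d), a linear projection (random here, learned in practice)
    returns:        shape (k * T_x, d), a k-times longer sequence fed to the decoder
    """
    T_x, d = encoder_states.shape
    projected = encoder_states @ W_split      # (T_x, k * d)
    return projected.reshape(T_x * k, d)      # each state becomes k consecutive states

# Example with random weights (k = 3 as in the experiments):
rng = np.random.default_rng(0)
states = rng.normal(size=(10, 512))
W = rng.normal(size=(512, 3 * 512))
print(split_states(states, W, k=3).shape)     # (30, 512)
```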
We tackle the reduced fluency problem using beam search and employing additional features in its scoring model. Our approach is inspired by statistical MT and ASR.

Unlike greedy decoding, which can be performed in parallel by selecting the token with the highest probability in each step independently, beam search operates sequentially. However, the speedup gained from the parallelization is preserved because the output probability distributions are still conditionally independent and can thus be computed in a single pass through the network, as opposed to the AR models, which need to re-run the entire stack of decoder layers in every step.

The beam search algorithm for the CTC-based model (Graves and Jaitly, 2014) is shown in Algorithm 1. Unlike standard beam search in NMT, the algorithm needs to deal with the issue that a single hypothesis may have various derivations, depending on the positions of the blank symbols. The score of a single derivation is the product of the conditionally independent probabilities of the output tokens (line 7). The beam search score of a hypothesis is then the sum of the scores of its derivations formed in the current beam search step (line 8).
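The following is a simplified Python sketch of Algorithm 1. It merges derivations that emit a blank back into the same hypothesis but, like the pseudocode above, glosses over the full CTC prefix-search details (e.g., the merging rule for repeated tokens); the blank index and ranking by CTC score alone are simplifying assumptions.

```python
from collections import defaultdict
import numpy as np

BLANK = 0  # index of the CTC blank symbol (assumed to be 0 here)

def ctc_beam_search(probs: np.ndarray, beam_size: int, n_best_tokens: int):
    """Simplified beam search over conditionally independent CTC outputs.

    probs: array of shape (k * T_x, vocab), one independent distribution per
           output position (obtained in a single forward pass of the nAR model).
    Returns a list of (hypothesis, score) pairs; hypotheses contain no blanks.
    """
    beam = {(): 1.0}  # hypothesis (tuple of token ids) -> summed derivation score
    for step_probs in probs:
        candidates = defaultdict(float)
        top_tokens = np.argsort(step_probs)[-n_best_tokens:]   # line 4
        for hyp, hyp_score in beam.items():
            for w in top_tokens:
                s = step_probs[w] * hyp_score                   # derivation score, line 7
                # Emitting a blank leaves the hypothesis unchanged;
                # otherwise the token is appended.
                new_hyp = hyp if w == BLANK else hyp + (int(w),)
                candidates[new_hyp] += s                        # sum over derivations, line 8
        # Keep the n best hypotheses (line 9); here ranked by CTC score only,
        # the additional features of the scoring model would enter at this point.
        beam = dict(sorted(candidates.items(), key=lambda kv: kv[1],
                           reverse=True)[:beam_size])
    return sorted(beam.items(), key=lambda kv: kv[1], reverse=True)
```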
For selecting the n best hypotheses (line 9 in Algorithm 1), we employ a linear model to compute the score:

score = log P(y | x) + w · Φ(y)    (1)

where P(y | x) is the CTC score of the generated sentence y given a source sentence x, Φ is a feature function of y, and w is a trainable feature weight vector.

We use the structured perceptron for beam search to learn the feature weights (Huang et al., 2012). During training, we run the beam search algorithm, and if the reference translation falls off the beam, we apply the perceptron update rule:

w ← w + α (Φ(y) − Φ(ŷ))    (2)

where α is the learning rate, Φ(y) are the feature values of the prefix of the reference translation in the given time step, and Φ(ŷ) are the feature values of the highest-scoring hypothesis in the beam. Alternatively, we found that applying the perceptron update rule multiple times with all hypotheses that scored higher than the reference leads to faster convergence. In order to stabilize the training, we do not train the weight of the CTC score and set it to 1.

In the following paragraphs, we describe the features Φ used within our beam search algorithm.

Method                German WMT15        Romanian WMT16      Czech WMT18         Decoding
                      en→de    de→en      en→ro    ro→en      en→cs    cs→en      time [ms]
Non-autoregressive    21.67    25.57      19.88    28.99      16.27    17.63       233
Transformer, greedy   29.84    32.62      25.89    33.54      21.57    27.89      1664
Transformer, beam 5   30.23    33.43      26.46    34.06      22.20    28.49      3848
Ours, beam 1          22.68    26.44      19.74    29.65      16.98    18.78       337
Ours, beam 5          25.50    29.45      22.46    33.01      19.31    23.33       408
Ours, beam 10         25.93    30.05      23.33    33.29      19.47    23.95       526
Ours, beam 20         26.03    30.15      24.11    33.51      19.58    24.32      1097

Table 1: Quantitative results of the models in terms of BLEU score and average decoding times per sentence in milliseconds. Results on WMT14 English-German translation and results without back-translation are in the Appendix.
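A minimal sketch of the scoring model and the perceptron update in Equations (1) and (2) above; the feature extraction and the learning rate value are placeholders, not the trained settings.

```python
import numpy as np

def hypothesis_score(ctc_log_prob: float, features: np.ndarray, weights: np.ndarray) -> float:
    """Linear scoring model of Eq. (1); the weight of the CTC score is fixed to 1."""
    return ctc_log_prob + float(weights @ features)

def perceptron_update(weights, ref_features, best_hyp_features, alpha=0.1):
    """Eq. (2): move the weights towards the reference-prefix features
    and away from the highest-scoring (incorrect) hypothesis in the beam."""
    return weights + alpha * (np.asarray(ref_features) - np.asarray(best_hyp_features))
```

In the actual training loop, the update is applied only when the reference prefix falls off the beam, and optionally repeated for every hypothesis that outscored the reference.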
Language Model.
The main component improving the fluency is a language model (LM). For efficiency, we use an n-gram LM. Since the hypotheses contain blank symbols, the beam may consist of hypotheses of different lengths. Because shorter sequences are favored by the LM, we divide the log-probability of each hypothesis by its length in order to normalize the scores.
Blank/non-blank symbols. To guide the decoding towards sentences of the correct length, we compute the ratio of blank vs. non-blank symbols and penalize values that exceed a threshold: max(0, #blank/#non-blank − δ), where δ is a hyperparameter that thresholds the penalization for a too high blank/non-blank symbol ratio. Based on the distribution properties of the ratio, we use δ = 4.

Trailing blank symbols. We observed that the outputs produced by the CTC-based model tend to be too short. To prevent that, we count the trailing blank symbols that exceed the source length: max(0, #trailing blanks − source length).
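To make the feature set concrete, here is a sketch of the three feature functions as we read them from the descriptions above; the blank index and the `lm_log_prob` function are illustrative placeholders, and the exact definitions in the trained system may differ in detail.

```python
def features(hypothesis_tokens, source_length, lm_log_prob, blank_id=0, delta=4.0):
    """Feature vector Phi(y) for the beam-search scoring model (illustrative sketch).

    hypothesis_tokens: output positions including blank symbols
    lm_log_prob:       function returning the n-gram LM log-probability of a token list
    """
    non_blank = [t for t in hypothesis_tokens if t != blank_id]
    n_blank = len(hypothesis_tokens) - len(non_blank)

    # 1) length-normalized language model score
    lm_score = lm_log_prob(non_blank) / max(len(non_blank), 1)

    # 2) penalization of a too high blank/non-blank ratio
    ratio_penalty = max(0.0, n_blank / max(len(non_blank), 1) - delta)

    # 3) trailing blank symbols beyond the source length
    trailing = 0
    for t in reversed(hypothesis_tokens):
        if t != blank_id:
            break
        trailing += 1
    trailing_penalty = max(0.0, trailing - source_length)

    return [lm_score, ratio_penalty, trailing_penalty]
```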
We perform experiments on three language pairs in both directions: English-Romanian, English-German, and English-Czech.

For training the base NMT models, we use the WMT parallel data (http://statmt.org/wmt19/translation-task.html), which consist of 0.6M sentences for English-Romanian, 4.5M sentences for English-German, and 57M sentences for English-Czech. Further, we use the WMT monolingual data: 20M sentences for English, German, and Czech, and 2.2M sentences for Romanian, for training the LM and for back-translation.

We preprocess all data using SentencePiece (Kudo and Richardson, 2018; https://github.com/google/sentencepiece). We train the SentencePiece models with a vocabulary size of 50,000. We implement the proposed architecture using Neural Monkey (Helcl and Libovický, 2017; https://github.com/ufal/neuralmonkey). The parameters we used for the training are listed in Appendix A. We will release the code upon publication.

We used the AR baselines trained on the parallel data for generating back-translated synthetic training data (Sennrich et al., 2016). When training on back-translated data, the authentic parallel data are upsampled to match the size of the back-translated data. We thus train our final models using the mix of authentic and back-translated data, so both the AR baselines and the proposed models use the same amount of data for training. If we only used the parallel data for training the neural models and kept the monolingual data only for the language model, the proposed model would have benefited from having access to more data than the AR baselines.

We train a 5-gram KenLM model (Heafield, 2011) on the monolingual data tokenized using the same SentencePiece vocabulary as the parallel data.
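The preprocessing and LM training can be reproduced roughly as follows; the file names are placeholders and the exact options used in the experiments may differ from this sketch.

```python
import sentencepiece as spm
import kenlm

# Train a SentencePiece model with a 50k vocabulary (file names are placeholders).
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="sp", vocab_size=50000)

# Tokenize the monolingual data with the same model before training KenLM, e.g.:
#   lmplz -o 5 < mono.tok.txt > lm.arpa
sp = spm.SentencePieceProcessor(model_file="sp.model")
pieces = sp.encode("This is a sample sentence.", out_type=str)

# Query the 5-gram LM for the language-model feature.
lm = kenlm.Model("lm.arpa")
lm_log_prob = lm.score(" ".join(pieces), bos=True, eos=True)
print(lm_log_prob)
```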
Figure 1: Comparison of the CPU decoding time of the autoregressive (AR) and non-autoregressive (nAR) Transformer models and the proposed method with a beam size of 10.

For the perceptron training, we split the validation data for each language pair in halves and use one half as the training set and the second half as a held-out set. We use the score on the held-out set during the perceptron training as an early-stopping criterion. The scoring model is initialized with zero weights for all features and a fixed weight of 1 for the CTC score.
We evaluate our models on the standard WMT test sets that were previously used for the evaluation of nAR NMT. We use newstest2015 for English-German, newstest2016 for English-Romanian, and newstest2018 for English-Czech (Bojar et al., 2015, 2016, 2018). We compute the BLEU scores (Papineni et al., 2002) as implemented in SacreBLEU (Post, 2018; https://github.com/mjpost/sacreBLEU). We also measure the average decoding time for a single sentence.

Table 1 shows the measured quantitative results of the experiments. We observe that the beam search greatly improves the translation quality over the CTC-based nAR models ("Non-autoregressive" vs. "Ours"). Additionally, we have control over the speed/quality trade-off by either lowering or increasing the beam size. Increasing the beam size from 1 to 5 systematically increases the translation quality by approximately 3 BLEU points. Decoding with a beam size of 20 matches the quality of greedy autoregressive decoding while still maintaining a speedup.

Figure 1 plots the time required to translate a sentence with respect to its length. As expected, beam search decoding is more time-consuming than the CTC-based labeling (greedy). However, our method is still substantially faster than the AR model, especially for longer sentences.

Table 2 shows how the features used in the scoring model contribute to the BLEU score. We can see that combining the features is beneficial and that the improvement is substantial with larger beam sizes. The feature weights were trained separately for each beam size.

Table 2: BLEU scores for English-to-German translation for different beam sizes (1, 5, 10, 20) and feature sets: CTC score (c), language model (l), ratio of the blank symbols (r), and the number of trailing blank symbols (t). The rows correspond to the feature sets c + l + r + t, c + l + r, c + l, and c.

Our cursory manual evaluation indicates that the additional features help to tackle the most significant problems of nAR NMT, namely repeated or malformed words and too short sentences (see Appendix C for examples).

The earliest work on nAR translation includes Gu et al. (2018) and Lee et al. (2018), which are the closest to our model besides our baseline. Unlike our approach, they do not include state splitting. Gu et al. (2018) use a latent fertility model to copy a sequence of embeddings which is then used for the target sentence generation. Lee et al. (2018) use two decoders: the first decoder generates a candidate translation, which is then iteratively refined by the second decoder; the refinement can also be performed by a denoising auto-encoder or a masked LM (Ghazvininejad et al., 2019). Junczys-Dowmunt et al. (2018) exploit the autoregressive architectures (Bahdanau et al., 2015; Vaswani et al., 2017) and try to optimize the decoding speed; using model quantization and state memoization, they achieve a two-fold speedup.
We introduced an MT model with beam search that combines an nAR CTC-based NMT model with an n-gram LM and other features. We performed experiments on six language pairs and evaluated the models on the standard WMT test sets. Our approach narrows the quality gap between the nAR and AR models while still maintaining a substantial speedup.

The experiments show that the main benefit of the proposed approach is the opportunity to balance the trade-off between translation quality and translation speed. The autoregressive models are still superior in translation quality for most of the language pairs, even though by a narrow margin. In contrast, the non-autoregressive models are very fast, but often lack in translation quality. Our approach enhances the constant-time neural network run with a fast beam search utilizing a scoring model to improve the translation quality. By altering the beam size, we can adjust the speed/quality ratio to achieve acceptable results both in terms of speed and translation quality.
Acknowledgements
This research has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825303 (Bergamot), Czech Science Foundation grant No. 19-26934X (NEUREM3), and Charles University grant No. 976518, and has been using language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2015071). This research was partially supported by SVV project number 260 453.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation (WMT16). In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, pages 272–307, Brussels, Belgium. Association for Computational Linguistics.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6111–6120, Hong Kong, China. Association for Computational Linguistics.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, Pittsburgh, PA, USA. ACM.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, Beijing, China. PMLR.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, United Kingdom. Association for Computational Linguistics.

Jindřich Helcl and Jindřich Libovický. 2017. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, (107):5–17.

Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151, Montréal, Canada. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 129–135, Melbourne, Australia. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press, Cambridge, UK.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Brussels, Belgium. Association for Computational Linguistics.

Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021, Brussels, Belgium. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation, Volume 1: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.
A Appendix: Parameters
The autoregressive baseline models use roughly the same set of hyperparameters as the Transformer base model (Vaswani et al., 2017). The encoder and the decoder have 6 layers each, the model dimension is 512, and the dimension of the feed-forward layer is 2,048. We use 16 attention heads in both the self-attention and the encoder-decoder attention. During training, we use label smoothing of 0.1 and a dropout rate of 0.1. We use the Adam optimizer (Kingma and Ba, 2015) with fixed β1, β2, and ε parameters and a fixed learning rate. Due to GPU memory limitations, we use batches of 20 sentences each, but we accumulate the gradients and only update the model parameters every 10 steps. (This gives our batches an effective size of 200 sentences.)

The hyperparameters of the CTC-based models were selected to be as comparable as possible to the autoregressive models, with the following exceptions. The splitting factor between the encoder and the decoder was set to k = 3, following the setup of Libovický and Helcl (2018). We lowered the number of attention heads between the encoder and the decoder to 8 instead of 16; we changed this hyperparameter because it led to better results in preliminary experiments. For training, instead of batching by a fixed number of sentences, we use batches with a maximum size of 400 tokens. We use the same delayed update interval of 10 steps per update.

B Appendix: Additional Results
Quantitative results without the use of back-translation, i.e., when the monolingual data are used only for training the target-side language model, are shown in Table 4. Quantitative results on the WMT14 English-German data for comparison with related work are presented in Table 3.

Method                German WMT14
                      en→de    de→en
Non-autoregressive    19.55    23.04
Transformer, greedy   27.29    31.06
Transformer, beam 5   27.71    31.85
Ours, beam 1          20.59    24.11
Ours, beam 5          23.61    27.19
Ours, beam 10         24.27    27.83
Ours, beam 20         24.41    28.14

Table 3: Quantitative results of the models in terms of BLEU on the WMT14 data.
C Appendix: Examples
We include a few selected examples from the English-to-German (Table 5), German-to-English (Table 6), and Czech-to-English (Table 7) system outputs.

Method                German WMT15        Romanian WMT16      Czech WMT18         Decoding
                      en→de    de→en      en→ro    ro→en      en→cs    cs→en      time [ms]
Non-autoregressive    19.71    21.64      18.45    25.48      13.92    14.87       314
Transformer, greedy   26.39    28.56      19.91    27.33      16.00    22.72      1637
Transformer, beam 5   26.99    29.39      20.81    27.99      17.08    23.54      4093
Ours, beam 1          20.81    22.68      18.45    26.52      14.86    16.11       326
Ours, beam 5          23.29    25.96      20.88    29.67      17.16    20.87       398
Ours, beam 10         23.99    26.19      21.52    29.88      17.20    21.52       518
Ours, beam 20         24.01    26.59      22.02    29.94      17.24    21.87      1162

Table 4: Quantitative results in terms of BLEU without the use of back-translation.

Source: On account of their innate aggressiveness, songs of that sort were no longer played on the console.
nAR: Aufgrund ihrergeboren Aggressivitätivität wurden Lieder dieser Art nicht mehr auf der Konsole gespielt.
→ Two unrelated words are connected (red), malformed word with repeated subwords (blue).
nAR + LM: Aufgrund ihrer angeborenen Aggressivität wurden Lieder dieser Art nicht mehr auf der Konsole gespielt
→ Correct but too literal adjective was chosen.
AR: Aufgrund ihrer angeborenen Aggressivität wurden Songs dieser Art nicht mehr auf der Konsole gespielt.
Reference: Aufgrund ihrer ureigenen Aggressivität wurden Songs dieser Art nicht mehr auf der Konsole gespielt.

Source: Ailinn didn't understand.
nAR: A hat nicht. → Fail to copy infrequent proper name.
nAR + LM: Aili hat nicht verstanden. → Non-LM features ensured more text is copied, but still incorrect.
AR: Ailinn verstand es nicht. → Correct.
Reference: Ailinn verstand das nicht.

Source: Further trails are signposted, which lead up towards Hochrhön and offer an extensive hike.
nAR: Weitere Wege sindschilder, die nach Hochrhön und eine ausgedehnte Wanderung.
→ Two unrelated words are connected (red), missing verb in the second clause (blue).
nAR + LM: Weitere Wege sind ausgeschilder, die in Hochrhön und eine ausgedehnte Wanderung.
→ Connected words got corrected; the second clause (blue) still does not make sense.
AR: Weitere Wege sind ausgeschildert, die in Richtung Hochrhön führen und eine ausgedehnte Wanderung bieten. → Correct.
Reference: Weitere Wege sind ausgeschildert, die Richtung Hochrhön hinaufsteigen und zu einer ausgedehnten Wanderung einladen.
Table 5: Manually selected examples of system outputs for English-to-German translation containing the most frequent error types. "nAR" is the purely non-autoregressive system, "nAR + LM" is the proposed system with beam size 20.
Source: Aber diese Selbstzufriedenheit ist unangebracht.
nAR: But comp complacency is misguided.
nAR + LM: But complacency is misguided.
AR: But this complacency is inappropriate.
Reference: But such complacency is misplaced.

Source: Als ich also sehr, sehr übergewichtig wurde und Symptome von Diabetes zeigte, sagte mein Arzt "Sie müssen radikal sein.
nAR: So when I very, very overweight and and showed symptoms of diabetes, my my doctor said "You must be radical.
nAR + LM: So when I became very, very overweight and showed symptoms of diabetes, my doctor said "You must be radical.
AR: So when I was very, very overweight and showed symptoms of diabetes, my doctor said "You must be radical.
Reference: So when I became very, very overweight and started getting diabetic symptoms, my doctor said, 'You've got to be radical.
Table 6: German-to-English examples.

Source: Problémem mohou být také jednorázové pleny.
nAR: Singleaperslso be problem.
nAR + LM: One can diapers be the problem.
AR: Single diapers may also be the problem.
Reference: Disposable incontinence pants may also be a problem.

Source: Pere se ve mně adolescentní potřeba uchechtnout se s obdivem nad tím, s jakým vážným tónem je mi výklad podáván.
nAR: I adolescent need tohuck with admiration the serious tone my interpret.
nAR + LM: I have a adolescent need to chuck with wonderation of the serious tone my interpret.
AR: I'm asking for an adolescent need to laugh at the admiration of the serious tone of my interpretation.
Reference: I feel the adolescent need to chuckle with admiration for the serious tone with which my comment is handled.

Table 7: Czech-to-English examples.