A Comparison of Approaches to Document-level Machine Translation
Zhiyi Ma Sergey Edunov Michael Auli
Facebook AI Research
{mazhiyi, edunov, michaelauli}@fb.com

Abstract
Document-level machine translation conditions on surrounding sentences to produce coherent translations. There has been much recent work in this area with the introduction of custom model architectures and decoding algorithms. This paper presents a systematic comparison of selected approaches from the literature on two benchmarks for which document-level phenomena evaluation suites exist. We find that a simple method based purely on back-translating monolingual document-level data performs as well as much more elaborate alternatives, both in terms of document-level metrics as well as human evaluation.
Introduction

Machine translation has made a lot of progress with the invention of better model architectures (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) and data augmentation techniques (Sennrich et al., 2016a; Edunov et al., 2018). This has been followed by reports of model performance approaching or exceeding human-level accuracy (Wu et al., 2016; Hassan et al., 2018). However, these claims usually only hold when models are evaluated on the sentence level and have been disproven when the same translations are evaluated on the document level (Toral et al., 2018).

Neural networks have made it easier to incorporate more context into models compared to engineered features. For example, modern language models are able to exploit long-range context, obtaining better perplexity by conditioning predictions on entire Wikipedia articles instead of the limited context provided by individual sentences (Merity et al., 2016; Baevski and Auli, 2019; Dai et al., 2019).

For machine translation, there has been a lot of interest in document-level translation, resulting in custom architectures (Jean et al., 2015, 2017; Zhang et al., 2018), data augmentation techniques to address the scarcity of parallel data with document boundaries (Junczys-Dowmunt, 2019; Voita et al., 2019a), and better decoding algorithms to capture document context with language models (Yu et al., 2020).

In this paper, we present a comparison of various approaches and evaluate their performance in a common setup in terms of BLEU, document-level consistency metrics, as well as human judgments.
Our work complements another recent comparison study (Lopes et al., 2020) by focusing on methods that leverage additional monolingual data.

Experimental results show that a simple baseline trained only on back-translated document-level data can perform very competitively compared to both DocRepair (Voita et al., 2019a) and neural noisy channel modeling (Yu et al., 2020), two recently introduced document-level approaches leveraging monolingual data in a much more elaborate and compute-intensive way.
Related Work

There has been a lot of work on custom model architectures to integrate document context into translation models. Most work focuses on improving context representations, such as context-aware encoders (Voita et al., 2018; Zhang et al., 2018), context-aware decoders (Voita et al., 2019b), and hierarchical history representations (Wang et al., 2017; Miculicich et al., 2018), as well as the application of memory networks (Maruf and Haffari, 2018). Pretraining has also been shown to be effective for document-level translation (Liu et al., 2020).

Since document-level bitext is scarce, there have been several studies on applying data augmentation methods using monolingual document-level data. Adding monolingual data can be effective, either by creating synthetic bitext using back-translation (Junczys-Dowmunt, 2019) or by learning to correct inconsistencies on the document level using round-trip translations (Voita et al., 2019a). Noisy channel modeling (Yu et al., 2017; Yee et al., 2019) has also been applied to make use of language models trained on monolingual documents to capture cross-sentence context (Yu et al., 2020).

Evaluation of document-level machine translation is also an active area of research. Scherrer et al. (2019) studied the effect on translation quality of manipulating discourse-level properties in data, and document-level consistency metrics on test sets (Voita et al., 2019b; Müller et al., 2018) have been promoted as a good indicator of document-level translation quality.

Lopes et al. (2020) present a systematic study of document-level translation methods which focuses on model architectures. We complement their work by also studying methods which leverage additional monolingual document-level data.
Approaches

Next, we outline a number of approaches to document-level translation as well as a few simple baselines. Table 1 provides an overview and shows the amount of context modeled by each method.
Sentence-level baseline (Sent).
This is a standard sequence to sequence model trained on pairs of individual sentences; this approach does not model any document-level context.
Back-translation (SentBT).
To improve the sentence-level baseline (Sent) we consider back-translation as a stronger baseline. Back-translation uses additional target-side monolingual data to improve machine translation models (Bojar and Tamchyna, 2011; Sennrich et al., 2016b; Edunov et al., 2018). This is done by generating a synthetic source for the target monolingual data via a model trained to translate from the target language to the source language. We denote training with just synthetic sentence-level data as SentBT, and including true bitext sentence-level data as Sent + SentBT.
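The back-translation step above can be sketched as follows; `reverse_model` stands in for a trained target-to-source translation system, and the function names are illustrative rather than taken from the paper:

```python
# Sketch of back-translation (SentBT): pair each monolingual target sentence
# with a synthetic source produced by a target->source model.
def back_translate(target_sentences, reverse_model):
    """Create synthetic (source, target) pairs from target-side monolingual data."""
    pairs = []
    for y in target_sentences:
        x_synthetic = reverse_model(y)  # translate target -> source
        pairs.append((x_synthetic, y))  # synthetic source, real target
    return pairs

# Toy stand-in for a real ru->en model: just reverses the token order.
toy_reverse = lambda s: " ".join(reversed(s.split()))
pairs = back_translate(["привет мир", "как дела"], toy_reverse)
```

The resulting pairs are then mixed with the true bitext for training the forward model.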
Doc2Sent.
In this setup, the model takes as input both a document and a mask specifying which source sentence is to be translated. It then translates only that specific sentence. To translate the next source sentence, the mask is changed, and so forth. The mask is implemented as a learnable embedding representing binary values for each token and is added to the encoder output. Doc2Sent models the full source document, but no context beyond the current target sentence prefix is modeled on the target side. This setup enables us to understand the importance of having target-side document context.
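A minimal sketch of the mask mechanism, using plain Python lists in place of tensors (the function and variable names are our own, not from the paper):

```python
# Sketch of the Doc2Sent sentence mask: a learned embedding for each binary
# mask value is added to the corresponding encoder output vector.
def add_sentence_mask(encoder_out, mask, mask_emb):
    """encoder_out: list of vectors; mask: 0/1 per position; mask_emb: 2 x dim table."""
    return [[h + e for h, e in zip(vec, mask_emb[m])]
            for vec, m in zip(encoder_out, mask)]

# Mark the second position as belonging to the sentence to be translated.
out = add_sentence_mask([[1.0], [2.0]], [0, 1], [[0.0], [10.0]])
```

In a real model the `mask_emb` table would be trained jointly with the rest of the network.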
Window2Window.
This models a sliding window of a fixed number of sentences N, both in the source and the target. The window is adjusted after a source sentence is translated and we concatenate the last generated sentence to form the final generation, i.e., the previous N − 1 sentences are treated as context only. In our experiments we set N = 2.
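The sliding-window decoding loop can be sketched as follows; `translate` stands in for a model call that returns one output sentence per source sentence in the window (an assumption for illustration):

```python
# Sketch of Window2Window decoding with a window of n sentences: translate
# the current source sentence together with its preceding context, but keep
# only the newly generated sentence.
def window2window(source_sents, translate, n=2):
    outputs = []
    for i in range(len(source_sents)):
        src_window = source_sents[max(0, i - n + 1): i + 1]
        tgt_context = outputs[max(0, i - n + 1): i]  # previously generated sentences
        hyp = translate(src_window, tgt_context)     # list of output sentences
        outputs.append(hyp[-1])                      # keep only the last sentence
    return outputs

# Toy stand-in for the translation model: uppercases each window sentence.
toy = lambda src_window, tgt_context: [s.upper() for s in src_window]
outputs = window2window(["a b", "c d", "e f"], toy)
```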
Doc2Doc.
Tiedemann and Scherrer (2017) and then Junczys-Dowmunt (2019) proposed a simple but very effective document-level translation approach by training a standard sequence to sequence model on pairs of bitext documents. Documents are split into examples of no more than 1,000 subword units and sentences are separated with a special token. This makes the sometimes incorrect assumption that the number of sentences in parallel documents is the same. There is no alignment of sentences when the document is split into smaller chunks, but Junczys-Dowmunt (2019) finds that the correct number of target sentences is often predicted at inference time. The method outperforms very strong sentence-level systems in human evaluation. Doc2Doc models the full source context as well as the target document prefix.
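The example construction for Doc2Doc can be sketched as follows; the separator token name is an assumption, and sentences are assumed to already be segmented into subword tokens:

```python
# Sketch of Doc2Doc example construction: concatenate consecutive sentences
# with a special separator token, capping each example at max_units subwords.
SEP = "<sep>"

def make_doc_examples(subword_sents, max_units=1000):
    examples, current = [], []
    for sent in subword_sents:  # each sent is a list of subword tokens
        if current and len(current) + 1 + len(sent) > max_units:
            examples.append(current)  # flush: adding this sentence would overflow
            current = []
        if current:
            current.append(SEP)
        current.extend(sent)
    if current:
        examples.append(current)
    return examples
```

Source and target documents are chunked independently, which is why the sentence counts on the two sides are not guaranteed to match.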
Back-translation (DocBT).
We can also back-translate monolingual document-level data in the target language with a sentence-level system (Junczys-Dowmunt, 2019). The resulting source sentences are simply concatenated into a source document and the source and target document pair is treated as a training example. We denote training with just synthetic data as DocBT, and including true bitext document-level data as Doc2Doc + DocBT.

Method           | source context | target context
Sent             | sent           | sent
SentBT           | sent           | sent
NoisyChannelSent | sent           | sent
Doc2Sent         | doc            | sent
Window2Window    | left doc       | left doc
Doc2Doc          | doc            | left doc
DocRepair        | sent           | doc
DocBT            | doc            | left doc
NoisyChannelDoc  | sent/doc       | left doc
Table 1: Overview of approaches and the context modeled by each. We compare pure sentence-level methods and document-level techniques, both with and without data augmentation to leverage monolingual data.
DocRepair.
Voita et al. (2019a) proposed a post-editing method to fix inconsistencies between sentence-level translations using target-side document-level monolingual data. This method requires sentence-level translation models for both the forward and backward directions. At training time, monolingual target documents are translated to the source and then back to the target using the sentence-level systems. This yields groups of round-trip translated sentences from the original monolingual document-level data which may contain inconsistencies with respect to each other. Another model is trained to map these inconsistent groups of sentences to the original consistent documents (the DocRepair model). At test time, source documents are first translated into inconsistent target sentences using a sentence-level model. This is followed by the DocRepair model mapping the inconsistent target sentence translations to a coherent document. Since the first step is performed by a sentence-level system, this method does not use any source document context; however, DocRepair has access to the full context on the target side.
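The DocRepair training-data construction can be sketched as follows; `backward` and `forward` stand in for the two sentence-level translation models (the names are ours):

```python
# Sketch of DocRepair training data: round-trip each sentence of a
# monolingual target document through backward (tgt->src) and forward
# (src->tgt) sentence-level models; the repair model learns to map the
# possibly inconsistent round-trip output back to the original document.
def docrepair_example(doc_sents, backward, forward):
    noisy = [forward(backward(y)) for y in doc_sents]  # round-trip per sentence
    return noisy, doc_sents  # (input, target) pair for the repair model

# Toy stand-in for round-tripping: uppercase then lowercase.
example = docrepair_example(["Foo", "Bar"], str.upper, str.lower)
```

Real round-trip translations introduce cross-sentence inconsistencies (e.g. differing named-entity renderings) that this toy version cannot show.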
NoisyChannelSent.
By leveraging Bayes' rule, this approach models a mapping from the source x to the target y, i.e.,

p(y|x) = p(x|y) p(y) / p(x) ∝ p(x|y) p(y)

where p(x|y) and p(y) are referred to as the channel model and language model, respectively. Standard sequence to sequence models directly parameterize p(y|x), which is referred to as the direct model. As another baseline, we train the channel model and the language model on sentence-level data (Yu et al., 2017; Yee et al., 2019) and rerank the n-best list output of a sentence-level direct model. For reranking, we choose the hypothesis which maximizes (1/t) log p(y|x) + (λ/s) (log p(x|y) + log p(y)), where t is the target length, s is the source length and λ is a tunable weight (Yee et al., 2019).
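The sentence-level reranking score can be sketched as follows; the hypothesis dictionary keys are illustrative names for the three model log-probabilities, not an actual API:

```python
# Sketch of noisy channel n-best reranking: pick the hypothesis maximizing
# (1/t) log p(y|x) + (lam/s) * (log p(x|y) + log p(y)).
def rerank(nbest, src_len, lam):
    best, best_score = None, float("-inf")
    for hyp in nbest:  # hyp: dict with target tokens and model log-probs
        t = len(hyp["tokens"])
        score = hyp["log_direct"] / t + (lam / src_len) * (
            hyp["log_channel"] + hyp["log_lm"]
        )
        if score > best_score:
            best, best_score = hyp, score
    return best

nbest = [
    {"tokens": ["a", "b"], "log_direct": -2.0, "log_channel": -1.0, "log_lm": -1.0},
    {"tokens": ["a", "b"], "log_direct": -4.0, "log_channel": -0.5, "log_lm": -0.5},
]
best = rerank(nbest, src_len=2, lam=1.0)
```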
NoisyChannelDoc.
Yu et al. (2020) proposed a context-aware noisy channel reranking method using document-level language models. First, n-best lists for every source sentence are generated, either with a sentence-level direct model or a document-level direct model. These n-best lists are then reranked with a beam search that applies the document-level language model as well as a sentence-level channel model. The beam search maximizes

λ1 log q(y≤i | x≤i) + λ2 log p(x≤i | y≤i) + log p(y≤i) + λ3 |y≤i|   (1)

where x≤i and y≤i are partial source and target documents and |·| denotes the total number of tokens; λ1, λ2 and λ3 are hyper-parameters to be optimized. Depending on the direct model which performed the initial translation, this approach uses either the entire source context or the current source sentence. The target document-level language model uses the entire target prefix.

Experimental Setup

We perform experiments on two benchmarks, OpenSubtitles2018 English-Russian (en-ru) and WMT17 English-German (en-de). Next we outline these benchmarks, the evaluation protocol, as well as the model setup.
OpenSubtitles2018 English-Russian (en-ru).
For this dataset we follow the setup of Voita et al. (2018) and Voita et al. (2019a). The corpus was originally derived from the publicly available OpenSubtitles2018 (Lison et al., 2019) English and Russian data, and consists of three parts: 6M sentence-level bitext examples, 1.5M bitext documents and 30M monolingual Russian documents. The document-level parallel corpus comprises 1.5M examples of four sentences each (denoted as 1.5M_d). Examples are based on a sliding window over 2M unique sentences, where the sentences are a subset of the 6M sentence-level bitext. The monolingual data consists of 30M documents, each consisting of four consecutive Russian sentences (denoted as 30M_md); when split into sentences this monolingual data comprises 120M examples (denoted as 120M_m).

The validation and test sets contain 10k documents of four sentences each, constructed similarly to the training data but held out to avoid overlap. There is no overlap between the training bitext or monolingual data and the validation and test sets on the document level.

We use byte pair encoding (BPE; Sennrich et al. 2016c) with 24k merges to segment words into subword units for English and Russian, respectively. The dataset is already tokenized and we compute tokenized case-insensitive BLEU on the document level. BLEU is computed with the Moses multi-bleu.perl script (Koehn et al., 2007).
WMT17 English-German (en-de).
For this benchmark, we follow the setup of Müller et al. (2018) whose training data includes the Europarl, Common Crawl, News Commentary and Rapid corpora, totaling nearly 6M sentence pairs. As monolingual data we use Newscrawl 2017 in German, which has document boundaries. It contains 3.6M documents or 73M sentences (denoted as 73M_m). We preprocess the data by normalizing punctuation, removing non-printable characters and applying Moses tokenization (Koehn et al., 2007). We use BPE with 32k merges shared between English and German; shared vocabularies for the two languages performed best in initial experiments. (The English-Russian dataset is available at https://github.com/lena-voita/good-translation-wrong-in-context.) To be consistent with the English-Russian document-level setup, we split each document of the monolingual data into separate examples of up to four sentences or a maximum of 1,000 tokens. This results in a total of 19.5M documents (denoted as 19.5M_md).

The bitext training data does not provide explicit document boundaries, but most of the sentences are ordered as in the original documents. We consider a setup where we simply split the bitext training data into pseudo-documents following the same strategy as for the monolingual data. This results in 1.3M documents of up to four sentences or 1k tokens each (denoted as 1.3M_d).

We use newstest2016 for validation, containing 155 documents with 2999 sentences. As test sets we use newstest2017, newstest2018 and newstest2019, which contain 130 documents with 3004 sentences, 122 documents with 2998 sentences and 123 documents with 1997 sentences, respectively. We split the data into separate examples, similar to the monolingual data, leading to 811 documents for the validation data, and 796, 799 and 549 documents for each test set, respectively.
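The document-splitting rule used above (at most four sentences or 1,000 tokens per example) can be sketched as:

```python
# Sketch of splitting a document into examples of at most max_sents
# sentences or max_tokens tokens, whichever limit is hit first.
def split_document(sents, max_sents=4, max_tokens=1000):
    examples, current, n_tok = [], [], 0
    for sent in sents:  # each sent is a list of tokens
        if current and (len(current) == max_sents or n_tok + len(sent) > max_tokens):
            examples.append(current)
            current, n_tok = [], 0
        current.append(sent)
        n_tok += len(sent)
    if current:
        examples.append(current)
    return examples
```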
We evaluate with detokenized BLEU (Post, 2018). In order to capture translation quality on the document level, we consider two consistency evaluation sets which are available for the two benchmarks we consider. Both evaluation suites require distinguishing sentences which are consistent with the provided context, typically several preceding sentences. This is done by simply choosing the sentence which obtains the highest model score according to the translation model when scoring the possible translations provided by the challenge set.
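The scoring procedure described above can be sketched as follows; the convention that the consistent candidate comes first is an assumption for this sketch, and `score` stands in for the model's log-probability of a translation:

```python
# Sketch of consistency-suite scoring: an example counts as correct if the
# model assigns the highest score to the context-consistent candidate.
def consistency_accuracy(examples, score):
    correct = 0
    for src, candidates in examples:  # candidates[0] is the consistent one
        scores = [score(src, c) for c in candidates]
        if scores.index(max(scores)) == 0:
            correct += 1
    return correct / len(examples)

# Toy scorer preferring shorter candidates.
toy_score = lambda src, cand: -len(cand)
acc = consistency_accuracy([("x", ["ab", "abcd"]), ("y", ["abcd", "ab"])], toy_score)
```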
Discourse phenomena for English-Russian.
Voita et al. (2019b) evaluate four discourse phenomena for English-Russian, namely deixis, lexical cohesion, ellipsis inflection (infl.) and ellipsis verb phrase (VP) prediction. We provide a short overview of this test set and refer the reader to the original paper for further details. The deixis evaluation requires discriminating between formal and informal Russian translations of the English you depending on the context. Lexical cohesion focuses on consistent translation of named entities across the entire document. Ellipsis evaluates whether the translation model can disambiguate elliptical structures. Ellipsis inflection (infl.) evaluates whether models can correctly predict the morphological form of a noun group which can only be understood from context beyond that sentence. Ellipsis VP tests for the correct translation of a verb phrase that does not exist in Russian. The total number of examples in each test set is 3000, 2000, 500 and 500 for deixis, lexical cohesion, ellipsis infl. and ellipsis VP, respectively. All examples are four sentences long.
Pronoun translation for English-German.
Müller et al. (2018) present a large-scale contrastive test set for pronoun translation in English-German which requires document-level context. It tests the ability to identify the correct German translation of the English pronoun it as either es, sie or er. The evaluation set contains 12k examples with 4k for each pronoun. The number of context sentences is customizable, and for 80% of test examples, document-level context is required to produce the correct translation. For sentence-level models, we use no context, and for document-level models, we use the number of available sentences in our documents, which is typically four for WMT17 en-de.

Human evaluation.
Human evaluations were performed by certified professional translators who are native speakers of the target language as well as fluent in the source language. All assessments are conducted on the document level, using exactly the same data as used for document-level models, as described in Section 4.1. To compare multiple systems in English-Russian we use source-based direct assessment. Raters evaluate correctness and completeness on a scale of 1-100 for each translation given a source document. This evaluation has the benefit of being independent of the provided human references which may affect the evaluation.

We collected three judgements per translation. If any two raters disagree by more than 30 points, we discard the result and request reevaluation of the translation. Evaluation was blind and randomized: human raters did not know the identity of the systems and all outputs were shuffled to ensure that each rater provides a similar number of judgements for each system.

Following the WMT shared task evaluation protocol (Bojar et al., 2018), we normalize the scores of each rater by the mean and standard deviation of all ratings provided by the rater.
We remove raters who have rated fewer than 10 translations in total. Next, we average the normalized ratings for each sentence and average all per-translation scores to produce an aggregate per-system z-score. We randomly sampled 200 examples from the standard test set and 100 examples from the consistency test set (25 from each discourse phenomenon), and conducted human evaluation for the two sets independently.

We confirm our findings on English-German, for which we did a system comparison study to directly compare a few select systems. Human annotators were presented with a source document and two candidate translations and were asked to judge which translation is better. For each translation, we collect three judgements and determine human preference based on the system which is preferred by the majority of raters.
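The per-rater normalization described above can be sketched as follows (a simplified version that skips the disagreement filtering and per-sentence averaging):

```python
# Sketch of rater z-score normalization: normalize each rating by the
# rater's own mean and standard deviation, then average per system.
from statistics import mean, pstdev

def system_z_scores(ratings):
    """ratings: list of (rater, system, score) tuples."""
    by_rater = {}
    for rater, _, score in ratings:
        by_rater.setdefault(rater, []).append(score)
    # Per-rater mean and std; guard against zero std for constant raters.
    stats = {r: (mean(s), pstdev(s) or 1.0) for r, s in by_rater.items()}
    by_system = {}
    for rater, system, score in ratings:
        mu, sd = stats[rater]
        by_system.setdefault(system, []).append((score - mu) / sd)
    return {sys_name: mean(z) for sys_name, z in by_system.items()}

zs = system_z_scores([("r1", "A", 80), ("r1", "B", 60)])
```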
Models are implemented in fairseq (Ott et al., 2019). We use the Adam optimizer (Kingma and Ba, 2015) and the learning rate schedule described in Vaswani et al. (2017) with 4,000 warmup steps.
OpenSubtitles2018 English-Russian.
We use transformer base models with dropout 0.3, train for 300k updates on 8 GPUs and tune the batch size on the validation set in the range of 128k to 512k tokens. We use early stopping when the validation loss stops improving and apply checkpoint averaging over the last 5 checkpoints. For generation, we use beam search of width 4, following Voita et al. (2019a), and tune the length penalty on the validation data.
WMT17 English-German.
We train transformer big models for 300k updates on 32 GPUs with a batch size of 262k tokens, and early stop based on the validation loss. We use the checkpoint with the best validation loss without averaging. For generation we use a beam width of 5 and tune the length penalty on the same set of values as English-Russian.
Language model.
We use a transformer big decoder-only model (Baevski and Auli, 2019), with 12 decoder layers, dropout 0.1, embedding dimension 512, and without layer normalization (Ba et al., 2016) after the last decoder block. We use a cosine learning rate scheduler where the learning rate is increased linearly over 16k warmup steps (Loshchilov and Hutter, 2016). We tune the number of updates in the range [316k, 616k, 916k], use the best checkpoint according to the validation loss, and train on 8 GPUs with a batch size of 16k tokens for English-Russian and on 32 GPUs with a batch size of 65.5k tokens for English-German. (For the length penalty tuning we tried the values [0.01, 0.1, 0.3, 0.5, 0.8, 1, 2, 4, 8].)

Back-translation.
Synthetic sources are generated with an ensemble of four models and unrestricted sampling (Edunov et al., 2018). For models trained on a combination of true bitext and back-translated data, we upsample the true bitext by tuning the upsample ratio over the values [1, 10, 20, 40, 60].
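The upsampling step can be sketched as follows; the function name is ours, and in practice fairseq implements this via dataset sampling rather than literal repetition:

```python
# Sketch of mixing true bitext with back-translated data: the (smaller)
# true bitext is repeated `ratio` times so the model sees it more often.
def upsample_mix(bitext, synthetic, ratio):
    return bitext * ratio + synthetic

train = upsample_mix([("x1", "y1")], [("sx1", "sy1"), ("sx2", "sy2")], ratio=2)
```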
Results

We first compare various approaches to document-level translation as well as sentence-level baselines on the English-Russian OpenSubtitles 2018 benchmark (Table 2). We measure BLEU on the document-level test set of OpenSubtitles and accuracy on the consistency evaluation.

First, we find that sentence-level systems perform well in terms of BLEU but poorly in terms of the document-level consistency evaluation. This includes a system trained purely on sentence-level data (Sent), augmented with back-translated data (Sent + SentBT), and noisy channel reranking with a sentence-level language model (NoisyChannelSent; Yu et al. 2017; Yee et al. 2019).

Second, we evaluate document-level systems trained purely on bilingual document-level training data (1.5M documents) to understand how the amount of context modeled impacts accuracy. In terms of the consistency evaluation, we find that these systems perform better with more context modeled: Doc2Sent uses the entire source context but models only a single target sentence. This performs least well, although better than the sentence-level baselines. Modeling a sliding window of source and target sentences improves on this (Window2Window) and treating the entire document as a consecutive sequence performs best (Doc2Doc).

However, in terms of BLEU, all aforementioned document-level systems underperform the sentence-level systems. We suspect that this is because these document-level systems are trained on less bitext data: the 1.5M documents contain only 2M unique sentences since documents were created through a sliding window over the 6M bitext sentences used by Sent (§ 4.1).

Third, we evaluate various document-level approaches based on adding 30M monolingual documents (30M_md). In terms of the consistency evaluation, our reimplementation of DocRepair (Voita et al., 2019a) performs very well and outperforms the sentence-level systems, including the ones based on sentence-level back-translation (SentBT).
Noisy channel reranking with a document-level language model (Yu et al., 2020) performs very well in terms of BLEU but less so in terms of the consistency evaluation. The n-best lists to be reranked by the noisy channel approach are based on a sentence-level system and we therefore re-generate them with a document-level system (Doc2Sent). This improves the consistency evaluation but still does not perform as well as the other approaches relying on document-level monolingual data.

Finally, simply back-translating the monolingual documents and training a standard sequence-to-sequence model on this data outperforms all above approaches on the consistency test set, including DocRepair which requires two translation steps at inference time compared to a single back-translation step for DocBT. Interestingly, adding the true bitext documents (1.5M_d, Doc2Doc) does not improve over solely back-translated documents (DocBT).

Automatic evaluation in terms of BLEU and the consistency test set results are not in strong agreement. We therefore collect judgments from professional human translators with source-based direct assessment (§ 4.3). For this evaluation we retain all systems except for Doc2Sent and Window2Window to make the human study more manageable and because these systems were clearly outperformed by Doc2Doc.

Human judgements (Table 3) on the documents of the consistency test set confirm that DocBT performs very well compared to the other data augmentation-based approaches (DocRepair, NoisyChannelDoc) and the results show a clear distinction between sentence-level and document-level approaches. However, human preferences are much less pronounced on the standard test set, with no systems clearly outperforming the others. This is likely because the examples in the consistency test set were selected to test for phenomena which are not as prevalent in existing test sets.

Method                        | Training data | BLEU (↑) | avg  | deixis | lex. c. | ell. infl. | ell. VP
Sent (Voita et al., 2019a)    | 6M            | 33.9     | 44.3 | 50.0   | 45.9    | 53.0       | 28.4

Table 2: Results on OpenSubtitles 2018 English to Russian translation in terms of BLEU on the test set and consistency evaluation scores (Voita et al., 2019a); the avg column is the arithmetic average of the four discourse phenomena. We indicate the amount and type of training data: no subscript denotes sentence-level bitext, subscript m denotes monolingual data, subscript d denotes document-level data.

Table 3: Human evaluation results for OpenSubtitles 2018 English-Russian translation on the test set of the benchmark as well as on the consistency test set of Voita et al. (2019a). We randomly sampled 200 examples from the standard test set, and 100 examples from the consistency test set (25 from each discourse phenomenon subset). Results marked with * are statistically significantly better than the baseline (Sent) system at p=0.05.

So far we saw that simple back-translation of documents (DocBT) performed competitively with more complicated semi-supervised methods. To confirm these findings we perform another experiment on WMT17 English-German translation and compare DocBT to DocRepair and NoisyChannelDoc, as well as a few simpler alternatives. Following Müller et al. (2018), we measure performance in terms of sentence-level detokenized BLEU on newstest2017-2019. We also compare to the best sentence-level and document-level results of Müller et al. (2018) whose pronoun contrastive task we use in our study.

The results (Table 4) show that sentence-level systems perform poorly on the document-level metrics which require modeling context information. The document-level systems outperform the sentence-level baselines on the contrastive pronoun task and the simple DocBT method ranks amongst the best systems in the consistency evaluation. However, additional monolingual data does not improve the consistency evaluation over just training on bitext document data (Doc2Doc). NoisyChannelDoc performs less well than the other document-level methods.
This is likely because the n-best lists for reranking were generated with sentence-level direct models; using a document-level direct model would improve results (similar to NoisyChannelDoc + Doc2Sent in Table 2).

Method                     | Training data | 2017 | 2018 | 2019 | total (↑) | es   | er   | sie
Müller et al. (2018) Sent  | 6M            | 24.6 | 35.4 | -    | 0.47      | 0.81 | 0.22 | 0.38

Table 4: Results on WMT17 English to German translation in terms of BLEU on various WMT test sets, and a contrastive test suite evaluating pronoun selection (Müller et al. 2018; cf. Table 2).

Similar to before, BLEU does not enable strong conclusions. In particular, DocBT performs poorly on newstest2019, which is a test set that is purely forward translated, that is, sentences originally written in English are paired with German human translations and thus BLEU is measured against human translated text (Bojar et al., 2019). This is also the case for DocRepair, whose training data involves round-trip translation. While realistic, for this setup BLEU has been shown to correlate very poorly with human judgements on forward translated test data (Edunov et al., 2020). We therefore also evaluate BLEU on the German-English version of newstest2019 with source and target reversed and find that DocBT and DocBT + Doc2Doc obtain the highest BLEU amongst all systems on this test set, followed by DocRepair.

To draw stronger conclusions about the performance of DocBT, we perform another smaller human study. We ask professional human translators to give preference ratings for DocBT vs. the sentence-level baseline (Sent) in a first evaluation and DocBT vs. NoisyChannelDoc in a second evaluation. We focus on NoisyChannelDoc in favor of DocRepair because the former achieved better BLEU.

Test No. | Method | BLEU (↑) | Contrastive score total (↑) | Human preference (↑)
1        | Sent   | 37.9     | 0.50                        | 0.33
1        | DocBT  | 32.1     | 0.81                        |

Table 5: Human preferences on WMT17 English-German data. We ask human raters to indicate which system is preferred on 100 randomly sampled examples from newstest2019, each up to 1000 tokens long (§ 4.1). Results marked by * are statistically significantly better than the other system at p = 0.05.

Table 5 shows that DocBT is clearly preferred over both the sentence-level baseline (Sent) and the more complicated NoisyChannelDoc method.
Conclusion

We compared several recent approaches to document-level translation on two benchmark datasets. We find that training a standard sequence to sequence model on back-translated document-level monolingual data presents a very competitive baseline. We encourage future research in document-level translation to compare to this baseline. (Human evaluation for DocRepair vs. DocBT is in progress and will be included in the next version of this paper.)

Evaluation of document-level translation is challenging and we present results in terms of automatic metrics as well as human evaluation. Document-level consistency evaluation suites are useful and clearly distinguish systems capable of modeling long-range context from sentence-level systems. However, their construction likely overemphasizes phenomena which are not as frequent in other datasets.
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv, abs/1607.06450.

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. arXiv, abs/1809.10853.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.

Ondrej Bojar and Ales Tamchyna. 2011. Improving translation model by monolingual data. In Proc. of WMT.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proc. of WMT.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proc. of WMT.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv, abs/1901.02860.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proc. of EMNLP.

Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, and Michael Auli. 2020. On the evaluation of machine translation systems trained with back-translation. In Proc. of ACL.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. of ICML.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, Will Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv, abs/1803.05567.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1-10, Beijing, China. Association for Computational Linguistics.

Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? arXiv.

Marcin Junczys-Dowmunt. 2019. Microsoft translator at WMT 2019: Towards large-scale document-level neural machine translation. arXiv, abs/1907.06170.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. arXiv.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL Demo Session.

Pierre Lison, Jörg Tiedemann, Milen Kouylekov, et al. 2019. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In LREC 2018, Eleventh International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA).

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

António Lopes, M. Amin Farajian, Rachel Bawden, Michael Zhang, and André Martins. 2020. Document-level neural MT: A systematic comparison. In Proc. of EAMT, pages 225-234.

Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Sameen Maruf and Gholamreza Haffari. 2018. Document context neural machine translation with memory networks. In Proc. of ACL.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv, abs/1609.07843.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. arXiv preprint arXiv:1809.01576.

Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. arXiv preprint arXiv:1810.02268.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL Demo Session.
Proc. of NAACL: Demonstrations .Matt Post. 2018. A call for clarity in reporting bleuscores.
WMT 2018 , page 186.Yves Scherrer, J¨org Tiedemann, and Sharid Lo´aiciga.2019. Analysing concatenation approaches todocument-level nmt in two different domains. In
Proceedings of the Fourth Workshop on Discoursein Machine Translation (DiscoMT 2019) , pages 51–61.Rico Sennrich, Barry Haddow, and Alexandra Birch.2016a. Improving neural machine translation mod-els with monolingual data. In
Proc. of ACL .Rico Sennrich, Barry Haddow,and Alexandra Birch. 2016b.Improving neural machine translation models with monolingual data.In
Proceedings of the 54th Annual Meeting of theAssociation for Computational Linguistics (Volume1: Long Papers) , pages 86–96, Berlin, Germany.Association for Computational Linguistics.Rico Sennrich, Barry Haddow, and Alexandra Birch.2016c. Neural machine translation of rare wordswith subword units. In
Proceedings of the 54th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , pages 1715–1725.J¨org Tiedemann and Yves Scherrer. 2017. Neural ma-chine translation with extended context. In
Proc. ofWorkshop on Discourse in Machine Translation .Antonio Toral, Sheila Castilho, Ke Hu, and AndyWay. 2018. Attaining the unattainable? reassessingclaims of human parity in neural machine translation.In
Proc. of WMT .Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, LukaszKaiser, and Illia Polosukhin. 2017. Attention Is AllYou Need. In
Proc. of NIPS .Elena Voita, Rico Sennrich, and Ivan Titov. 2019a.Context-aware monolingual repair for neural ma-chine translation. In
Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) , pages 876–885. Elena Voita, Rico Sennrich, and Ivan Titov. 2019b.When a good translation is wrong in context:Context-aware machine translation improves ondeixis, ellipsis, and lexical cohesion. arXiv preprintarXiv:1905.05979 .Elena Voita, Pavel Serdyukov, Rico Sennrich, and IvanTitov. 2018. Context-aware neural machine trans-lation learns anaphora resolution. In
Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) ,pages 1264–1274.Longyue Wang, Zhaopeng Tu, Andy Way, and QunLiu. 2017. Exploiting cross-sentence contextfor neural machine translation. arXiv preprintarXiv:1704.04347 .Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc VLe, Mohammad Norouzi, Wolfgang Macherey,Maxim Krikun, Yuan Cao, Qin Gao, KlausMacherey, et al. 2016. Google’s Neural Ma-chine Translation System: Bridging the Gap be-tween Human and Machine Translation. arXiv ,abs/1609.08144.Kyra Yee, Yann Dauphin, and Michael Auli. 2019.Simple and effective noisy channel modeling forneural machine translation. In
Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) , pages 5700–5705.Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefen-stette, and Tom´as Kocisk´y. 2017. The neural noisychannel. In
Proc. of ICLR .Lei Yu, Laurent Sartran, Wojciech Stokowiec, WangLing, Lingpeng Kong, Phil Blunsom, and ChrisDyer. 2020. Better document-level machine trans-lation with bayes’ rule.
Transactions of the Associa-tion for Computational Linguistics , 8:346–360.Jiacheng Zhang, Huanbo Luan, Maosong Sun, FeifeiZhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018.Improving the transformer translation model withdocument-level context. In