Towards Making the Most of Context in Neural Machine Translation
Zaixiang Zheng∗, Xiang Yue∗, Shujian Huang, Jiajun Chen and Alexandra Birch

National Key Laboratory for Novel Software Technology, Nanjing University
ILCC, School of Informatics, University of Edinburgh
{zhengzx,xiangyue}@smail.nju.edu.cn, {huangsj,chenjj}@nju.edu.cn, [email protected]

Abstract
Document-level machine translation manages to outperform sentence-level models by a small margin, but has failed to be widely adopted. We argue that previous research did not make clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with awareness of the global context of the document in both the source and target languages. We specifically design the model to be able to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing to train on sentence- and document-level data separately. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU over state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences that previous studies have typically incorporated.

Introduction

Recent studies suggest that neural machine translation (NMT) [Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017] has achieved human parity, especially on resource-rich language pairs [Hassan et al., 2018]. However, standard NMT systems are designed for sentence-level translation and cannot consider the dependencies among sentences in order to translate entire documents. To address this challenge, various document-level NMT models, viz. context-aware models, have been proposed to leverage context beyond a single sentence [Wang et al., 2017; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019] and have achieved substantial improvements over their context-agnostic counterparts.

∗Equal contribution. This work was done when Zaixiang was visiting the University of Edinburgh. Code is released at https://github.com/Blickwinkel1107/making-the-most-of-context-nmt
Figure 1: Illustration of typical Transformer-based context-aware approaches (some of them do not consider the target context).
Figure 1 briefly illustrates typical context-aware models, where the source and/or target document contexts are regarded as an additional input stream parallel to the current sentence and incorporated into each layer of the encoder and/or decoder [Zhang et al., 2018; Tan et al., 2019]. More specifically, the representation of each word in the current sentence is a deep hybrid of both global document context and local sentence context in every layer. We notice that these hybrid encoding approaches have two main weaknesses:

• Models are context-aware, but do not fully exploit the context. The deep hybrid makes the model more sensitive to noise in the context, especially when the context is enlarged. This could explain why previous studies show that enlarging the context leads to performance degradation. Therefore, these approaches have not taken the best advantage of the entire document context.

• Models translate documents, but cannot translate single sentences. Because the deep hybrid requires global document context as additional input, these models are no longer compatible with sentence-level translation based solely on the local sentence context. As a result, these approaches usually translate single-sentence documents poorly when document-level context is absent.

In this paper, we mitigate the aforementioned two weaknesses by designing a general-purpose NMT architecture which can fully exploit the context in documents with an arbitrary number of sentences. To avoid the deep hybrid, our architecture balances local context and global context more deliberately. More specifically, our architecture independently encodes the local context of the source sentence, instead of mixing it with the global context from the beginning, so it is robust when the global context is large and noisy.
Furthermore, our architecture translates in a sentence-by-sentence manner with access to the partially generated document translation as the target global context, which allows the local context to govern the translation process for single-sentence documents.

We highlight our contributions in three aspects:

• We propose a new NMT framework that is able to deal with documents containing any number of sentences, including single-sentence documents, making training and deployment simpler and more flexible.

• We conduct experiments on four document-level translation benchmark datasets, which show that the proposed unified approach outperforms Transformer baselines and previous state-of-the-art document-level NMT models on both sentence-level and document-level translation.

• Based on thorough analyses, we demonstrate that the document context really matters; the more context provided, the better our model translates. This finding is in contrast to the prevailing consensus that a wider context deteriorates translation quality.
Related Work

Context beyond the current sentence is crucial for machine translation. Bawden et al. [2018], Läubli et al. [2018], Müller et al. [2018], Voita et al. [2018] and Voita et al. [2019b] show that without access to document-level context, NMT is likely to fail to maintain lexical, tense, deixis and ellipsis consistency, and to resolve anaphoric pronouns and other discourse characteristics; they propose corresponding test sets for evaluating discourse phenomena in NMT.

Most current document-level NMT models can be classified into two main categories: context-aware models and post-processing models. Post-processing models introduce an additional module that learns to refine the translations produced by context-agnostic NMT systems to make them more discourse-coherent [Xiong et al., 2019; Voita et al., 2019a]. While this kind of approach is easy to deploy, the two-stage generation process may result in error accumulation.

In this paper, we focus mainly on context-aware models; post-processing approaches can be combined with, and facilitate, any NMT architecture. Tiedemann and Scherrer [2017] and Junczys-Dowmunt [2019] use the concatenation of multiple sentences (usually a small number of preceding sentences) as NMT input/output. Going beyond simple concatenation, Jean et al. [2017] introduce a separate context encoder for a few previous source sentences. Wang et al. [2017] include a hierarchical RNN to summarize source context. Other approaches use a dynamic memory to store representations of previously translated content [Tu et al., 2018; Kuang et al., 2018; Maruf and Haffari, 2018]. Miculicich et al. [2018], Zhang et al. [2018], Yang et al. [2019], Maruf et al. [2019] and Tan et al. [2019] extend context-aware models to the Transformer architecture with additional context-related modules.

While claiming that modeling the whole document is not necessary, these models take into account only a few surrounding sentences [Maruf and Haffari, 2018; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019], or even only monolingual context [Zhang et al., 2018; Yang et al., 2019; Tan et al., 2019], which is not necessarily sufficient to translate a document. In contrast, our model can consider the entire, arbitrarily long document and simultaneously exploit contexts in both the source and target languages. Furthermore, most of these document-level models cannot be applied to sentence-level translation, lacking both simplicity and flexibility in practice. They rely on variants of components specifically designed for document context (e.g., encoder/decoder-to-context attention embedded in all layers [Zhang et al., 2018; Miculicich et al., 2018; Tan et al., 2019]), and are thus limited to the scenario where the document context must be an additional input stream. Thanks to our general-purpose modeling, the proposed model performs general translation regardless of the number of sentences in the input text.
Sentence-level NMT
Standard NMT models usually model sentence-level translation (SentNMT) within an encoder-decoder framework [Bahdanau et al., 2015]. SentNMT models aim to maximize the conditional log-likelihood $\log p(y \mid x; \theta)$ over a target sentence $y = \langle y_1, \dots, y_T \rangle$ given a source sentence $x = \langle x_1, \dots, x_I \rangle$, from abundant parallel bilingual data $D_s = \{x^{(m)}, y^{(m)}\}_{m=1}^{M}$ of i.i.d. observations:

$$\mathcal{L}(D_s; \theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta).$$
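As a concrete toy illustration of this objective: the corpus log-likelihood decomposes into per-token log-probabilities. The stub `token_prob` below stands in for a real NMT model and is purely hypothetical; this is a sketch of the objective, not an implementation of the paper's system.

```python
import math

def sent_nmt_log_likelihood(corpus, token_prob):
    """L(D_s; theta) = sum_m log p(y^(m) | x^(m); theta); each sentence's
    log-probability further decomposes over its target tokens."""
    total = 0.0
    for x, y in corpus:
        total += sum(math.log(token_prob(x, y, t)) for t in range(len(y)))
    return total

# Toy stand-in "model": every token gets probability 0.5 regardless of context.
uniform = lambda x, y, t: 0.5
corpus = [(["a"], ["b", "c"]), (["d"], ["e"])]
ll = sent_nmt_log_likelihood(corpus, uniform)  # 3 target tokens, each log 0.5
```

Training maximizes this quantity over the model parameters θ; the document-level criterion below generalizes it by conditioning each sentence on the preceding ones.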
Document-level NMT

Given a document-level parallel dataset $D_d = \{X^{(m)}, Y^{(m)}\}_{m=1}^{M}$, where $X^{(m)} = \langle x_k^{(m)} \rangle_{k=1}^{n}$ is a source document containing $n$ sentences and $Y^{(m)} = \langle y_k^{(m)} \rangle_{k=1}^{n}$ is the corresponding target document with $n$ sentences, the training criterion for a document-level NMT model (DocNMT) is to maximize the conditional log-likelihood over the document pairs, translated sentence by sentence:

$$\mathcal{L}(D_d; \theta) = \sum_{m=1}^{M} \log p(Y^{(m)} \mid X^{(m)}; \theta) = \sum_{m=1}^{M} \sum_{k=1}^{n} \log p(y_k^{(m)} \mid y_{<k}^{(m)}, X^{(m)}; \theta).$$

Figure 2: Illustration of the proposed model. The local encoding is complete and independent, which also allows context-agnostic generation.

Our model works as follows:

• The encoder first encodes the local context of each source sentence independently (local encoding), then retrieves the global context of the entire document (global encoding) and forms hybrid contextual representations (context fusion).
For single-sentence generation, the global encoding is dynamically disabled and the local context can directly flow through to the decoder to dominate translation. (Section 4.1)

• Once the local and global understanding of the source document is constructed, the decoder generates the target document on a sentence-by-sentence basis, conditioned on the source representations of the current sentence as well as the target global context from the previously translated history and the local context from the partial translation so far. (Section 4.2)

This general-purpose modeling allows the proposed model to fully utilize the bilingual, entire-document context, and to go beyond the restricted scenario where models must take document context as additional input streams and fail to translate single sentences. These two advantages meet our expectation of a unified and general NMT framework.

Lexical and Positional Encoding

The source input is transformed into lexical and positional representations. We use the word position embedding of the Transformer [Vaswani et al., 2017] to represent the order of words. Note that we reset word positions for each sentence, i.e., the $i$-th word in each sentence shares the word position embedding $E^w_i$. Besides, we introduce a segment embedding $E^s_k$ to represent the $k$-th sentence. The representation of the $i$-th word in the $k$-th sentence is therefore given by $\tilde{x}_{k,i} = E[x_{k,i}] + E^s_k + E^w_i$, where $E[x_{k,i}]$ is the word embedding of $x_{k,i}$.

Local Context Encoding

We construct the local context for each sentence with a stack of standard Transformer layers [Vaswani et al., 2017]. Given the $k$-th source sentence $x_k$, the local encoder leverages $N-1$ stacked layers to map it into encoded representations:

$$\hat{h}_k^l = \mathrm{MultiHead}(\mathrm{SelfAttn}(h_k^{l-1}, h_k^{l-1}, h_k^{l-1})),$$
$$h_k^l = \mathrm{LayerNorm}(\mathrm{FeedForward}(\hat{h}_k^l) + \hat{h}_k^l),$$

where $\mathrm{SelfAttn}(Q, K, V)$ denotes self-attention, with $Q$, $K$, $V$ indicating queries, keys, and values, respectively.
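The position-reset scheme of the lexical and positional encoding above can be sketched in a few lines of pure Python. Toy two-dimensional "embeddings" stand in for real learned vectors; all names here are illustrative, not from the released code.

```python
def input_representation(doc, word_emb, seg_emb, pos_emb):
    """x~_{k,i} = E[x_{k,i}] + E^s_k + E^w_i, with word positions
    reset at the start of every sentence (i restarts at 0 per sentence)."""
    reps = []
    for k, sent in enumerate(doc):
        reps.append([
            [w + s + p for w, s, p in zip(word_emb[tok], seg_emb[k], pos_emb[i])]
            for i, tok in enumerate(sent)
        ])
    return reps

# Toy 2-sentence document with 2-dimensional embeddings.
doc = [["hello", "world"], ["hello"]]
word_emb = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
seg_emb = [[0.1, 0.1], [0.2, 0.2]]    # one embedding per sentence index
pos_emb = [[0.01, 0.0], [0.02, 0.0]]  # shared across sentences after reset
reps = input_representation(doc, word_emb, seg_emb, pos_emb)
```

Note that "hello" at position 0 of the second sentence reuses `pos_emb[0]`, differing from its first occurrence only by the segment embedding, which is exactly the reset behavior described above.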
$\mathrm{MultiHead}(\cdot)$ means the attention is performed in a multi-headed fashion [Vaswani et al., 2017]. We let the input representations $\tilde{x}_k$ be the 0-th layer representations $h_k^0$, and we take the $(N-1)$-th layer of the local encoder as the local context of each sentence, i.e., $h_k^L = h_k^{N-1}$.

Global Context Encoding

We add an additional layer on top of the local context encoding layers, which retrieves global context from the entire document by a segment-level relative attention, and outputs final representations based on hybrid local and global context via a gated context fusion mechanism.

Segment-level Relative Attention. Given the local representations of each sentence, we propose to extend relative attention [Shaw et al., 2018] from the token level to the segment level to model inter-sentence global context:

$$h^G = \mathrm{MultiHead}(\mathrm{Seg\text{-}Attn}(h^L, h^L, h^L)),$$

where $\mathrm{Seg\text{-}Attn}(Q, K, V)$ denotes the proposed segment-level relative attention. Taking $x_{k,i}$ as the query, its contextual representation $z_{k,i}$ under the proposed attention is computed over all words (e.g., $x_{\kappa,j}$) in the document with respect to the sentence (segment) they belong to:

$$z_{k,i} = \sum_{\kappa=1}^{n} \sum_{j=1}^{|x_\kappa|} \alpha_{k,i}^{\kappa,j} \, (W^V x_{\kappa,j} + \gamma^V_{k-\kappa}), \qquad \alpha_{k,i}^{\kappa,j} = \mathrm{softmax}(e_{k,i}^{\kappa,j}),$$

where $\alpha_{k,i}^{\kappa,j}$ is the attention weight of $x_{k,i}$ to $x_{\kappa,j}$. The corresponding attention logit $e_{k,i}^{\kappa,j}$ is computed with respect to the relative sentence distance by:

$$e_{k,i}^{\kappa,j} = (W^Q x_{k,i})(W^K x_{\kappa,j} + \gamma^K_{k-\kappa})^{\top} / \sqrt{d_z}, \qquad (1)$$

where $\gamma^{*}_{k-\kappa}$ is a parameter vector corresponding to the relative distance between the $k$-th and $\kappa$-th sentences, providing inter-sentential clues, and $W^Q$, $W^K$, $W^V$ are linear projection matrices for the queries, keys and values, respectively.
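A pure-Python sketch of this global-encoding step, under simplifying assumptions: a single head, the projections $W^Q$, $W^K$, $W^V$ already applied to the inputs, LayerNorm omitted, and the γ biases stored in a dictionary keyed by the relative sentence distance $k-\kappa$. The gate in `gated_fusion` corresponds to the fusion mechanism described next; all names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def seg_rel_attn(q, k_idx, doc_keys, doc_vals, gamma_K, gamma_V):
    """One query of Eq.(1): every key and value receives a bias that depends
    only on the relative sentence distance k_idx - kappa."""
    d_z = len(q)
    logits, vals = [], []
    for kappa, (keys, values) in enumerate(zip(doc_keys, doc_vals)):
        gk, gv = gamma_K[k_idx - kappa], gamma_V[k_idx - kappa]
        for key, val in zip(keys, values):
            logits.append(dot(q, [a + b for a, b in zip(key, gk)]) / math.sqrt(d_z))
            vals.append([a + b for a, b in zip(val, gv)])
    alpha = softmax(logits)  # attention over every token in the document
    return [sum(a * v[d] for a, v in zip(alpha, vals)) for d in range(d_z)]

def gated_fusion(h_local, h_global, gate):
    """h = (1 - g) * h_local + g * h_global, element-wise; in the model the
    gate g comes from a sigmoid over [h^L; h^G], here it is given directly."""
    return [(1 - g) * l + g * gl for g, l, gl in zip(gate, h_local, h_global)]

# Two one-token sentences with 2-d states; zero biases reduce to plain attention.
zeros = {d: [0.0, 0.0] for d in (-1, 0, 1)}
z = seg_rel_attn([1.0, 0.0], 0,
                 doc_keys=[[[0.0, 0.0]], [[0.0, 0.0]]],  # equal keys -> uniform weights
                 doc_vals=[[[1.0, 0.0]], [[3.0, 0.0]]],
                 gamma_K=zeros, gamma_V=zeros)
```

With equal keys and zero biases the weights are uniform, so `z` is simply the average of the two value vectors; non-zero γ entries would shift attention toward or away from particular sentence distances.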
Gated Context Fusion. After the global context is retrieved, we adopt a gating mechanism to obtain the final encoder representations $h$ by fusing local and global context:

$$g = \sigma(W_g [h^L; h^G]),$$
$$h = \mathrm{LayerNorm}\big((1 - g) \odot h^L + g \odot h^G\big),$$

where $W_g$ is a learnable linear transformation, $[\cdot\,;\cdot]$ denotes concatenation, $\sigma(\cdot)$ is the sigmoid activation, which constrains the value of the fusion gate to lie between 0 and 1, and $\odot$ indicates element-wise multiplication.

The goal of the decoder is to generate translations sentence by sentence, considering the previously generated sentences as the target global context. A natural idea is to store the hidden states of previous target translations and allow the self-attention of the decoder to access these hidden states as extended history context. For this purpose, we leverage and extend Transformer-XL [Dai et al., 2019] as the decoder. Transformer-XL is a Transformer variant designed to cache and reuse the hidden states computed for the previous segment as extended context, so that long-term dependency information occurring many words back can propagate through the recurrent connections between segments; this exactly meets our requirement of generating document-long text. We cast each sentence as a "segment" in translation tasks and equip the Transformer-XL-based decoder with cross-attention to retrieve time-dependent source context for the current sentence. Formally, given two consecutive sentences $y_k$ and $y_{k-1}$, the $l$-th layer of our decoder first performs self-attention over the extended history context:

$$\tilde{s}_k^{l-1} = [\mathrm{SG}(s_{k-1}^{l-1}); s_k^{l-1}],$$
$$\bar{s}_k^l = \mathrm{MultiHead}(\mathrm{Rel\text{-}SelfAttn}(s_k^{l-1}, \tilde{s}_k^{l-1}, \tilde{s}_k^{l-1})),$$
$$\bar{s}_k^l = \mathrm{LayerNorm}(\bar{s}_k^l + s_k^{l-1}),$$

where $\mathrm{SG}(\cdot)$ stands for stop-gradient, and $\mathrm{Rel\text{-}SelfAttn}(Q, K, V)$ is a variant of self-attention with word-level relative position encoding. For more details, please refer to [Dai et al., 2019].
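The reuse of previous-sentence states can be sketched as a simple cache. Copying the states plays the role of the stop-gradient $\mathrm{SG}(\cdot)$ (no gradient would flow into the history); all names are illustrative, and real states would be layer-wise tensors rather than small lists.

```python
class ExtendedHistory:
    """Cache the hidden states of the previously translated sentence and
    prepend them, as constants, to the current sentence's states before
    self-attention: s~_k = [SG(s_{k-1}); s_k]."""

    def __init__(self):
        self._prev = []

    def extended_context(self, cur_states):
        # Keys/values span the cached history plus the current sentence.
        return self._prev + cur_states

    def update(self, cur_states):
        # "Detach": copy the values so nothing links back to the live states.
        self._prev = [list(s) for s in cur_states]

cache = ExtendedHistory()
cache.update([[0.1], [0.2]])           # states of sentence y_{k-1}
ctx = cache.extended_context([[0.3]])  # attention context for sentence y_k
```

In the full model one such cache exists per decoder layer, so information from many sentences back can propagate through the recurrence between segments.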
After that, the cross-attention module fetches the source context from the encoder representations $h_k$:

$$\hat{s}_k^l = \mathrm{MultiHead}(\mathrm{CrossAttn}(\bar{s}_k^l, h_k, h_k)),$$
$$s_k^l = \mathrm{LayerNorm}(\mathrm{FeedForward}(\hat{s}_k^l) + \hat{s}_k^l).$$

Given the final representations $s_k^N$ of the last decoder layer, the probability of the current target sentence $y_k$ is computed token by token as $p(y_k \mid y_{<k}, X; \theta) = \prod_t p(y_{k,t} \mid y_{k,<t}, y_{<k}, X; \theta)$.

We list the results of our experiments in Table 1, comparing four context-aware NMT models: Document-aware Transformer [Zhang et al., 2018, DocT], Hierarchical Attention NMT [Miculicich et al., 2018, HAN], Selective Attention NMT [Maruf et al., 2019, SAN] and Query-guided Capsule Network [Yang et al., 2019, QCN]. As shown in Table 1, by leveraging document context, our proposed model obtains gains of 2.1, 2.0, 2.5, and 1.0 BLEU over the sentence-level Transformer baselines on the TED ZH-EN, TED EN-DE, News and Europarl datasets, respectively. The last two corpora are from Maruf et al. [2019].

Model                              Δ|θ|   v_train  v_test  TED ZH-EN  TED EN-DE  News   Europarl  avg.
SentNMT [Vaswani et al., 2017]     0.0m   1.0×     1.0×    –          –          –      –         –
DocT [Zhang et al., 2018]          9.5m   0.65×    –       n/a        24.00      23.08  29.32     25.46
HAN [Miculicich et al., 2018]      4.8m   0.32×    –       –          –          –      –         –
SAN [Maruf et al., 2019]           4.2m   0.51×    –       n/a        24.42      24.84  29.75     26.33
QCN [Yang et al., 2019]            n/a    n/a      n/a     –          –          –      –         –
Ours                               –      –        –       –          –          –      –         –

Table 1: Experimental results of our model in comparison with several baselines, including the increment in the number of parameters over the Transformer baseline (Δ|θ|), training/testing speeds (v_train / v_test, some derived from Maruf et al. [2019]), and translation results on the test sets in BLEU.

Model                                 Test
SentNMT                               17.0
DocNMT (documents as input/output)    14.2
HAN [Miculicich et al., 2018]         15.6
Ours                                  –

Table 2: BLEU scores of sentence-level (single-sentence) translation on TED ZH-EN.

Among them, our model achieves new state-of-the-art results on TED ZH-EN and Europarl, showing the superiority of exploiting the whole document context.
Though our model is not the best on the TED EN-DE and News tasks, it is still comparable with QCN and HAN, and it achieves the best average performance on the English-German benchmarks, by at least 0.47 BLEU over the best previous model. We suggest this is probably because we did not apply the two-stage training scheme used in Miculicich et al. [2018] or the regularizations introduced in Yang et al. [2019]. In addition, while training speed is sacrificed, the parameter increment and decoding speed remain manageable.

Sentence-level Translation. We compare performance on single-sentence translation in Table 2, which demonstrates the good compatibility of our proposed model with both document and sentence translation, whereas the performance of the other approaches lags far behind the sentence-level baseline. The reason is that the previous approaches require document context as a separate input stream, while ours does not. This difference ensures feasibility for both document- and sentence-level translation within one unified framework. Therefore, our model can be directly used in general translation tasks with input text of any number of sentences, which makes it more deployment-friendly.

Does Bilingual Context Really Matter? Yes. To investigate how important the bilingual context is, and the corresponding contribution of each component, we summarize the ablation study in Table 3. First of all, directly using the entire document as input and output cannot even produce a document translation with the same number of sentences as the source document, and it is much worse than the sentence-level baseline and our model in terms of document-level BLEU. For source context modeling, merely casting the whole source document as one input sequence (Doc2Sent) does not work. Meanwhile, resetting word positions and introducing segment embeddings for each sentence alleviate this problem, which verifies one of our motivations: the model should focus more on the local sentence. Moreover, the gains from the segment-level relative attention and the gated context fusion mechanism demonstrate that retrieving and integrating the source global context is useful for document translation. As for the target context, employing the Transformer-XL decoder to exploit the target global context of the translation history also leads to better performance on document translation. This somewhat contradicts Zhang et al. [2018], who claim that using target context leads to error propagation. Finally, by jointly modeling both source and target context, our final model obtains the best performance.

Model                                        BLEU (BLEU_doc)
SentNMT [Vaswani et al., 2017]               11.4 (21.0)
DocNMT (documents as input/output)           n/a (17.0)
Modeling source context:
Doc2Sent                                     6.8
+ reset word positions for each sentence     10.0
+ segment embedding                          10.5
+ segment-level relative attention           12.2
+ context fusion gate                        12.4
Modeling target context:
Transformer-XL decoder [Sent2Doc]            12.4
Final model [Ours]                           12.9 (24.4)

Table 3: Ablation study on modeling context, on the TED ZH-EN development set. "Doc" means using an entire document as one sequence for input or output. BLEU_doc indicates the document-level BLEU score calculated on the concatenation of all output sentences.

Effect of Quantity of Context: the More, the Better. We also experiment to show how the quantity of context affects our model in document translation. As shown in Figure 3, we find that providing only one adjacent sentence as context already helps document translation, and that the more context is given, the better the translation quality, although there does seem to be an upper limit of 20 sentences. Successfully incorporating context of this size is something related work has not achieved [Zhang et al., 2018; Miculicich et al., 2018; Yang et al., 2019].
We attribute this advantage to our hierarchical model design, which gains more than it loses from the increasingly noisy global context, guided by the well-formed, uncorrupted local context.

Figure 3: BLEU score with respect to the quantity of context on TED ZH-EN.

Effect of Transfer Learning: Data Hunger Remains a Problem for Document-level Translation. Due to the limited document-level parallel data, exploiting sentence-level parallel corpora or monolingual document-level corpora has drawn increasing attention. We investigate transfer learning (TL) approaches on TED ZH-EN. We pretrain our model on the WMT18 ZH-EN sentence-level parallel corpus with 7m sentence pairs, where every single sentence is regarded as a document. We then finetune the pretrained model on the TED ZH-EN document-level parallel data (source & target TL). We also compare to a variant in which only the encoder is initialized (source TL). As shown in Table 4, transfer learning can help alleviate the need for document-level data in the source and target languages to some extent. However, the scarcity of document-level parallel data still prevents document-level NMT from scaling up.

Model                                  Dev    Test
Transformer [Vaswani et al., 2017]     11.4   17.0
BERT+MLM [Li et al., 2019]             n/a    20.7
Ours                                   –      –
Ours + source TL                       13.9   19.7
Ours + source & target TL              14.9   21.3

Table 4: Effect of transfer learning (TL).

What Does the Model Learn about Context? A Case Study. Furthermore, we are interested in what the proposed model learns about context. In Figure 4, we visualize the sentence-to-sentence attention weights of a source document based on the segment-level relative attention. Formally, the weight of the $k$-th sentence attending to the $\kappa$-th sentence is computed by $\alpha_k^{\kappa} = \frac{1}{|x_k|} \sum_i \sum_j \alpha_{k,i}^{\kappa,j}$, where $\alpha_{k,i}^{\kappa,j}$ is defined by Eq.(1).
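This aggregation can be sketched directly from a token-to-token attention matrix, with sentences given as half-open token ranges; since every token's attention weights sum to 1, averaging over the $|x_k|$ query tokens keeps each sentence row summing to 1 as well. The sketch below uses illustrative names and a toy matrix.

```python
def sentence_attention(token_attn, spans):
    """A[k][kappa] = (1/|x_k|) * sum_{i in sent k} sum_{j in sent kappa} alpha[i][j].
    token_attn rows sum to 1, so rows of the returned matrix also sum to 1."""
    A = []
    for (ks, ke) in spans:
        row = []
        for (cs, ce) in spans:
            w = sum(token_attn[i][j] for i in range(ks, ke) for j in range(cs, ce))
            row.append(w / (ke - ks))  # average over the |x_k| query tokens
        A.append(row)
    return A

# Three tokens split into two sentences: tokens [0, 2) and [2, 3).
attn = [[0.2, 0.3, 0.5],
        [0.1, 0.1, 0.8],
        [0.4, 0.4, 0.2]]
A = sentence_attention(attn, [(0, 2), (2, 3)])
```

The resulting matrix is exactly what Figure 4 visualizes: each row is a sentence's attention distribution over all sentences of the document.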
As shown in Figure 4, we find very interesting patterns (which are also prevalent in other cases): 1) the first two sentences (blue frame), which contain the main topic and idea of a document, seem to be very useful context for all sentences; 2) the adjacent previous and subsequent sentences (red and purple diagonals, respectively) draw dense attention, which indicates the importance of surrounding context; 3) although the surrounding context is crucial, the subsequent sentence significantly outweighs the previous one. This may imply that the lack of target future information, but the availability of past information, in the decoder forces the encoder to retrieve more knowledge about the next sentence than about the previous one; 4) the model seems not to care much about the current sentence, probably because the local context can flow through the context fusion gate, so the segment-level relative attention focuses on fetching useful global context; 5) the 6-th sentence also draws attention from all the others (brown frame), and may thus play a special role in the inspected document.

Figure 4: Visualization of sentence-to-sentence attention based on segment-level relative attention. Each row represents a sentence, while each column represents another sentence to be attended. The weights of each row sum to 1.

Analysis on Discourse Phenomena. We also want to examine whether the proposed model actually learns to utilize document context to resolve discourse inconsistencies that context-agnostic models cannot handle. We use the contrastive test sets for the evaluation of discourse phenomena for English-Russian by Voita et al. [2019b]. There are four test sets in the suite, regarding deixis, lexical consistency, ellipsis (inflection), and ellipsis (verb phrase).

Model                   deixis   lex.c.   ell.infl.   ell.VP
SentNMT                 –        –        –           –
Ours                    –        –        –           –
Voita et al. [2019b]∗   –        –        –           –

Table 5: Accuracy (%) of discourse phenomena. ∗Different data and system conditions; only for reference.
Each test set contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variants. We summarize the results in Table 5. Our model is better at resolving discourse consistencies than the context-agnostic baseline. Voita et al. [2019b] use a context-agnostic baseline, trained on larger data, to generate first-pass drafts and then perform post-processing; this is not directly comparable, but could easily be incorporated with our model to achieve better results.

Conclusion

In this paper, we propose a unified local and global NMT framework which can successfully exploit context regardless of how many sentences are in the input. Extensive experimentation and analysis show that our model has indeed learned to leverage a larger context. In future work, we will investigate the feasibility of extending our approach to other document-level NLP tasks, e.g., summarization.

Acknowledgements

Shujian Huang is the corresponding author. This work was supported by the National Science Foundation of China (No. U1836221, 61772261, 61672277). Zaixiang Zheng was also supported by the China Scholarship Council (No. 201906190162). Alexandra Birch was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No 825299 (GoURMET) and by the UK EPSRC fellowship grant EP/S001271/1 (MTStretch).

References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[Bawden et al., 2018] Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In NAACL-HLT, 2018.

[Dai et al., 2019] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan R. Salakhutdinov.
Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.

[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.

[Jean et al., 2017] Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135, 2017.

[Junczys-Dowmunt, 2019] Marcin Junczys-Dowmunt. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In WMT, 2019.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

[Kuang et al., 2018] Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. Modeling coherence for neural machine translation with dynamic and topic caches. In COLING, 2018.

[Läubli et al., 2018] Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In EMNLP, 2018.

[Li et al., 2019] Liangyou Li, Xin Jiang, and Qun Liu. Pretrained language models for document-level neural machine translation. arXiv preprint, 2019.

[Maruf and Haffari, 2018] Sameen Maruf and Gholamreza Haffari. Document context neural machine translation with memory networks. In ACL, 2018.

[Maruf et al., 2019] Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. Selective attention for context-aware neural machine translation. In NAACL-HLT, 2019.

[Miculicich et al., 2018] Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. In EMNLP, 2018.

[Müller et al., 2018] Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In WMT, 2018.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

[Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, 2018.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[Tan et al., 2019] Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierarchical modeling of global context for document-level neural machine translation. In EMNLP-IJCNLP, 2019.

[Tiedemann and Scherrer, 2017] Jörg Tiedemann and Yves Scherrer. Neural machine translation with extended context. In DiscoMT, 2017.

[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[Voita et al., 2018] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In ACL, 2018.

[Voita et al., 2019a] Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware monolingual repair for neural machine translation. In EMNLP-IJCNLP, 2019.

[Voita et al., 2019b] Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In ACL, 2019.

[Wang et al., 2017] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In EMNLP, 2017.

[Xiong et al., 2019] Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In AAAI, 2019.

[Yang et al., 2019] Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. In EMNLP-IJCNLP, 2019.

[Zhang et al., 2018] Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the Transformer translation model with document-level context. In EMNLP, 2018.