Towards Making the Most of Context in Neural Machine Translation
Zaixiang Zheng∗, Xiang Yue∗, Shujian Huang, Jiajun Chen and Alexandra Birch

National Key Laboratory for Novel Software Technology, Nanjing University
ILCC, School of Informatics, University of Edinburgh
{zhengzx,xiangyue}@smail.nju.edu.cn, {huangsj,chenjj}@nju.edu.cn, [email protected]

Abstract
Document-level machine translation manages to outperform sentence-level models by a small margin, but has failed to be widely adopted. We argue that previous research did not make clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with awareness of the global context of the document in both the source and target languages. We specifically design the model to be able to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing to train on sentence- and document-level data separately. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU over state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences that previous studies have typically incorporated.

Introduction

Recent studies suggest that neural machine translation (NMT) [Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017] has achieved human parity, especially on resource-rich language pairs [Hassan et al., 2018]. However, standard NMT systems are designed for sentence-level translation and cannot consider the dependencies among sentences in order to translate entire documents. To address this challenge, various document-level NMT models, viz. context-aware models, have been proposed to leverage context beyond a single sentence [Wang et al., 2017; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019] and have achieved substantial improvements over their context-agnostic counterparts.

∗Equal contribution. This work was done when Zaixiang was visiting the University of Edinburgh. Code is released at https://github.com/Blickwinkel1107/making-the-most-of-context-nmt
Figure 1: Illustration of typical Transformer-based context-aware approaches (some of them do not consider the target context).
Figure 1 briefly illustrates typical context-aware models, where the source and/or target document contexts are regarded as an additional input stream parallel to the current sentence and incorporated into each layer of the encoder and/or decoder [Zhang et al., 2018; Tan et al., 2019]. More specifically, the representation of each word in the current sentence is a deep hybrid of both global document context and local sentence context in every layer. We notice that these hybrid encoding approaches have two main weaknesses:

• Models are context-aware, but do not fully exploit the context. The deep hybrid makes the model more sensitive to noise in the context, especially when the context is enlarged. This could explain why previous studies show that enlarging the context leads to performance degradation. Therefore, these approaches have not taken the best advantage of the entire document context.

• Models translate documents, but cannot translate single sentences. Because the deep hybrid requires global document context as additional input, these models are no longer compatible with sentence-level translation based solely on the local sentence context. As a result, these approaches usually translate single-sentence documents poorly when document-level context is absent.

In this paper, we mitigate the aforementioned two weaknesses by designing a general-purpose NMT architecture which can fully exploit the context in documents with an arbitrary number of sentences. To avoid the deep hybrid, our architecture balances local context and global context more deliberately. More specifically, our architecture independently encodes the local context of the source sentence, instead of mixing it with the global context from the beginning, so it is robust when the global context is large and noisy.
Furthermore, our architecture translates in a sentence-by-sentence manner with access to the partially generated document translation as the target global context, which allows the local context to govern the translation process for single-sentence documents.

We highlight our contributions in three aspects:

• We propose a new NMT framework that is able to deal with documents containing any number of sentences, including single-sentence documents, making training and deployment simpler and more flexible.

• We conduct experiments on four document-level translation benchmark datasets, which show that the proposed unified approach outperforms Transformer baselines and previous state-of-the-art document-level NMT models on both sentence-level and document-level translation.

• Based on thorough analyses, we demonstrate that the document context really matters; the more context provided, the better our model translates. This finding is in contrast to the prevailing consensus that a wider context deteriorates translation quality.
Related Work

Context beyond the current sentence is crucial for machine translation. Bawden et al. [2018], Läubli et al. [2018], Müller et al. [2018], Voita et al. [2018] and Voita et al. [2019b] show that without access to document-level context, NMT is likely to fail to maintain lexical, tense, deixis and ellipsis consistency, and to resolve anaphoric pronouns and other discourse characteristics; they propose corresponding test sets for evaluating discourse phenomena in NMT.

Most current document-level NMT models can be classified into two main categories: context-aware models and post-processing models. Post-processing models introduce an additional module that learns to refine the translations produced by context-agnostic NMT systems to make them more discourse-coherent [Xiong et al., 2019; Voita et al., 2019a]. While this kind of approach is easy to deploy, the two-stage generation process may result in error accumulation.

In this paper, we focus mainly on context-aware models; post-processing approaches can be combined with, and facilitate, any NMT architecture. Tiedemann and Scherrer [2017] and Junczys-Dowmunt [2019] use the concatenation of multiple sentences (usually a small number of preceding sentences) as NMT input/output. Going beyond simple concatenation, Jean et al. [2017] introduce a separate context encoder for a few previous source sentences. Wang et al. [2017] include a hierarchical RNN to summarize source context. Other approaches use a dynamic memory to store representations of previously translated content [Tu et al., 2018; Kuang et al., 2018; Maruf and Haffari, 2018]. Miculicich et al. [2018], Zhang et al. [2018], Yang et al. [2019], Maruf et al. [2019] and Tan et al. [2019] extend context-aware models to the Transformer architecture with additional context-related modules.

While claiming that modeling the whole document is not necessary, these models take into account only a few surrounding sentences [Maruf and Haffari, 2018; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019], or even only monolingual context [Zhang et al., 2018; Yang et al., 2019; Tan et al., 2019], which is not necessarily sufficient to translate a document. In contrast, our model can consider the entire, arbitrarily long document and simultaneously exploit contexts in both the source and target languages. Furthermore, most of these document-level models cannot be applied to sentence-level translation, lacking both simplicity and flexibility in practice. They rely on variants of components specifically designed for document context (e.g., encoder/decoder-to-context attention embedded in all layers [Zhang et al., 2018; Miculicich et al., 2018; Tan et al., 2019]), and are thus limited to the scenario where the document context must be an additional input stream. Thanks to our general-purpose modeling, the proposed model performs general translation regardless of the number of sentences in the input text.
Sentence-level NMT
Standard NMT models usually model sentence-level translation (SentNMT) within an encoder-decoder framework [Bahdanau et al., 2015]. SentNMT models aim to maximize the conditional log-likelihood $\log p(y \mid x; \theta)$ over a target sentence $y = \langle y_1, \dots, y_T \rangle$ given a source sentence $x = \langle x_1, \dots, x_I \rangle$, from abundant parallel bilingual data $D_s = \{x^{(m)}, y^{(m)}\}_{m=1}^{M}$ of i.i.d. observations:

$$\mathcal{L}(D_s; \theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta).$$
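As a concrete toy illustration of this objective: the corpus log-likelihood decomposes into per-token log-probabilities. The stub `token_prob` below stands in for a real NMT model and is purely hypothetical; this is a sketch of the objective, not an implementation of the paper's system.

```python
import math

def sent_nmt_log_likelihood(corpus, token_prob):
    """L(D_s; theta) = sum_m log p(y^(m) | x^(m); theta); each sentence's
    log-probability further decomposes over its target tokens."""
    total = 0.0
    for x, y in corpus:
        total += sum(math.log(token_prob(x, y, t)) for t in range(len(y)))
    return total

# Toy stand-in "model": every token gets probability 0.5 regardless of context.
uniform = lambda x, y, t: 0.5
corpus = [(["a"], ["b", "c"]), (["d"], ["e"])]
ll = sent_nmt_log_likelihood(corpus, uniform)  # 3 target tokens, each log 0.5
```

Training maximizes this quantity over the model parameters θ; the document-level criterion below generalizes it by conditioning each sentence on the preceding ones.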
Document-level NMT

Given a document-level parallel dataset $D_d = \{X^{(m)}, Y^{(m)}\}_{m=1}^{M}$, where $X^{(m)} = \langle x_k^{(m)} \rangle_{k=1}^{n}$ is a source document containing $n$ sentences and $Y^{(m)} = \langle y_k^{(m)} \rangle_{k=1}^{n}$ is the corresponding target document with $n$ sentences, the training criterion for a document-level NMT model (DocNMT) is to maximize the conditional log-likelihood over the document pairs, translated sentence by sentence:

$$\mathcal{L}(D_d; \theta) = \sum_{m=1}^{M} \log p(Y^{(m)} \mid X^{(m)}; \theta) = \sum_{m=1}^{M} \sum_{k=1}^{n} \log p(y_k^{(m)} \mid y_{<k}^{(m)}, X^{(m)}; \theta).$$

Figure 2: Illustration of the proposed model. The local encoding is complete and independent, which also allows context-agnostic generation.

Our model works as follows:

• The encoder first encodes the local context of each source sentence independently (local encoding), then retrieves the global context of the entire document (global encoding) and forms hybrid contextual representations (context fusion).
For single-sentence generation, the global encoding is dynamically disabled and the local context can directly flow through to the decoder to dominate translation. (Section 4.1)

• Once the local and global understanding of the source document is constructed, the decoder generates the target document on a sentence-by-sentence basis, conditioned on the source representations of the current sentence as well as the target global context from the previously translated history and the local context from the partial translation so far. (Section 4.2)

This general-purpose modeling allows the proposed model to fully utilize the bilingual, entire-document context, and to go beyond the restricted scenario where models must take document context as additional input streams and fail to translate single sentences. These two advantages meet our expectation of a unified and general NMT framework.

Lexical and Positional Encoding

The source input is transformed into lexical and positional representations. We use the word position embedding of the Transformer [Vaswani et al., 2017] to represent the order of words. Note that we reset word positions for each sentence, i.e., the $i$-th word in each sentence shares the word position embedding $E^w_i$. Besides, we introduce a segment embedding $E^s_k$ to represent the $k$-th sentence. The representation of the $i$-th word in the $k$-th sentence is therefore given by $\tilde{x}_{k,i} = E[x_{k,i}] + E^s_k + E^w_i$, where $E[x_{k,i}]$ is the word embedding of $x_{k,i}$.

Local Context Encoding

We construct the local context for each sentence with a stack of standard Transformer layers [Vaswani et al., 2017]. Given the $k$-th source sentence $x_k$, the local encoder leverages $N-1$ stacked layers to map it into encoded representations:

$$\hat{h}_k^l = \mathrm{MultiHead}(\mathrm{SelfAttn}(h_k^{l-1}, h_k^{l-1}, h_k^{l-1})),$$
$$h_k^l = \mathrm{LayerNorm}(\mathrm{FeedForward}(\hat{h}_k^l) + \hat{h}_k^l),$$

where $\mathrm{SelfAttn}(Q, K, V)$ denotes self-attention, with $Q$, $K$, $V$ indicating queries, keys, and values, respectively.
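The position-reset scheme of the lexical and positional encoding above can be sketched in a few lines of pure Python. Toy two-dimensional "embeddings" stand in for real learned vectors; all names here are illustrative, not from the released code.

```python
def input_representation(doc, word_emb, seg_emb, pos_emb):
    """x~_{k,i} = E[x_{k,i}] + E^s_k + E^w_i, with word positions
    reset at the start of every sentence (i restarts at 0 per sentence)."""
    reps = []
    for k, sent in enumerate(doc):
        reps.append([
            [w + s + p for w, s, p in zip(word_emb[tok], seg_emb[k], pos_emb[i])]
            for i, tok in enumerate(sent)
        ])
    return reps

# Toy 2-sentence document with 2-dimensional embeddings.
doc = [["hello", "world"], ["hello"]]
word_emb = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
seg_emb = [[0.1, 0.1], [0.2, 0.2]]    # one embedding per sentence index
pos_emb = [[0.01, 0.0], [0.02, 0.0]]  # shared across sentences after reset
reps = input_representation(doc, word_emb, seg_emb, pos_emb)
```

Note that "hello" at position 0 of the second sentence reuses `pos_emb[0]`, differing from its first occurrence only by the segment embedding, which is exactly the reset behavior described above.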
$\mathrm{MultiHead}(\cdot)$ means the attention is performed in a multi-headed fashion [Vaswani et al., 2017]. We let the input representations $\tilde{x}_k$ be the 0-th layer representations $h_k^0$, and we take the $(N-1)$-th layer of the local encoder as the local context of each sentence, i.e., $h_k^L = h_k^{N-1}$.

Global Context Encoding

We add an additional layer on top of the local context encoding layers, which retrieves global context from the entire document by a segment-level relative attention, and outputs final representations based on hybrid local and global context via a gated context fusion mechanism.

Segment-level Relative Attention. Given the local representations of each sentence, we propose to extend relative attention [Shaw et al., 2018] from the token level to the segment level to model inter-sentence global context:

$$h^G = \mathrm{MultiHead}(\mathrm{Seg\text{-}Attn}(h^L, h^L, h^L)),$$

where $\mathrm{Seg\text{-}Attn}(Q, K, V)$ denotes the proposed segment-level relative attention. Taking $x_{k,i}$ as the query, its contextual representation $z_{k,i}$ under the proposed attention is computed over all words (e.g., $x_{\kappa,j}$) in the document with respect to the sentence (segment) they belong to:

$$z_{k,i} = \sum_{\kappa=1}^{n} \sum_{j=1}^{|x_\kappa|} \alpha_{k,i}^{\kappa,j} \, (W^V x_{\kappa,j} + \gamma^V_{k-\kappa}), \qquad \alpha_{k,i}^{\kappa,j} = \mathrm{softmax}(e_{k,i}^{\kappa,j}),$$

where $\alpha_{k,i}^{\kappa,j}$ is the attention weight of $x_{k,i}$ to $x_{\kappa,j}$. The corresponding attention logit $e_{k,i}^{\kappa,j}$ is computed with respect to the relative sentence distance by:

$$e_{k,i}^{\kappa,j} = (W^Q x_{k,i})(W^K x_{\kappa,j} + \gamma^K_{k-\kappa})^{\top} / \sqrt{d_z}, \qquad (1)$$

where $\gamma^{*}_{k-\kappa}$ is a parameter vector corresponding to the relative distance between the $k$-th and $\kappa$-th sentences, providing inter-sentential clues, and $W^Q$, $W^K$, $W^V$ are linear projection matrices for the queries, keys and values, respectively.
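A pure-Python sketch of this global-encoding step, under simplifying assumptions: a single head, the projections $W^Q$, $W^K$, $W^V$ already applied to the inputs, LayerNorm omitted, and the γ biases stored in a dictionary keyed by the relative sentence distance $k-\kappa$. The gate in `gated_fusion` corresponds to the fusion mechanism described next; all names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def seg_rel_attn(q, k_idx, doc_keys, doc_vals, gamma_K, gamma_V):
    """One query of Eq.(1): every key and value receives a bias that depends
    only on the relative sentence distance k_idx - kappa."""
    d_z = len(q)
    logits, vals = [], []
    for kappa, (keys, values) in enumerate(zip(doc_keys, doc_vals)):
        gk, gv = gamma_K[k_idx - kappa], gamma_V[k_idx - kappa]
        for key, val in zip(keys, values):
            logits.append(dot(q, [a + b for a, b in zip(key, gk)]) / math.sqrt(d_z))
            vals.append([a + b for a, b in zip(val, gv)])
    alpha = softmax(logits)  # attention over every token in the document
    return [sum(a * v[d] for a, v in zip(alpha, vals)) for d in range(d_z)]

def gated_fusion(h_local, h_global, gate):
    """h = (1 - g) * h_local + g * h_global, element-wise; in the model the
    gate g comes from a sigmoid over [h^L; h^G], here it is given directly."""
    return [(1 - g) * l + g * gl for g, l, gl in zip(gate, h_local, h_global)]

# Two one-token sentences with 2-d states; zero biases reduce to plain attention.
zeros = {d: [0.0, 0.0] for d in (-1, 0, 1)}
z = seg_rel_attn([1.0, 0.0], 0,
                 doc_keys=[[[0.0, 0.0]], [[0.0, 0.0]]],  # equal keys -> uniform weights
                 doc_vals=[[[1.0, 0.0]], [[3.0, 0.0]]],
                 gamma_K=zeros, gamma_V=zeros)
```

With equal keys and zero biases the weights are uniform, so `z` is simply the average of the two value vectors; non-zero γ entries would shift attention toward or away from particular sentence distances.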
Gated Context Fusion. After the global context is retrieved, we adopt a gating mechanism to obtain the final encoder representations $h$ by fusing local and global context:

$$g = \sigma(W_g [h^L; h^G]),$$
$$h = \mathrm{LayerNorm}\big((1 - g) \odot h^L + g \odot h^G\big),$$

where $W_g$ is a learnable linear transformation, $[\cdot\,;\cdot]$ denotes concatenation, $\sigma(\cdot)$ is the sigmoid activation, which constrains the value of the fusion gate to lie between 0 and 1, and $\odot$ indicates element-wise multiplication.

The goal of the decoder is to generate translations sentence by sentence, considering the previously generated sentences as the target global context. A natural idea is to store the hidden states of previous target translations and allow the self-attention of the decoder to access these hidden states as extended history context. For this purpose, we leverage and extend Transformer-XL [Dai et al., 2019] as the decoder. Transformer-XL is a Transformer variant designed to cache and reuse the hidden states computed for the previous segment as extended context, so that long-term dependency information occurring many words back can propagate through the recurrent connections between segments; this exactly meets our requirement of generating document-long text. We cast each sentence as a "segment" in translation tasks and equip the Transformer-XL-based decoder with cross-attention to retrieve time-dependent source context for the current sentence. Formally, given two consecutive sentences $y_k$ and $y_{k-1}$, the $l$-th layer of our decoder first performs self-attention over the extended history context:

$$\tilde{s}_k^{l-1} = [\mathrm{SG}(s_{k-1}^{l-1}); s_k^{l-1}],$$
$$\bar{s}_k^l = \mathrm{MultiHead}(\mathrm{Rel\text{-}SelfAttn}(s_k^{l-1}, \tilde{s}_k^{l-1}, \tilde{s}_k^{l-1})),$$
$$\bar{s}_k^l = \mathrm{LayerNorm}(\bar{s}_k^l + s_k^{l-1}),$$

where $\mathrm{SG}(\cdot)$ stands for stop-gradient, and $\mathrm{Rel\text{-}SelfAttn}(Q, K, V)$ is a variant of self-attention with word-level relative position encoding. For more details, please refer to [Dai et al., 2019].
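The reuse of previous-sentence states can be sketched as a simple cache. Copying the states plays the role of the stop-gradient $\mathrm{SG}(\cdot)$ (no gradient would flow into the history); all names are illustrative, and real states would be layer-wise tensors rather than small lists.

```python
class ExtendedHistory:
    """Cache the hidden states of the previously translated sentence and
    prepend them, as constants, to the current sentence's states before
    self-attention: s~_k = [SG(s_{k-1}); s_k]."""

    def __init__(self):
        self._prev = []

    def extended_context(self, cur_states):
        # Keys/values span the cached history plus the current sentence.
        return self._prev + cur_states

    def update(self, cur_states):
        # "Detach": copy the values so nothing links back to the live states.
        self._prev = [list(s) for s in cur_states]

cache = ExtendedHistory()
cache.update([[0.1], [0.2]])           # states of sentence y_{k-1}
ctx = cache.extended_context([[0.3]])  # attention context for sentence y_k
```

In the full model one such cache exists per decoder layer, so information from many sentences back can propagate through the recurrence between segments.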
After that, the cross-attention module fetches the source context from the encoder representations $h_k$:

$$\hat{s}_k^l = \mathrm{MultiHead}(\mathrm{CrossAttn}(\bar{s}_k^l, h_k, h_k)),$$
$$s_k^l = \mathrm{LayerNorm}(\mathrm{FeedForward}(\hat{s}_k^l) + \hat{s}_k^l).$$

Given the final representations $s_k^N$ of the last decoder layer, the probability of the current target sentence $y_k$ is computed token by token as $p(y_k \mid y_{<k}, X; \theta) = \prod_t p(y_{k,t} \mid y_{k,<t}, y_{<k}, X; \theta)$.

We list the results of our experiments in Table 1, comparing four context-aware NMT models: Document-aware Transformer [Zhang et al., 2018, DocT], Hierarchical Attention NMT [Miculicich et al., 2018, HAN], Selective Attention NMT [Maruf et al., 2019, SAN] and Query-guided Capsule Network [Yang et al., 2019, QCN]. As shown in Table 1, by leveraging document context, our proposed model obtains gains of 2.1, 2.0, 2.5, and 1.0 BLEU over the sentence-level Transformer baselines on the TED ZH-EN, TED EN-DE, News and Europarl datasets, respectively. The last two corpora are from Maruf et al. [2019].

Model                              Δ|θ|   v_train  v_test  TED ZH-EN  TED EN-DE  News   Europarl  avg.
SentNMT [Vaswani et al., 2017]     0.0m   1.0×     1.0×    –          –          –      –         –
DocT [Zhang et al., 2018]          9.5m   0.65×    –       n/a        24.00      23.08  29.32     25.46
HAN [Miculicich et al., 2018]      4.8m   0.32×    –       –          –          –      –         –
SAN [Maruf et al., 2019]           4.2m   0.51×    –       n/a        24.42      24.84  29.75     26.33
QCN [Yang et al., 2019]            n/a    n/a      n/a     –          –          –      –         –
Ours                               –      –        –       –          –          –      –         –

Table 1: Experimental results of our model in comparison with several baselines, including the increment in the number of parameters over the Transformer baseline (Δ|θ|), training/testing speeds (v_train / v_test, some derived from Maruf et al. [2019]), and translation results on the test sets in BLEU.

Model                                 Test
SentNMT                               17.0
DocNMT (documents as input/output)    14.2
HAN [Miculicich et al., 2018]         15.6
Ours                                  –

Table 2: BLEU scores of sentence-level (single-sentence) translation on TED ZH-EN.

Among them, our model achieves new state-of-the-art results on TED ZH-EN and Europarl, showing the superiority of exploiting the whole document context.
Though our model is not the best on the TED EN-DE and News tasks, it is still comparable with QCN and HAN, and it achieves the best average performance on the English-German benchmarks, by at least 0.47 BLEU over the best previous model. We suggest this is probably because we did not apply the two-stage training scheme used in Miculicich et al. [2018] or the regularizations introduced in Yang et al. [2019]. In addition, while training speed is sacrificed, the parameter increment and decoding speed remain manageable.

Sentence-level Translation. We compare performance on single-sentence translation in Table 2, which demonstrates the good compatibility of our proposed model with both document and sentence translation, whereas the performance of the other approaches lags far behind the sentence-level baseline. The reason is that the previous approaches require document context as a separate input stream, while ours does not. This difference ensures feasibility for both document- and sentence-level translation within one unified framework. Therefore, our model can be directly used in general translation tasks with input text of any number of sentences, which makes it more deployment-friendly.

Does Bilingual Context Really Matter? Yes. To investigate how important the bilingual context is, and the corresponding contribution of each component, we summarize the ablation study in Table 3. First of all, directly using the entire document as input and output cannot even produce a document translation with the same number of sentences as the source document, and it is much worse than the sentence-level baseline and our model in terms of document-level BLEU. For source context modeling, merely casting the whole source document as one input sequence (Doc2Sent) does not work. Meanwhile, resetting word positions and introducing segment embeddings for each sentence alleviate this problem, which verifies one of our motivations: the model should focus more on the local sentence. Moreover, the gains from the segment-level relative attention and the gated context fusion mechanism demonstrate that retrieving and integrating the source global context is useful for document translation. As for the target context, employing the Transformer-XL decoder to exploit the target global context of the translation history also leads to better performance on document translation. This somewhat contradicts Zhang et al. [2018], who claim that using target context leads to error propagation. Finally, by jointly modeling both source and target context, our final model obtains the best performance.

Model                                        BLEU (BLEU_doc)
SentNMT [Vaswani et al., 2017]               11.4 (21.0)
DocNMT (documents as input/output)           n/a (17.0)
Modeling source context:
Doc2Sent                                     6.8
+ reset word positions for each sentence     10.0
+ segment embedding                          10.5
+ segment-level relative attention           12.2
+ context fusion gate                        12.4
Modeling target context:
Transformer-XL decoder [Sent2Doc]            12.4
Final model [Ours]                           12.9 (24.4)

Table 3: Ablation study on modeling context, on the TED ZH-EN development set. "Doc" means using an entire document as one sequence for input or output. BLEU_doc indicates the document-level BLEU score calculated on the concatenation of all output sentences.

Effect of Quantity of Context: the More, the Better. We also experiment to show how the quantity of context affects our model in document translation. As shown in Figure 3, we find that providing only one adjacent sentence as context already helps document translation, and that the more context is given, the better the translation quality, although there does seem to be an upper limit of 20 sentences. Successfully incorporating context of this size is something related work has not achieved [Zhang et al., 2018; Miculicich et al., 2018; Yang et al., 2019].
We attribute this advantage to our hierarchical model design, which gains more than it loses from the increasingly noisy global context, guided by the well-formed, uncorrupted local context.

Figure 3: BLEU score with respect to the quantity of context on TED ZH-EN.

Effect of Transfer Learning: Data Hunger Remains a Problem for Document-level Translation. Due to the limited document-level parallel data, exploiting sentence-level parallel corpora or monolingual document-level corpora has drawn increasing attention. We investigate transfer learning (TL) approaches on TED ZH-EN. We pretrain our model on the WMT18 ZH-EN sentence-level parallel corpus with 7m sentence pairs, where every single sentence is regarded as a document. We then finetune the pretrained model on the TED ZH-EN document-level parallel data (source & target TL). We also compare to a variant in which only the encoder is initialized (source TL). As shown in Table 4, transfer learning can help alleviate the need for document-level data in the source and target languages to some extent. However, the scarcity of document-level parallel data still prevents document-level NMT from scaling up.

Model                                  Dev    Test
Transformer [Vaswani et al., 2017]     11.4   17.0
BERT+MLM [Li et al., 2019]             n/a    20.7
Ours                                   –      –
Ours + source TL                       13.9   19.7
Ours + source & target TL              14.9   21.3

Table 4: Effect of transfer learning (TL).

What Does the Model Learn about Context? A Case Study. Furthermore, we are interested in what the proposed model learns about context. In Figure 4, we visualize the sentence-to-sentence attention weights of a source document based on the segment-level relative attention. Formally, the weight of the $k$-th sentence attending to the $\kappa$-th sentence is computed by $\alpha_k^{\kappa} = \frac{1}{|x_k|} \sum_i \sum_j \alpha_{k,i}^{\kappa,j}$, where $\alpha_{k,i}^{\kappa,j}$ is defined by Eq.(1).
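This aggregation can be sketched directly from a token-to-token attention matrix, with sentences given as half-open token ranges; since every token's attention weights sum to 1, averaging over the $|x_k|$ query tokens keeps each sentence row summing to 1 as well. The sketch below uses illustrative names and a toy matrix.

```python
def sentence_attention(token_attn, spans):
    """A[k][kappa] = (1/|x_k|) * sum_{i in sent k} sum_{j in sent kappa} alpha[i][j].
    token_attn rows sum to 1, so rows of the returned matrix also sum to 1."""
    A = []
    for (ks, ke) in spans:
        row = []
        for (cs, ce) in spans:
            w = sum(token_attn[i][j] for i in range(ks, ke) for j in range(cs, ce))
            row.append(w / (ke - ks))  # average over the |x_k| query tokens
        A.append(row)
    return A

# Three tokens split into two sentences: tokens [0, 2) and [2, 3).
attn = [[0.2, 0.3, 0.5],
        [0.1, 0.1, 0.8],
        [0.4, 0.4, 0.2]]
A = sentence_attention(attn, [(0, 2), (2, 3)])
```

The resulting matrix is exactly what Figure 4 visualizes: each row is a sentence's attention distribution over all sentences of the document.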
As shown in Figure 4, we find very interesting patterns (which are also prevalent in other cases): 1) the first two sentences (blue frame), which contain the main topic and idea of a document, seem to be very useful context for all sentences; 2) the adjacent previous and subsequent sentences (red and purple diagonals, respectively) draw dense attention, which indicates the importance of surrounding context; 3) although the surrounding context is crucial, the subsequent sentence significantly outweighs the previous one. This may imply that the lack of target future information, but the availability of past information, in the decoder forces the encoder to retrieve more knowledge about the next sentence than about the previous one; 4) the model seems not to care much about the current sentence, probably because the local context can flow through the context fusion gate, so the segment-level relative attention focuses on fetching useful global context; 5) the 6-th sentence also draws attention from all the others (brown frame), and may thus play a special role in the inspected document.

Figure 4: Visualization of sentence-to-sentence attention based on segment-level relative attention. Each row represents a sentence, while each column represents another sentence to be attended. The weights of each row sum to 1.

Analysis on Discourse Phenomena. We also want to examine whether the proposed model actually learns to utilize document context to resolve discourse inconsistencies that context-agnostic models cannot handle. We use the contrastive test sets for the evaluation of discourse phenomena for English-Russian by Voita et al. [2019b]. There are four test sets in the suite, regarding deixis, lexical consistency, ellipsis (inflection), and ellipsis (verb phrase).

Model                   deixis   lex.c.   ell.infl.   ell.VP
SentNMT                 –        –        –           –
Ours                    –        –        –           –
Voita et al. [2019b]∗   –        –        –           –

Table 5: Accuracy (%) of discourse phenomena. ∗Different data and system conditions; only for reference.
Each test set contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variants. We summarize the results in Table 5. Our model is better at resolving discourse consistencies than the context-agnostic baseline. Voita et al. [2019b] use a context-agnostic baseline, trained on larger data, to generate first-pass drafts and then perform post-processing; this is not directly comparable, but could easily be incorporated with our model to achieve better results.

Conclusion

In this paper, we propose a unified local and global NMT framework which can successfully exploit context regardless of how many sentences are in the input. Extensive experimentation and analysis show that our model has indeed learned to leverage a larger context. In future work, we will investigate the feasibility of extending our approach to other document-level NLP tasks, e.g., summarization.

Acknowledgements

Shujian Huang is the corresponding author. This work was supported by the National Science Foundation of China (No. U1836221, 61772261, 61672277). Zaixiang Zheng was also supported by the China Scholarship Council (No. 201906190162). Alexandra Birch was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No 825299 (GoURMET) and by the UK EPSRC fellowship grant EP/S001271/1 (MTStretch).

References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[Bawden et al., 2018] Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In NAACL-HLT, 2018.

[Dai et al., 2019] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan R. Salakhutdinov.
Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.

[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.

[Jean et al., 2017] Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135, 2017.

[Junczys-Dowmunt, 2019] Marcin Junczys-Dowmunt. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In WMT, 2019.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

[Kuang et al., 2018] Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. Modeling coherence for neural machine translation with dynamic and topic caches. In COLING, 2018.

[Läubli et al., 2018] Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In EMNLP, 2018.

[Li et al., 2019] Liangyou Li, Xin Jiang, and Qun Liu. Pretrained language models for document-level neural machine translation. arXiv preprint, 2019.

[Maruf and Haffari, 2018] Sameen Maruf and Gholamreza Haffari. Document context neural machine translation with memory networks. In ACL, 2018.

[Maruf et al., 2019] Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. Selective attention for context-aware neural machine translation. In NAACL-HLT, 2019.

[Miculicich et al., 2018] Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. In EMNLP, 2018.

[Müller et al., 2018] Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In WMT, 2018.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

[Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, 2018.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[Tan et al., 2019] Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierarchical modeling of global context for document-level neural machine translation. In EMNLP-IJCNLP, 2019.

[Tiedemann and Scherrer, 2017] Jörg Tiedemann and Yves Scherrer. Neural machine translation with extended context. In DiscoMT, 2017.

[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[Voita et al., 2018] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In ACL, 2018.

[Voita et al., 2019a] Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware monolingual repair for neural machine translation. In EMNLP-IJCNLP, 2019.

[Voita et al., 2019b] Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In ACL, 2019.

[Wang et al., 2017] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In EMNLP, 2017.

[Xiong et al., 2019] Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In AAAI, 2019.

[Yang et al., 2019] Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. In EMNLP-IJCNLP, 2019.

[Zhang et al., 2018] Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the Transformer translation model with document-level context. In EMNLP, 2018.