Neural Machine Translation with Extended Context
Jörg Tiedemann and Yves Scherrer
Department of Modern Languages, University of Helsinki
Abstract
We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and are surprisingly robust with respect to translation quality. In this pilot study, we observe interesting cross-sentential attention patterns that improve textual coherence in translation, at least in some selected cases.
Typical models of machine translation handle sentences in isolation and discard any information beyond sentence boundaries. Efforts to make statistical MT aware of discourse-level phenomena have proven difficult (Hardmeier, 2012; Carpuat and Simard, 2012; Hardmeier et al., 2013a). Various studies have considered textual coherence, document-wide translation consistency, the proper handling of referential elements such as pronominal anaphora, and other discourse-level phenomena (Guillou, 2012; Russo et al., 2012; Voigt and Jurafsky, 2012; Xiong et al., 2013a; Ben et al., 2013; Xiong and Zhang, 2013; Xiong et al., 2013b; Loaiciga et al., 2014). The typical approach in the literature focuses on the development of task-specific components that are often tested as standalone modules and need to be integrated with MT decoders (Hardmeier et al., 2013b). Modest improvements could, for example, be shown for the translation of pronouns (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Hardmeier, 2014) and the generation of appropriate discourse connectives (Meyer et al., 2012). Textual coherence is also often tackled in terms of translation consistency for domain-specific terminology, based on the one-translation-per-discourse principle (Carpuat, 2009; Tiedemann, 2010; Ma et al., 2011; Ture et al., 2012). Overall, none of these ideas led to significant improvements in translation quality. Besides, the development of task- and problem-specific models that work independently of the general translation task is not very satisfactory. However, the recent success of neural machine translation opens new possibilities for tackling discourse-related phenomena in a more generic way.
In this paper, we present a pilot study that looks at simple ideas for extending the context in the framework of standard attention-based encoder-decoder models. The purpose of the paper is to identify the capabilities of NMT to discover cross-sentential dependencies without explicit annotation or guidance. In contrast to related work that modifies the neural MT model with an additional context encoder and a separate attention mechanism (Jean et al., 2017), we keep the standard setup and only modify the input and output segments. We run a series of experiments with different context windows and discuss the effect of additional information on translation and attention.
Encoder-decoder models with attention have been proposed by Bahdanau et al. (2014) and have become the de-facto standard in neural machine translation. The model is based on recurrent neural network layers that encode a given sentence in the source language into a distributed vector representation, which is then decoded into the target language by another recurrent network. The attention model makes use of the entire encoding sequence, and the attention weights specify the proportions with which information from different positions is combined. This is a very powerful mechanism that makes it possible to handle arbitrarily long sequences without limiting the capacity of the internal representation. Previous work has shown that NMT models can successfully learn attention distributions that explain intuitively plausible connections between source and target language. This framework is very well suited for our study, as we emphasise the capability of NMT to pick up contextual dependencies from wider context across sentence boundaries.

In our work, we rely on the freely available Helsinki NMT system (HNMT) (Östling et al., 2017), which implements a hybrid bidirectional encoder with character-level backoff (Luong and Manning, 2016) using recurrent LSTM units (Hochreiter and Schmidhuber, 1997). The system also features layer normalisation (Ba et al., 2016), variational dropout (Gal and Ghahramani, 2016), coverage penalties (Wu et al., 2016), beam search decoding and straightforward model ensembling. The backbone is Theano, which enables efficient GPU-based training and decoding with mini-batches.

In our experiments, we focus on the translation of movie subtitles, and in particular on translation from German to English. The choice of languages is rather arbitrary and mainly due to better comprehension for our qualitative inspections.
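The attention mechanism described above can be sketched in a few lines. The following is a minimal numpy illustration of Bahdanau-style additive attention, not the HNMT implementation; the matrix names `W_a`, `U_a`, `v_a` and the toy dimensions are our own assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_a, U_a, v_a):
    """Score each encoder position against the current decoder state,
    normalise the scores with softmax, and return the weighted sum of
    encoder states (the context vector) plus the weights themselves."""
    # enc_states: (src_len, enc_dim), dec_state: (dec_dim,)
    scores = np.tanh(enc_states @ U_a + dec_state @ W_a) @ v_a  # (src_len,)
    weights = softmax(scores)        # attention distribution over the source
    context = weights @ enc_states   # (enc_dim,) combined representation
    return context, weights

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 6, 8, 8, 5
enc = rng.normal(size=(src_len, enc_dim))
dec = rng.normal(size=(dec_dim,))
W_a = rng.normal(size=(dec_dim, att_dim))
U_a = rng.normal(size=(enc_dim, att_dim))
v_a = rng.normal(size=(att_dim,))
ctx, w = additive_attention(dec, enc, W_a, U_a, v_a)
```

Because the weights form a distribution over all source positions, the same mechanism applies unchanged when the source sequence is extended with context from previous sentences; the model merely has to learn where to place the mass.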
There are relevant discourse phenomena that need to be considered for English and German, for example referential pronouns with grammatical agreement requirements. The choice of movie subtitles has several reasons: First of all, large quantities of training data are available, a necessary prerequisite for neural MT. Secondly, subtitles expose significant discourse relations and cross-sentential dependencies. Referential elements are common, as subtitles usually represent coherent stories with narrative structures, dialogues and natural interactions. Proper translation in this context typically requires more than just the text, namely information from the plot and the audiovisual context. However, as those types of information are not available, we hope that extended context at least helps to incorporate more knowledge about the situation and consequently leads to better translations, also stylistically. (The HNMT system is available at https://github.com/robertostling/hnmt.) The final advantage of subtitles is the size of the translation units. Sentences (and sentence fragments) are typically much shorter than in other genres such as newspaper texts or other edited written material. Utterances are even shortened substantially due to space limitations. This property supports our experiments, in which we want to include context beyond sentence boundaries. Like statistical MT, neural MT also struggles most with long sequences, and it is therefore important to keep the segments short. On average, there are about 8 tokens per language in each aligned translation unit (which may cover one or more sentences or sentence fragments).

In particular, we use the publicly available OpenSubtitles2016 corpus (Lison and Tiedemann, 2016) for German and English and reserve 400 randomly selected movies for development and testing purposes. In total, there are 16,910 movies and TV series in the collection. We tokenized and truecased the data sets using standard tools from the Moses toolbox (Koehn et al., 2007).
The final corpus comprises 13.9 million translation units with about 107 million tokens in German and 115 million tokens in English. The training data includes 13.5 million training instances, and we selected the first 5,000 translation units of the test set for automatic evaluation. Note that we trust the alignment and do not correct any possible alignment errors in the data.

We propose to simply extend the context when training models (and translating data). This does not require any changes to the model itself; we let the training procedures discover what kind of information is needed for the translation. We evaluate two models that extend context in different ways:
Extended source: Include context from the previous sentences in the source language to improve the encoder part of the network.

Extended translation units: Increase the segments to be translated. Larger segments in the source language have to be translated into corresponding units in the target language.

SOURCE | TARGET
cc sieh cc , cc Bob cc ! -Wo sind sie ? | - Where are they ?
cc -Wo cc sind cc sie cc ? siehst du sie ? | do you see them ?
cc siehst cc du cc sie cc ? -Ja . | - Yes .

Figure 1: Example of data with extended source language context.

SOURCE | TARGET
sieh , Bob ! BREAK -Wo sind sie ? | look , Bob ! BREAK - Where are they ?
-Wo sind sie ? BREAK siehst du sie ? | - Where are they ? BREAK do you see them ?
siehst du sie ? BREAK -Ja . | do you see them ? BREAK - Yes .

Figure 2: Example of data with extended translation units.

(The OpenSubtitles2016 corpus is available at http://opus.lingfil.uu.se/OpenSubtitles2016.php.)
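The sliding-window construction behind Figures 1 and 2 can be sketched as follows. This is a simplified illustration: the function names are ours, real preprocessing operates on BPE segments rather than whitespace tokens, and each movie is processed separately so that its first unit has no context:

```python
def extend_source(units, prefix="cc"):
    """2+1 style: prepend the previous source unit, marking each of its
    tokens with a context prefix. The first unit has no context."""
    out, prev = [], None
    for src, tgt in units:
        ctx = " ".join(prefix + " " + tok for tok in prev.split()) + " " if prev else ""
        out.append((ctx + src, tgt))
        prev = src
    return out

def extend_units(units, brk="BREAK"):
    """2+2 style: concatenate the previous unit on both source and
    target side, separated by an explicit segment-break token."""
    out, prev = [], None
    for src, tgt in units:
        if prev is None:
            out.append((src, tgt))
        else:
            out.append((prev[0] + f" {brk} " + src, prev[1] + f" {brk} " + tgt))
        prev = (src, tgt)
    return out

movie = [("sieh , Bob !", "look , Bob !"),
         ("-Wo sind sie ?", "- Where are they ?"),
         ("siehst du sie ?", "do you see them ?")]
print(extend_source(movie)[1][0])  # cc sieh cc , cc Bob cc ! -Wo sind sie ?
```

Applied to the example from Figure 2, `extend_units(movie)[1]` reproduces the pair "sieh , Bob ! BREAK -Wo sind sie ?" / "look , Bob ! BREAK - Where are they ?".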
Model 2+1 (extended source):
In order to keep the segments as short as possible, we limit ourselves to one contextual unit. Hence, in the first setup, we add the source language sentence(s) from the previous translation unit to the sentence to be translated and mark all tokens (BPE segments in our case) with a special prefix (cc) to indicate that they come from contextual information. We also test a second model without prefix-marked context words but with additional sentence-break tokens between the source language units (similar to model 2+2 below). In that case, we do not differentiate between contextual words and sentence-internal words, which makes it possible to treat intra-sentential anaphora in the same way as cross-sentential ones. We run through the training data with a sliding window, adding the contextual history to each sentence in the corpus. Note that we have to make sure that each movie starts without context. Figure 1 shows a few examples from our test set with the prefix markup described above.

The task now consists in learning the influence of specific context word sequences on the translation of the focus sentence. An example is the ambiguous pronoun "sie", which can be a feminine singular or a plural third-person pronoun. The use of grammatical gender in German also makes it possible to refer to an inanimate antecedent. Discourse-level information is needed to make correct decisions. The question is whether our model can actually pick this up and whether attention patterns can show the relevant connections.

Model 2+2 (extended translation units):
In the second setup, we simply add the previous translation unit to extend the context in both source and target during training. With this model, the decoder also has to generate more content, but it is probably less likely to confuse information from different positions, as it simply translates larger units. Another advantage is that target-language-specific dependencies like grammatical agreement between referential expressions may be captured if they cannot be determined by the source language alone. As above, we run through the training data with a sliding window and create extended training examples, marking the boundaries between the segments with a special token BREAK. Figure 2 shows the example from the test data.

The NMT models that we train rely on subword units. We apply standard byte-pair encoding (BPE) (Sennrich et al., 2016) for splitting words into segments. For the extended source context models, we set a vocabulary size of 30,000 when training BPE codes and apply a vocabulary size of 60,000 when training the models (context words double the vocabulary because of their cc prefix). For the 2+2 model, we train BPE codes from both languages together (with a size of 60,000) and set a vocabulary threshold of 50 when applying BPE to the data.

We train attention-based models using the Helsinki NMT system with similar parameters but different training data to see the effect of contextual information. Our baseline system involves a standard setup where the training examples come from the aligned parallel subtitle corpus (1 source translation unit and 1 target translation unit). This will be the reference in our evaluations and discussions. In all cases, we translate the test set of 5,000 sentences with an ensemble model consisting of the final four savepoint models, after running roughly the same number of training iterations with similar amounts of training instances seen by the model. Savepoint averaging slightly alleviates the problem that each model will differ due to the stochastic nature of the training procedures, which makes a direct comparison of the outcomes difficult, especially if the observed differences are small.

Automatic evaluation metrics are problematic, in particular for assessing discourse-related phenomena. However, it is important to verify that the context models are on par with the baseline. Table 1 shows the BLEU scores and the alternative character-level chrF3 measure for all systems (2+1 in its two variants, with and without prefix markup). The 2+2 model is evaluated on the last segment of the generated output, ignoring all preceding parts.

in % | BLEU | chrF3
baseline | 27.1 | 42.9
Table 1: Automatic evaluation: BLEU and chrF3 (including precision and recall).

The table shows that all models are quite similar to each other, with a slightly higher BLEU score for the 2+1 system with sentence breaks. The chrF3 score is also slightly higher for both the 2+1 and 2+2 systems with sentence breaks, due to a higher recall. The differences are small, but the results already show that the system is capable of handling larger units without harming the performance, and additional improvements are possible. Let us now look at some details to study the effects of contextual information on translation output.
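chrF3 is a character n-gram F-score in which recall is weighted three times as heavily as precision (β = 3). The following is a minimal sketch of the idea, not the reference implementation (which differs in details such as whitespace handling and smoothing):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Average character n-gram F-beta score; beta > 1 favours recall,
    so missing reference content is penalised more than extra output."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        p = overlap / sum(hyp.values())
        r = overlap / sum(ref.values())
        if p + r == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * p * r / (beta**2 * p + r))
    return sum(scores) / len(scores) if scores else 0.0
```

With β = 3, an under-generating hypothesis scores worse than an over-generating one of the same overlap, which is why the higher recall of the sentence-break systems shows up directly in their chrF3 scores.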
The most difficult part for the model in the 2+1 setup is to learn to ignore most of the contextual information when generating the target language output. In other words, the attention model needs to learn to focus on the words and word sequences that are relevant to the translation process. It is interesting to see that the system is actually able to do that and produces adequate translations even though a lot of extra information is given in the source. There is certainly some confusion at the beginning of the training process, but the model figures out surprisingly quickly what kind of information to consider and what information to discard.

It is interesting to see, of course, how much of the contextual information is still used and where. For this, we looked at the distribution of attention in the whole data set, for individual sentences and for individual target words. The total proportion of attention that goes to the contextual history is about 7.1%. This is small, as expected, but certainly not negligible. When sorting by contextual attention, some sentences actually show quite high proportions of attention going to the previous context. They mainly correspond to translations that include information from the previous history, or rather creative translations that are less faithful to the original source. An example is given below (context in parentheses):

input: (Danke , Mr. Vadas .) Mr. Kralik , kommen Sie bitte ins Büro . ich möchte Sie sprechen .
transl.: Mr. Kralik , please come to the office , I want to talk to you .
input: (Mr. Kralik , kommen Sie bitte ins Büro . ich möchte Sie sprechen .) ja .
transl.: Yes , I want to speak to you .
The second sentence to be translated ("ja .") is filled with a repetition from the contextual history. The part "I want to speak to you" is indeed mostly linked to the German "ich möchte Sie sprechen" from the history. Such repetitions may feel quite natural (for example, if the speaker is the same and would like to stress the previous request), and one is tempted to say that the model picks this possibility up from the data where such examples occur. However, such cases seem to occur especially in connection with multiple sentences in the source context. The following translation illustrates another interesting case with two context sentences. Figure 3 shows the attention pattern, in which the model replaced the referential "Sie" from the source sentence by "my lady" from the previous context.

Figure 3: Attention with extended source context. Words from the contextual history are in parentheses.

Similarly, the following example shows again how information from the context is merged with the current sentence to be translated:

input: (Pirovitch .) - Hm ? - Wollen Sie was Nettes hören ?
transl.: - You want to hear something nice ?
input: (- Hm ? - Wollen Sie was Nettes hören ?) Was denn ?
transl.: - What do you want to hear ?

The attention heatmap in Figure 4 nicely illustrates how the translation picks up from the conversation history. Once again, this kind of mix could be possible if the speaker stays the same, but probably this is not the case here, and the translation is altered in such a way that it becomes incorrect in this context. These observations suggest that additional information such as speaker identities or dialogue turns will be necessary to handle such cases correctly.

Figure 4: Another example of attention with extended source context.

The examples above constitute rather anecdotal evidence, and systematic patterns are difficult to extract. We leave it to future work to study various cases in more detail and to inspect certain properties in connection with specific discourse phenomena. In this paper, we instead inspect the distribution of attention for individual target words, to see which word types depend most on the contextual history. For this, we counted the overall attention of each word type in our test set and computed the average proportion of external attention. The list of the top ten words (after lowercasing) with frequency above four is given in Table 2. Those words receive considerably larger external attention (17-26%) than the average (4.9%).
word | freq | external | internal | prop.% | ∅ pos.
yeah | 35 | 0.224 | 0.622 | 26.5 | 3.71
yes | 182 | 0.212 | 0.601 | 26.1 | 4.22
wake | 6 | 0.239 | 0.684 | 25.8 | 6.67
anywhere | 6 | 0.223 | 0.655 | 25.4 | 7.67
course | 35 | 0.191 | 0.631 | 23.2 | 3.17
oh | 61 | 0.199 | 0.712 | 21.9 | 2.08
saying | 5 | 0.177 | 0.690 | 20.5 | 5.20
tired | 9 | 0.174 | 0.774 | 18.3 | 5.67
latham | 5 | 0.169 | 0.796 | 17.5 | 7.80
really | 13 | 0.161 | 0.763 | 17.4 | 2.77
average | — | 0.045 | 0.891 | 4.9 | —
(36) she | 98 | 0.124 | 0.837 | 12.9 | 3.70
(62) he | 232 | 0.103 | 0.851 | 10.8 | 4.04
(79) it | 533 | 0.089 | 0.807 | 10.0 | 4.81
(83) they | 135 | 0.095 | 0.871 | 9.9 | 4.17
(97) you | 1349 | 0.084 | 0.828 | 9.2 | 4.28

Table 2: Word types with the highest external attention and the rank of some cross-lingually ambiguous pronouns in the list sorted by the proportion (prop.) of external attention. ∅ pos. gives the average token position of the target word.

Unfortunately, there is no straightforward interpretation of the words that receive substantial attention from the extended contextual history, but several response particles such as "yes", "yeah" and "oh", which glue together interactive dialogues, are in the list. Furthermore, we can see that the words with significant cross-sentential attention are not only sentence-initial words; the token position varies quite a lot. We also list the values of pronouns with significant cross-lingual ambiguity and their rank in the list sorted by the proportion of external attention. The third-person pronouns "he", "she" and "it" put significant attention (over 10%) on the previous sentence(s).

Some words are simply not easy to link to particular source language words and, therefore, their attention may be spread all over the place. Therefore, we also computed the proportion of external attention at specific positions in the input, by considering only the highest internal and the highest external attention for each target word in each sentence.
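Both word-type statistics can be computed in one pass over the attention matrices. The sketch below is our own reconstruction of the two measures (total attention mass as in Table 2, and only the per-occurrence attention peaks as in Table 3); the data format is an assumption:

```python
import numpy as np
from collections import defaultdict

def attention_statistics(sentences):
    """Per target word type, accumulate (a) total attention mass on
    external (context) vs. internal source positions and (b) only the
    highest external and internal weight per occurrence. Each sentence
    is (target_tokens, attn, n_context), with attn of shape
    (len(target_tokens), src_len); the first n_context source columns
    belong to the contextual history."""
    ext, intern = defaultdict(float), defaultdict(float)
    ext_pk, int_pk = defaultdict(float), defaultdict(float)
    for tokens, attn, n_ctx in sentences:
        for i, tok in enumerate(tokens):
            ext[tok] += attn[i, :n_ctx].sum()
            intern[tok] += attn[i, n_ctx:].sum()
            if n_ctx:
                ext_pk[tok] += attn[i, :n_ctx].max()
            int_pk[tok] += attn[i, n_ctx:].max()
    prop = {t: ext[t] / (ext[t] + intern[t]) for t in ext}
    prop_peak = {t: ext_pk[t] / (ext_pk[t] + int_pk[t]) for t in ext}
    return prop, prop_peak

# toy sentence: one target word, 2 context + 2 source positions
sents = [(["yes"], np.array([[0.1, 0.1, 0.5, 0.3]]), 2)]
prop, prop_peak = attention_statistics(sents)
print(round(prop["yes"], 3), round(prop_peak["yes"], 3))  # 0.2 0.167
```

The peak-based variant deliberately ignores diffusely spread attention: only the strongest external and internal link of each occurrence enters the statistic.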
The words with the highest external attention according to that measure are listed in Table 3.

word | freq | external | internal | prop.% | ∅ pos.
yeah | 35 | 0.135 | 0.242 | 35.9 | 3.71
wake | 6 | 0.179 | 0.326 | 35.4 | 6.67
yes | 182 | 0.091 | 0.259 | 26.1 | 4.22
tired | 9 | 0.113 | 0.364 | 23.7 | 5.67
oh | 61 | 0.086 | 0.288 | 23.1 | 2.08
anywhere | 6 | 0.094 | 0.326 | 22.4 | 7.67
dover | 6 | 0.119 | 0.426 | 21.9 | 5.83
course | 35 | 0.069 | 0.271 | 20.3 | 3.17
speak | 15 | 0.072 | 0.305 | 19.1 | 4.67
sure | 29 | 0.065 | 0.284 | 18.7 | 3.59
average | — | 0.021 | 0.441 | 4.5 | —
(20) she | 98 | 0.062 | 0.343 | 15.3 | 3.70
(50) he | 232 | 0.048 | 0.355 | 11.9 | 4.04
(64) it | 533 | 0.040 | 0.332 | 10.8 | 4.81
(72) you | 1349 | 0.038 | 0.327 | 10.4 | 4.28
(79) they | 135 | 0.040 | 0.356 | 10.1 | 4.17

Table 3: Word types with the highest average of external attention peaks.

The list is quite similar to the previous one, but one notes that the pronouns all advance in the rankings, suggesting a more focused attention for these entities. This is an interesting observation, and we will leave further investigations to future work.
Let us now turn to the second model, which works with larger translation units. Here, the neural network produces a translation of the entire extended input. This includes the generation of segment break symbols and attention over the entire sequence. Again, the question arises whether the model learns to look at information outside of the aligned segment. External context is no longer marked with specific prefixes, and token representations are completely shared in the model. Theoretically, the model can now swap, shuffle or merge information that comes from different segments. Random inspection does not yield many such cases, but we do see a number of cases where translations include information from previous parts or where the segment break is placed in a different position than in the reference translation. Often, this is actually due to alignment errors in the reference data, such that the translation system is penalised without reason in our automatic evaluation. Table 4 shows the scores of the extended context translations, and we can now see a slight improvement in BLEU and chrF3. Note that each translation hypothesis and each reference now refers to two segments, with the break tokens between them removed. Hence, the scores do not match the ones in Table 1.

in % | BLEU | chrF3 | (precision) | (recall)
baseline* | 27.25 | 44.14 | 55.61 | 43.15
2+2* | 27.41 | 44.54 | 55.51 | 43.58

Table 4: BLEU and chrF3 on extended context segments (sliding window). Individual segments are simply concatenated in the baseline system where necessary.

Figure 5 illustrates an example with a large proportion of cross-segmental attention. In this case, the model summarises part of segment one together with segment two into one translation, and the attention goes mainly to segment one.

Figure 5: Attention with multiple sentences and large cross-segment attention (target: "do you want to go ? || I think I 'll wait ."). The double bars refer to segment breaks.

This looks quite acceptable from the point of view of coherence. Looking at the reference used for automatic evaluation, we can actually see a misalignment in the data, where "do you want to go ?" should have been aligned to "wollen Sie gehen ?":

I don 't care what you 've started . do you want to go ?
mir ist egal , was sie angefangen haben .
no , I think I 'd better wait .
wollen Sie gehen ? ich glaube , ich warte besser .
- Yes , I 'll wait .
-Ja , ich warte .
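The evaluation setups described above require post-processing of the 2+2 output: break tokens are removed for the concatenated-segment scores (Table 4), and only the final segment is kept for the aligned-unit scores (Table 1). A sketch, using the BREAK symbol from our data format (helper names are ours):

```python
def remove_breaks(text, brk="BREAK"):
    """Concatenated-segment evaluation: drop the break tokens."""
    return " ".join(tok for tok in text.split() if tok != brk)

def last_segment(text, brk="BREAK"):
    """Aligned-unit evaluation: keep only the final segment."""
    return text.split(brk)[-1].strip()

hyp = "- Where are they ? BREAK do you see them ?"
print(remove_breaks(hyp))  # - Where are they ? do you see them ?
print(last_segment(hyp))   # do you see them ?
```

Note that when the model places a break at a different position than the reference (or, as above, the reference alignment itself is wrong), the last-segment evaluation penalises the system even though the concatenated output may be perfectly coherent.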
It is also interesting to see that the generation of the segment break symbol uses information from segment-initial tokens and punctuation such as question marks. This matches the intuitions about the decision of whether a segment is complete or not.

We also computed word-type-specific attention again. However, the list of words that put significant focus on other segments looks quite different from the previous model. The top-ten list is shown in Table 5.

word | freq | external | internal | prop.% | ∅ pos.
exactly | 5 | 0.190 | 0.644 | 22.8 | 2.20
shelf | 5 | 0.202 | 0.692 | 22.6 | 8.40
upstairs | 5 | 0.186 | 0.757 | 19.7 | 7.60
unbelievable | 7 | 0.151 | 0.641 | 19.1 | 2.86
yeah | 91 | 0.144 | 0.667 | 17.8 | 1.95
hardly | 5 | 0.155 | 0.740 | 17.4 | 2.20
cares | 5 | 0.144 | 0.755 | 16.0 | 2.60
horns | 8 | 0.134 | 0.713 | 15.8 | 5.25
fossils | 7 | 0.137 | 0.744 | 15.5 | 3.57
-what | 10 | 0.121 | 0.660 | 15.5 | 1.00
average | — | 0.028 | 0.880 | 3.1 | —

Table 5: Word types with the highest cross-segmental attention (excluding attention on sentence break symbols).

We also computed the average attention peak and the proportion of such attention to other segments. The words with the highest values are shown in Table 6. Again, we can see response particles, but also some additional adverbials that can have connective functions. Pronouns appear quite low in the ranked list, and we therefore leave them out of the presentation here.

word | freq | external | internal | prop.% | ∅ pos.
-the | 5 | 0.436 | 0.541 | 44.6 | 1.00
-what | 10 | 0.358 | 0.519 | 40.9 | 1.00
exactly | 5 | 0.171 | 0.266 | 39.2 | 2.20
-aye | 12 | 0.345 | 0.550 | 38.5 | 1.00
-yes | 7 | 0.281 | 0.472 | 37.3 | 1.00
apparently | 7 | 0.308 | 0.536 | 36.5 | 1.00
hardly | 5 | 0.178 | 0.321 | 35.7 | 2.20
anyway | 9 | 0.241 | 0.443 | 35.2 | 1.00
ah | 6 | 0.217 | 0.407 | 34.8 | 1.00
ahoy | 6 | 0.304 | 0.590 | 34.0 | 1.00
average | — | 0.043 | 0.440 | 8.9 | —

Table 6: Word types with the highest average of cross-segmental attention peaks.

Cross-segmental attention peaks are dominated by tokens with relatively low overall frequency, some of which arise from tokenization errors (e.g. the words starting with a hyphen, typically from sentence-initial positions).
Therefore, we propose another type of evaluation that is less sensitive to overall frequency: we only count occurrences of target words whose external attention is higher than their internal attention, and normalise these counts by the total occurrence count of the target word. We discard words with four or fewer majority-external occurrences. Results are shown in Table 7.

word | proportion | freq ext peak | freq
yeah | 0.077 | 7 | 91
oh | 0.069 | 7 | 101
yes | 0.054 | 11 | 204
thank | 0.049 | 7 | 144
no | 0.025 | 8 | 320
- | 0.023 | 44 | 1890
good | 0.018 | 5 | 284
here | 0.017 | 6 | 346
? | 0.016 | 29 | 1812
... | 0.016 | 5 | 316
. | 0.014 | 104 | 7645
what | 0.012 | 6 | 486
you | 0.009 | 23 | 2458
that | 0.008 | 6 | 725
's | 0.008 | 9 | 1102
it | 0.005 | 5 | 914
, | 0.004 | 16 | 3561
i | 0.004 | 10 | 2372

Table 7: Word types with the highest proportion of cross-segmental attention peaks, with absolute frequencies of cross-segmental attention peaks and overall absolute word frequencies.

In addition to the known response particles and punctuation signs, we also see pronouns and demonstrative particles (such as here, what, that) ranked prominently. However, the absolute numbers are small and only permit tentative conclusions. This analysis also allows us to see the direction of cross-segmental attention. Items that tend to occur at the beginning of a sentence show attention towards the previous sentence, whereas items that occur at the end of a sentence (such as punctuation signs, but also the 's token) show attention towards the following sentence.

We also inspected some translations and their attention distributions in order to study the effect of larger translation units on translation quality. One example is the translation in Figure 6.

Figure 6: Attention patterns with referential pronouns in extended context (target: "where are they ? || see them ?", source: "-Wo sind sie ? || siehst du sie ?").

The example illustrates how the model works when deciding on translations of ambiguous words like the German pronoun "sie". First, when generating "they", the model looks at the verb for agreement constraints, and the representation around the plural inflection "sind" of the German equivalent of "are" receives significant attention. Even more interesting is the translation of "siehst du sie ?", which in isolation is translated to (the intuitively most likely translation) "do you see her ?" by our baseline model. In the extended model, the translation changes to "them", which agrees with the context and is coherent here. Why the auxiliary verb and the subject pronoun are left out is another question, but that could be due to the colloquial style of the training data. In any case, the figure shows that "them" also looks at "sind" in the previous sentence with a weight (0.031) that is significantly larger than for other positions in the previous sentence. This amount seems to contribute to the change to plural, which is, of course, satisfactory in this case.
Target language context will certainly also contribute to this effect, but even the 2+1 model produces "them" in this particular example, without the additional target context but with the same information from the source.

However, sometimes the extended model is worse than the baseline with respect to pronoun translation. An example is shown below. In this case, the context window is too small and does not cover the important referent (der Sonnenaufgang / the sunrise), which appears two sentences before the anaphoric pronoun (er / it). Whether an even larger context model would pick this up correctly is not certain.

context 2: hast du je den Sonnenaufgang in China gesehen ?
reference: ever notice the sunrise in China ?
context 1: solltest du .
reference: you should .
source: er ist wunderschön .
reference: it 's beautiful .
baseline: it 's beautiful .
extended: he 's beautiful .
Some translations also become more idiomatic due to the additional context. Empirical evidence is difficult to give, but here are three examples that illustrate small changes that make sense:

source: los , Fenner !
reference: go ahead , Fenner !
baseline: go , Fenner !
extended: come on , Fenner !

source: was Sie nicht sagen !
reference: you don 't say !
baseline: what you don 't say !
extended: you don 't say !

source: ganz meiner Meinung .
reference: that 's what I say .
baseline: my opinion .
extended: I agree .
The example of Figure 6 raises the question whether the extended model is able to reliably and systematically disambiguate pronominal translations. In order to answer this question, we extracted all occurrences of the ambiguous pronoun sie/Sie from our test set (1143 occurrences in 1018 sentences, i.e. in every fifth sentence of the test set) and manually evaluated about half of them (565 occurrences in 516 sentences), comparing the output of the baseline system with that of the 2+2 system. We distinguish four categories on the basis of the reference translation: polite imperative Sie, other occurrences of the polite pronoun Sie, feminine singular sie, and plural sie. Table 8 lists the results.
Word category | Occurrences | Baseline | 2+2
Polite imperative | 101 | 98.0% | 97.0%
Polite other | 301 | 94.4% | 95.0%
Feminine singular | 77 | 85.7% | 85.7%
Plural | 86 | 69.8% | 79.1%
All | 565 | 90.1% | 91.7%
Table 8: Percentages of correct translations of the pronoun sie/Sie.

The table shows that polite forms are most frequent in the corpus and also rather easy to translate thanks to capitalisation. In the case of imperatives, they are simply deleted (e.g., Kommen Sie ! becomes Come !), whereas in other contexts they are consistently translated as you. The remaining errors are mainly due to entire segments that are left untranslated, or to erroneous lowercasing of sentence-initial positions during preprocessing.

Distinguishing singular from plural readings is harder: a non-polite form sie can be translated as she or it in its singular reading (depending on the grammatical gender of the antecedent), or as they or them in its plural reading (depending on case). The figures show that the extended model is better at correctly predicting they (and them), but that correctly predicting she or it is equally hard with or without context. While the superiority of the 2+2 model cannot be established numerically (none of the reported differences are statistically significant according to χ² tests), there are examples that show corrected output:

context: du bist nur ein Junge und das sind böse Männer .
reference: you 're only a boy , they 're vicious men .
source: such sie , Max .
reference: get ' em , Max .
baseline: find her , Max .
2+2: find them , Max .

context: Sie verstecken sich wie die Ratten im Müll .
reference: they hide out like rats in the garbage .
source: wenn du sie finden willst , musst du ebenso im Müll wühlen wie sie .
reference: so if you 're gonna get ' em , you 'll have to wallow in that garbage right with them .
baseline: if you want to find her , you 'll have to wallow in the trash like her .
2+2: if you want to find them , you have to dig through the garbage as well as them .
The decision of translating feminine singular pronouns as she or it is also improved in some cases by the 2+2 model:

context: mehr bedeutet dir die Sache nicht ?
reference: is that all my story meant to you ?
source: was sonst könnte sie mir bedeuten ?
reference: what else could it mean to me ?
baseline: what else could she mean to me ?
2+2: what else could it mean to me ?

context 2: kennst du die alte Mine hier ?
reference: know the old mine around here ?
context 1: - Davon gibt ' s hier viele .
reference: - There 's a lot of them here .
source: - Sie gehört einem gewissen Sand .
reference: - It 's worked by a man named Sand .
baseline: - She owns a certain sand .
2+2: - It belongs to a certain sand .

However, there is currently not much evidence that these improvements are due to cross-segmental attention. It remains to be investigated whether this also holds for the 2+1 model and variants thereof.
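The significance claim above can be checked with a χ² test on a 2×2 contingency table. The sketch below is our own reconstruction: the counts (60/86 and 68/86 correct plural translations) are derived from the percentages in Table 8 and rounded, so treat them as approximate:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (without continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# correct vs. incorrect plural translations, reconstructed from Table 8:
# baseline 69.8% of 86 -> ~60 correct; 2+2 79.1% of 86 -> ~68 correct
stat = chi2_2x2(60, 26, 68, 18)
# the critical value for df = 1 at p = 0.05 is 3.841
print(stat < 3.841)  # True: the difference is not significant
```

The statistic comes out at roughly 1.95, well below the critical value, which is consistent with the observation that the advantage of the 2+2 model on the plural category cannot be established numerically.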
In this paper, we present two simple models that use larger context in neural MT: one that adds source language history to the input and one that concatenates subsequent segments in the training data. We discuss the effect on translation and on the attention model in particular. We show that neural MT is indeed capable of translating with wider context and that it also learns to distinguish information coming from different segments or from the discourse history. We run experiments on German-English subtitle data and find various examples in which referential expressions across sentence boundaries are handled properly. The current study is our first attempt to model discourse-aware neural MT, and the outcome is already encouraging. However, the evidence so far is rather anecdotal; in the future, we plan to run more systematic experiments with detailed analyses and evaluations. We will look at different context windows and other ways of encoding discourse history. We will also study specific discourse phenomena in more depth, trying to find out whether NMT learns to handle them in a linguistically plausible way. Finally, this research also intends to provide insights into the development of discourse-aware coverage models for NMT. Explicit models of coverage have been shown to reduce the amount of overtranslation and undertranslation, whereas our translation models with extended context settings aim to exploit overtranslation and undertranslation to some extent. Our experiments will hopefully contribute to a better understanding of the attention and coverage dynamics in discourse-aware NMT.
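The segment-concatenation setting described above can be illustrated with a small preprocessing sketch, assuming a _BREAK_ marker between consecutive segments; the marker name and the helper function are our own illustration of the 2+2 idea, not the exact scripts used in our experiments.

```python
# Assumed segment-boundary token; the actual marker string is an
# implementation choice, not prescribed by the model.
BREAK = "_BREAK_"

def make_2plus2(src_segments, tgt_segments):
    """Pair every segment with its predecessor on both the source and
    the target side, yielding extended-context (2+2) training examples."""
    examples = []
    for i in range(1, len(src_segments)):
        src = f"{src_segments[i - 1]} {BREAK} {src_segments[i]}"
        tgt = f"{tgt_segments[i - 1]} {BREAK} {tgt_segments[i]}"
        examples.append((src, tgt))
    return examples

# Tokenized subtitle segments, as in the examples discussed earlier.
pairs = make_2plus2(
    ["du bist nur ein Junge .", "such sie , Max ."],
    ["you 're only a boy .", "get 'em , Max ."],
)
```

At test time, only the portion of the output after the marker would typically be kept as the translation of the current segment, while the part before it provides the cross-sentential context that the attention mechanism can exploit.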
Acknowledgments
We wish to thank the anonymous reviewers for their detailed reviews. We would also like to acknowledge the Finnish IT Center for Science (CSC) for providing computational resources and NVIDIA for their support by means of their GPU grant.