Continuous Learning in Neural Machine Translation using Bilingual Dictionaries
Jan Niehues
Department of Data Science and Knowledge Engineering (DKE)
Maastricht University, Maastricht, The Netherlands
[email protected]
Abstract
While recent advances in deep learning have led to significant improvements in machine translation, neural machine translation is often still not able to continuously adapt to its environment. For humans, as well as for machine translation, bilingual dictionaries are a promising knowledge source for continuously integrating new knowledge. However, their exploitation poses several challenges: the system needs to be able to perform one-shot learning as well as model the morphology of the source and target language. In this work, we propose an evaluation framework to assess the ability of neural machine translation to continuously learn new phrases. We integrate one-shot learning methods for neural machine translation with different word representations and show that it is important to address both in order to successfully make use of bilingual dictionaries. By addressing both challenges we are able to improve the ability to translate new, rare words and phrases from 30% to up to 70%. The correct lemma is generated in more than 90% of the cases.
1 Introduction

Recent advances in neural machine translation (NMT) have led to astonishing translation quality of research systems in evaluation campaigns as well as of commercial systems. These improvements have even led to discussions whether automatic machine translation is already on par with human translation (Barrault et al., 2019). One challenge that has received less attention is the ability of these systems to continuously learn over time. Humans, in contrast, are continuously improving their skills and adapting to an ever-changing environment. There are several reasons why this is necessary: First, nobody is fluent in all possible domains. Even professional translators need to adapt to the specific vocabulary of different domains. Secondly, language is not static but develops over time, and translators need to learn new terms, meanings and expressions.

For humans, one successful approach to adapt to the environment is the usage of a dictionary. Learning translations from a dictionary has several advantages: Dictionaries contain minimal examples. We do not need to collect full sentences, but can directly learn translations from a single phrase. Furthermore, this can even be generalized to other inflected forms of the same lexeme. Secondly, it enables the system to directly integrate corrections. If a user sees a specific problem, the user can interact with the system by adding a specific dictionary entry. This is very important if a specific terminology should be used.

Motivated by this success for human translators, in this work we enable NMT to also successfully integrate knowledge from bilingual dictionaries. Thereby, we focus on learning translations that could not be learned from the parallel data. This poses several interesting research challenges, as shown in the example in Table 1.
When training a system on the proceedings of the European Parliament, it might never have seen the word giraffe and needs to learn the translation from the dictionary. (In this work, dictionary entries can consist of a single word or whole phrases.) First of all, we have to address one-shot learning. The system needs to be able to continuously learn new dictionary entries and then should directly be able to translate all occurrences of this phrase. Secondly, the model must be aware of the morphology of the source and target language. In a dictionary, only the base form of a word is given. In the example, only the lemma giraffe is in the dictionary, but not the plural form giraffes. Therefore, we must enable the system to translate different lexemes of a lemma knowing only the translation of the base form. This involves analysing the morphological form of the source word, transferring the information about the form to the target, and finally generating the correct morphological form of the target word based on the dictionary entry as well as on the morphological form of the source word. In German, the plural of the dictionary entry Giraffe is Giraffen.

Table 1: Example of dictionary usage
  Source:     Tell us, what have you got against giraffes ?
  Dictionary: giraffe → Giraffe
  Reference:  was haben Sie eigentlich gegen Giraffen ?
  Annotation: Tell us, what have you got against

In order to assess the approaches under this challenging condition, it is essential to define an appropriate evaluation scheme. While the ability to continuously learn new translations is essential in many practical applications, newly learned terminology will only occur rarely. Therefore, standard methods for evaluating machine translation are not able to measure the effect appropriately. In order to address these challenges, we make the following contributions:

• We develop a targeted evaluation approach for the continuous learning of new translations (Section 2).
• We show that a character-based representation is essential to inflect unknown words correctly (Section 3).
• We show that only the combination of word representation and one-shot learning enables the successful integration of bilingual dictionaries (Section 3).
2 Evaluation Framework

The first important research question that needs to be addressed in the targeted continuous learning scenario is the evaluation approach. While the evaluation of machine translation is well established (e.g. using BLEU (Papineni et al., 2002)), newly learned words are typically rare words, and therefore their influence on a BLEU score calculated on all words is very limited.

In order to have a valid evaluation approach, the evaluation should focus on phrases that cannot be learned from the parallel data. These are typically very rare phrases. Furthermore, we want to translate them in a real-world situation. Therefore, the evaluation data should not consist of synthetic sentences. Finally, the approach should use the standard parallel data without the need to collect additional parallel data.

A first attempt would be to use existing test data and select sentences where dictionary entries are needed, as e.g. done in Dinu et al. (2019). However, if we limit ourselves to phrases that do not occur in the parallel data, or only a few times, the number of occurring words in the test sets is too low to draw any conclusions.

Therefore, we evaluate our approach by proposing a new test-train split of existing parallel data. In a first step, we filter a large background dictionary for entries that help to translate phrases that only occur a few times in the existing parallel data. In a second step, we select some of the sentences with their matching dictionary entries as the new test sets. An overview of the process is shown in Figure 1. Finally, we specifically evaluate the ability of the translation system to translate the dictionary entries.

In addition, it is important to ensure that the proposed methods do not have negative side effects on the overall translation quality. Therefore, we also evaluate the model using standard evaluation metrics on well-established test sets and on the proposed test set.
Due to the weakness of these metrics at measuring improvements on rare words, we do not expect the proposed methods to improve on these metrics, but it is important that the performance measured by these metrics does not decrease significantly.
2.1 Dictionary Creation

In a first step, we create a large background dictionary for each considered language pair by extracting a bilingual dictionary from the English Wiktionary. For this, we extracted the translations from a Wiktionary dump (https://dumps.wikimedia.org/enwiktionary/20200501/enwiktionary-20200501-pages-articles.xml.bz2) using wiktextract (https://github.com/tatuylonen/wiktextract).

Figure 1: Overview of the evaluation approach: Based on the parallel data and dictionary, a new split of the data is generated.

Secondly, we match the dictionary entries to the targeted corpus. For this, we lemmatize all dictionary entries as well as both sides of the parallel data. This is done to also find matches for all morphological variants of the dictionary entries. Finally, for each dictionary entry we calculate the statistics mentioned in Table 2 about its matches in the parallel data.

In a third step, we filter the dictionary based on these statistics. We only select words that are rare in the corpus. If the words are common and occur often in the training data, a dictionary entry would not be helpful. Secondly, we want to analyse the ability of the system to generate different morphological forms. Therefore, we only consider entries that occur with at least two different morphological variants on the target side. Finally, in this work we focus on words that are not ambiguous. We leave the integration of word sense disambiguation to also handle ambiguous dictionary entries for future work. Therefore, we only consider phrases where both the source and the target phrase occur fewer than a threshold number of times with a different translation than the one given in the dictionary.

Table 2: Dictionary filtering
  Statistic                   Threshold
  Occurrences                 ≤ k
  Target inflected phrases    ≥ 2
  Only source/target match    < …
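The filtering step can be sketched as follows. Since the concrete thresholds in Table 2 did not survive extraction, the default values below, as well as all names, are illustrative placeholders rather than the paper's actual settings.

```python
from dataclasses import dataclass

@dataclass
class EntryStats:
    occurrences: int      # how often the entry matches the parallel data
    target_variants: int  # distinct inflected target forms observed
    mismatches: int       # source/target matches with a different translation

def keep_entry(stats, max_occurrences=5, min_target_variants=2, max_mismatches=2):
    """Apply the three filtering criteria of Table 2.

    The concrete thresholds were not recoverable, so the defaults
    here are illustrative placeholders only.
    """
    return (stats.occurrences <= max_occurrences
            and stats.target_variants >= min_target_variants
            and stats.mismatches < max_mismatches)

# Hypothetical statistics for two entries: a rare word is kept,
# a frequent word is filtered out.
dictionary = {
    "giraffe": EntryStats(occurrences=3, target_variants=2, mismatches=0),
    "house": EntryStats(occurrences=5000, target_variants=4, mismatches=1),
}
kept = {word for word, stats in dictionary.items() if keep_entry(stats)}
```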
2.2 Data Split

Finally, we generate a split of the corpus into training, validation and test sets based on the selected dictionary entries, as shown in Figure 1. The model needs to learn how to use the dictionary. Therefore, several training sentences need to be annotated with dictionary entries. Furthermore, we want to evaluate the ability of the model to translate phrases it has seen a few times in training (few-shot learning) as well as words it has only seen in the dictionary (one-shot learning). Therefore, we split the entries in the dictionary equally into three sets: Test (yellow), Mix (orange) and Train (green). All sentence pairs associated with entries from the Test set are added to the newly created test set.

In a second step, we select all the sentences from the remaining training sentences where an entry from the Mix set occurs. For each entry, half of the sentences are added to the test set and a quarter each to the validation and training set.

Finally, all sentences with entries from Train are distributed equally between the training and validation set. Since we want to concentrate on modelling the morphology when using the dictionary, and not the translation ambiguity of dictionary entries, we removed all sentences from the training data where the source entry from the dictionary occurs but the target sentence does not contain the target entry. Due to our selection of the dictionary, where we focus on words that rarely have a different translation (cf. Table 2), we only removed very few sentences here. All remaining sentences with no annotations (most of the sentences) were used for training.

2.3 Evaluation Metrics

When evaluating, we want to focus on the system's ability to translate the phrases from the dictionary. Therefore, we measure the accuracy of translating the dictionary entries in addition to the commonly used BLEU score. In addition to calculating the accuracy by comparing the inflected words of hypothesis and reference (Exact match), we calculate further statistics to analyse the approaches.

First, we measure the ability of the system to at least generate the correct lemma, ignoring errors made due to wrong inflection of the words. For each sentence, we compare the lemmatized target phrase of the dictionary entry with the lemmatized version of the generated translation. We will refer to this metric as Lemma match.

Finally, we are especially interested in the ability of the model to generate the correct inflected form. For many words, this is quite straightforward, since it is the same as the lemma. Therefore, we also measure the exact match on the subset of the dictionary entries where the target side of the dictionary entry is different from the inflected form occurring in the reference. To generate the correct translation, in this case the model really needs to change the output. We will refer to this metric as Morph. Adjustment.

In addition to these three evaluation scores, we also investigate the performance on the different types of entries. We evaluate all metrics on all entries and independently on the one-shot (OneS) and few-shot (FewS) entries.
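As a concrete sketch, the three accuracy scores can be computed as below. The `lemmatize` callback stands in for a real lemmatizer such as Stanza, and the input format of `(reference, hypothesis, dictionary lemma)` triples is an assumption made for illustration.

```python
def entry_accuracy(examples, lemmatize):
    """Compute Exact match, Lemma match and Morph. Adjustment.

    `examples` holds (reference_phrase, hypothesis_phrase,
    dictionary_target_lemma) triples; `lemmatize` maps a phrase to its
    lemmatized form. Both are illustrative assumptions.
    """
    exact = lemma = 0
    morph_total = morph_correct = 0
    for ref, hyp, dict_lemma in examples:
        if hyp == ref:
            exact += 1
        if lemmatize(hyp) == lemmatize(ref):
            lemma += 1
        # Morph. Adjustment: only entries whose reference form differs
        # from the dictionary target side (the lemma).
        if ref != dict_lemma:
            morph_total += 1
            if hyp == ref:
                morph_correct += 1
    n = len(examples)
    return {
        "exact_match": exact / n,
        "lemma_match": lemma / n,
        "morph_adjustment": morph_correct / morph_total if morph_total else None,
    }

# Toy usage: German plural Giraffen with lemma Giraffe; the trivial
# suffix-stripping lemmatizer is only for this example.
examples = [
    ("Giraffen", "Giraffen", "Giraffe"),  # correct inflected form produced
    ("Giraffe", "Giraffen", "Giraffe"),   # wrong form, correct lemma
]
scores = entry_accuracy(examples, lambda w: w.rstrip("n"))
```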
3 Dictionary Integration

To successfully integrate the dictionary into the NMT system, we need to address two challenges. First, we need to enable the system to perform one-shot learning: it should be able to translate a phrase after seeing it only once in the bilingual dictionary. Furthermore, it needs to be possible to continuously add new translations. Secondly, we need to model the morphology of the dictionary entries: we need to use the dictionary for different inflected forms of a word and also generate various inflected forms of the target phrase.
3.1 One-Shot Learning

In order to achieve one-shot learning, we need to combine the dictionary with our neural machine translation system. The combination should ensure fast learning, so that a single dictionary entry is enough to learn the translation. Furthermore, it needs to be flexible, so that new dictionary entries can be continuously added to the system and it is able to perform life-long learning by using the newly added entries.

One large advantage of deep learning approaches is that they are able to easily incorporate additional information. By annotating the input with additional information, the model is able to learn automatically how to make use of it. This has been done successfully, for example, for the translation of other MT systems (Niehues et al., 2016), for domain information (Kobus et al., 2017) and for information about formality (Sennrich et al., 2016a).

For the integration of additional knowledge about specific phrases, we follow similar approaches presented in Pham et al. (2018) and Dinu et al. (2019). The main idea is that we annotate each source phrase for which a dictionary translation is available with this translation. This is done by appending the translation to the source phrase within the sentence, as shown in Table 1. Since this is done during training and testing, the system is able to learn to copy and modify these suggestions. No further adaptation of the architecture of the NMT system is necessary. The system will learn how to exploit these suggestions and can transfer this knowledge to new translations that have not been seen in training. Therefore, a translation only needs to be added once to the dictionary, which enables the system to perform one-shot learning as well as to continuously learn new translations by extending the dictionary.

The main difference to previous work is that we are focusing on very rare words and morphological variants of the dictionary phrases.
Therefore, we investigate the matching of the dictionary entries as well as the number of necessary entries.

In order to find the dictionary entries for a given source sentence, we first lemmatize the sentence. In a second step, we match the dictionary against the lemmatized sentence. Finally, we map the found entries back to the original sentence.

When annotating the source sentence, we follow the related work and append the translation to the source phrase. As shown in Figure 1, we replace the source word giraffes by the dictionary entry. In contrast to the original work, we do not have the inflected target words, but only add the lemmatized target string to the sentence. For the source side, we keep the inflected form in the source sentence, so that the system is able to extract important morphological information from the source (e.g. grammatical number) and map it to the target. This is done for the training and test data. Then the baseline neural machine translation system is trained normally on the annotated sentences. We did not adapt the architecture, since in Dinu et al. (2019) the standard transformer-based system was able to learn to copy the suggested translations to the target side.

While the system should learn to also use dictionary entries it has not seen during training, it needs enough examples in order to learn how to use dictionary entries in general. Since we are concentrating on very rare words, the number of dictionary entries in the parallel data is relatively small. For larger corpora, we therefore explore whether it is helpful to annotate additional phrases. This was done by also extracting phrases that occur more often (add. Annot). However, we used the same split and also evaluated our approach only on the rare phrases.
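A minimal sketch of this annotation step is given below, restricted to single-word entries. The marker tokens and function names are assumptions for illustration; the paper does not specify the exact inline format of the annotation.

```python
def annotate(tokens, lemmas, dictionary, open_tag="<dict>", close_tag="</dict>"):
    """Append the lemmatized dictionary translation after each matched
    source word. Matching is done on lemmas, so inflected variants such
    as 'giraffes' are found, while the inflected source form itself is
    kept in the sentence."""
    out = []
    for token, lemma in zip(tokens, lemmas):
        out.append(token)
        if lemma in dictionary:
            out.extend([open_tag, dictionary[lemma], close_tag])
    return out

# The lemmas would come from a lemmatizer such as Stanza; here they
# are given by hand for the Table 1 example.
tokens = "Tell us , what have you got against giraffes ?".split()
lemmas = "tell we , what have you get against giraffe ?".split()
annotated = annotate(tokens, lemmas, {"giraffe": "Giraffe"})
```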
3.2 Word Representation

A second challenge when building a machine translation system for the targeted scenario is the generation of the correct inflected word form. Since we have seen the new words only in the dictionary, we will often need to generate inflected word forms that we have seen neither in the dictionary nor in the corpus.

While there were attempts to generate unknown inflected word forms for dictionary entries prior to neural machine translation (e.g. Niehues and Waibel (2011)), the ability of neural machine translation to represent parts of words offers a unique opportunity to model morphological inflection. Therefore, in this work, we concentrate on the word representation used in the NMT system. We always use the same representation for the source and the target language. The most commonly used word representation in state-of-the-art neural machine translation systems is byte-pair encoding (BPE) (Sennrich et al., 2016b). A second successful approach to represent words in a neural machine translation system are character-based representations, where each word is split into its characters.

While there have been several works comparing these two representations (e.g. Sennrich (2017)), they mostly concentrate on achieving the overall best translation performance. In this work, however, we focus on the rare words, since only for these words do we need to learn how to generate different inflected forms. For the more frequent words, this is often not that important, since all word forms occur several times in the corpus.

Besides the generation of unknown inflected forms, the word representation is also important when learning to copy the annotations to the target. If we look at the example dictionary entry concentric → konzentrisch, the lemma konzentrisch is split into the subwords konzent@@ ris@@ ch, while the inflected form konzentrischer is split into kon@@ zentr@@ ischer. In this case there is no overlap in the subwords between the lemma and the inflected form.
Therefore, it is difficult for the system to learn from the suggested translation. In contrast, when looking at the character-based representation, the model can copy the lemma and only has to learn to add additional tokens at the end.

In a first step, we compare character-based and subword-based models, highlighting their ability to generate new inflected forms of rare words. For both we use exactly the same NMT architecture. The only difference is that the input and output length for the character-based models is significantly larger, since the number of characters is higher than the number of subwords.

We will see that the character-based models are significantly better at generating the different inflected forms of rare words. However, a major challenge is the training time. Due to the significantly longer sequence length, training and decoding are much slower. Therefore, we also propose a combination of word-based and character-based models.

In the mixed representation, we split each word that occurs less than k times into its characters, while the other words are kept as they are. Since only frequent words are not split into characters, no further subword segmentation is performed for these words. Thereby, we speed up the processing due to the shorter sequence length, but still retain the ability to learn how to inflect rare words. Some dictionary entries contain phrases with many frequent words. In order to better inflect these words, in a second approach we additionally split all words within a dictionary phrase into characters. We refer to this technique as Mix+Annot.

4 Experiments

We evaluate the approaches on three different data sizes and on two different language pairs (English-German and English-Czech). Since we are focusing on the generation of different morphological forms, we always use the morphologically rich language as the target language.
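The mixed representation described in Section 3.2 can be sketched as follows. The end-of-word marker is an assumption; some such marker is needed to merge the characters back into words after decoding. The frequency cutoff defaults to the k = 50 used in the experiments below.

```python
from collections import Counter

def mixed_segment(sentence, counts, k=50, boundary="</w>"):
    """Split words seen fewer than k times in the training data into
    characters; keep frequent words as single tokens."""
    out = []
    for word in sentence.split():
        if counts[word] < k:          # rare word: character-level
            out.extend(list(word))
            out.append(boundary)
        else:                         # frequent word: kept whole
            out.append(word)
    return out

# Illustrative frequency counts; a real system would count over the
# full training corpus.
counts = Counter({"the": 1000, "giraffe": 2})
segmented = mixed_segment("the giraffe", counts)
```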
For English-to-German we created two datasets of different sizes. A first series of experiments is run on the TED corpus (Cettolo et al., 2012). We split the corpus into training, validation and test sets as described in Section 2. In addition, we evaluate the system on the official test sets tst2014 and tst2015 and report average metrics for these test sets.

For the second system, we use the Europarl corpus (Koehn, 2005). This corpus is around 10 times bigger than the TED corpus, as shown in Table 3. In addition to the targeted test set, we also tested the systems on test2006 and test2007, the most recent official test sets from the same domain used for the WMT.

Finally, we also tested the techniques on a different language pair. For this we chose English to Czech and again use the Europarl corpus. Since there is no official in-domain test set available, we also tested the systems on the newstest2019 test set.

As shown in Table 3, the parameters mentioned in Section 2 lead to a reasonable test set size for all corpora. As mentioned in Section 3.1, we evaluate the system on Europarl with different amounts of training annotations. All data sets with their splits are available for further experiments (https://nlp-dke.github.io/data/rareWordNMT/).

Table 3: Data size in number of sentences
                 EN-DE                EN-CS
                 TED      Europarl    Europarl
  Train          198K     1.9M        636K
  - Annot        1.6K     1.2K        2.7K
  - add. Annot            14.5K       24.3K
  Valid          1610     1196        2000
  Test           3181     2140        5360

All data was processed using the Stanza toolkit (Qi et al., 2020) for tokenization and lemmatization. The lemmatization was only used for matching the dictionary entries; the translation systems were built on the inflected words. If BPE is applied, we used a BPE size of 20K. For the mixed representation, words occurring less than k = 50 times were represented as individual characters.

We use the standard transformer architecture (Vaswani et al., 2017) and increase the number of layers to eight.
The layer size is 512 and the inner size is 2048. Furthermore, we apply word dropout (Gal and Ghahramani, 2016). We use the same learning rate schedule as in the original work and the implementation presented in Pham et al. (2019) (https://github.com/nlp-dke/NMTGMinor). All systems were always trained from scratch with random initialization.

A first series of experiments was performed on the TED task. We evaluated the one-shot learning approach by source sentence annotation as well as the three different word representations described in Section 3.2. In a first step, we evaluated the translation performance using BLEU (mteval-v14.pl) and characTER (Wang et al., 2016) on the continuous learning test set as well as on the official test set (Table 4).

Table 4: Translation quality on TED tasks
  Represen-  One-Shot  CL Test              official Test
  tation               BLEU↑   characTER↓   BLEU↑   characTER↓
  BPE        No        25.97   44.09        26.17   44.62
  Character  No        28.12   42.79        26.57   44.27
  Mix        No        27.44   42.79        26.83   44.28
  BPE        Annot     26.00   41.74        26.21   44.73
  Character  Annot     28.92   40.16        26.72   43.96
  Mix        Annot     28.93   40.96        26.8    44.44

The baseline systems using no one-shot learning do not annotate the source at all and are trained on the standard parallel data. If we take a look at the official test set, we see that systems using a character-based representation (Character and Mix) perform slightly better than the subword-based models. This might be due to the fact that the TED training data is rather small. Secondly, the one-shot learning approach has no influence on the translation performance on this test set. This is not surprising, since only 94 phrases in the 4343 sentences of the test sets were annotated. Therefore, we also evaluated our approach on the dedicated continuous-learning test set (CL test) created by the new train-test split.

The improvements by character-based representation on the CL test set are even larger. This might be due to the fact that there are more rare words in these sentences and therefore the advantage of the character-based models is stronger. Secondly, in this case, the one-shot approach improves the translation quality. That the improvements for the BPE-based system are only measured by characTER and not by BLEU might indicate that for this system it is more challenging to generate the correct inflected form.

Table 5: Rare word accuracy on TED tasks
  Representation  One-Shot  Exact match     Lemma match     Morph. Adjustment
                            All  OneS FewS  All  OneS FewS  All  OneS FewS
  BPE             No        34   22   53    31   27   62    29   22   43
  Character       No        48   40   60    55   47   68    45   43   48
  Mix             No        42   35   54    49   40   63    38   34   46
  BPE             Annot     48   34   69    62   46   88    33   24   50
  Character       Annot     76   74   78    92   91   93    62   61   64
  Mix             Annot     75   72   79    92   91   94    59   56   65

To better analyse this, we also perform a detailed evaluation as described in Section 2.3, shown in Table 5. First of all, the experiments show the difficulty of the task. The baseline system is only able to translate 34% of the phrases correctly. For the one-shot subset this even drops to 22%.

Secondly, the experiments show that the challenge can only be addressed successfully by modelling both one-shot learning and word representation. Only the systems in the last two lines, using character-based word representations and one-shot learning, achieve high accuracy. We see an improvement of 50 percentage points absolute, which is a relative improvement of more than 300%. Furthermore, for these models there is no longer a clear difference between the one-shot and few-shot examples (compare columns OneS and FewS).

By looking at the techniques separately, we see that using only one-shot learning improves the quality slightly. However, even when ignoring the word inflection, the model is often not able to produce the correct lemma. The example in Section 3.2 illustrates one challenge when learning to copy with different subword segmentations. If we only use character-based representations, we see improvements, especially for phrases that do not occur in training. In this case, the model is more often able to find the correct translation based on translations of other words. However, a similar performance between few-shot and one-shot learning is only achieved by combining both techniques.

Finally, when looking only at the words where the lemma is different from the inflected form, we still see open research challenges. While we could improve the accuracy from around 20% or 30% to nearly 60%, it is still the most difficult case.

While there is no clear difference between the character-based model and the mixed model in output quality, there is a clear difference in training speed. For the full training of 64 epochs, the character-based model needs 14h, while the mixed representation only needs around 4h. While this is still slower than the subword-based model (2.5h), it allows for fast training of the model. Therefore, we only compared the mixed and the subword-based representation in the remaining experiments on larger corpora.
In a second set of experiments, we evaluated the approach on the larger data sets for two different language pairs. In addition to the two word representations from the last experiment (BPE and Mix), we also applied Mix+Annot, where we additionally represent all words within dictionary entries as characters, as described in Section 3.2. Furthermore, we also investigate add. Annot, where additional dictionary entries were used for more training examples. The results are shown in Table 6.

Table 6: Translation performance on the Europarl data set
  Lang.  Represen-  One-Shot    CL Test              official Test
         tation                 BLEU↑   characTER↓   BLEU↑   characTER↓
  Ger.   BPE        No          28.74   47.30        25.30   48.75
         Mix        No          30.83   45.80        25.52   48.48
         BPE        Annot       28.74   47.47        25.49   48.64
         Mix        Annot       31.63   44.64        25.45   48.60
         BPE        add. Annot  28.81   47.24        25.45   48.71
         Mix        add. Annot  31.44   44.75        25.50   48.57
         Mix+Annot  add. Annot  31.76   44.11        25.64   48.41
  Czech  BPE        No          34.25   39.43        16.2    57.15
         BPE        Annot       34.73   38.39        15.57   57.59
         Mix+Annot  Annot       34.86   38.16        16.62   57.70
         BPE        add. Annot  34.74   38.89        15.7    57.37
         Mix+Annot  add. Annot  35.21   37.95        16.63   57.65

  Lang.  Represen-  One-Shot    Exact match     Lemma match     Morph. Adjustment
         tation                 All  OneS FewS  All  OneS FewS  All  OneS FewS
  Ger.   BPE        No          32   28   42    39   33   50    28   23   37
         Mix        No          42   38   48    50   38   58    37   34   43
         BPE        Annot       47   40   61    61   52   80    35   39   47
         Mix        Annot       66   65   68    83   81   88    51   47   56
         BPE        add. Annot  51   49   55    65   62   70    37   36   38
         Mix        add. Annot  65   63   69    81   78   88    51   40   56
         Mix+Annot  add. Annot  72   72   72    92   91   94    58   56   60
  Czech  BPE        No          34   25   53    44   32   67    33   24   51
         BPE        Annot       46   33   70    63   48   92    42   30   67
         Mix+Annot  Annot       64   61   69    92   91   95    60   58   65
         BPE        add. Annot  45   31   71    61   45   91    41   29   58
         Mix+Annot  add. Annot  66   63   72    92   89   95    63   60   70

The overall picture of these experiments is quite similar to the previous ones. For all three scenarios, the quality of the various systems on the official test sets is relatively similar; however, the systems differ when looking specifically at the accuracy of translating the dictionary entries. Only when combining one-shot learning with a character-based representation are we able to successfully translate the dictionary entries. Independent of the language pair and data size, we achieve an accuracy of around 70%, and an accuracy of around 90% when looking at the lemmas only. Furthermore, the model performs as well in one-shot learning as in few-shot learning.

Besides the evidence that the approach works across language pairs and data sizes, the additional experiments give some further insights. First, although the data is larger, we do not see a difference between the models using additional annotation and the models using only the baseline annotation. So it seems to be sufficient to have around 1000 examples in order to learn to copy the suggestions from the source sentence.

Furthermore, although there are no longer clear improvements from character-based representations on the overall translation performance, these representations remain essential for the dictionary integration even at this larger data size. This is highlighted by the improvements from using characters for all words in dictionary entries (Mix+Annot) instead of only for rare words (Mix).

Related work
In recent years, several different approaches to integrate additional data into neural machine translation have been suggested. If this is parallel data, fine-tuning on the additional, better matching data (Luong and Manning, 2015; Lavergne et al., 2011) is often successful. If the additional data is provided in other forms, different techniques have been investigated.

For human feedback, Turchi et al. (2017) suggested fine-tuning on human-generated post-edits. Pham et al. (2018) used phrase pairs extracted by statistical machine translation to annotate translations of rare phrases. In a similar scenario, Li et al. (2019) used a neural network to store the external phrase pairs.

Even more work has been done on integrating dictionaries into neural machine translation. A first work by Arthur et al. (2016) used an additional dictionary to influence the softmax probabilities of the neural machine translation system. Another possibility is to include the dictionary as an additional knowledge source during training using posterior regularization (Zhang et al., 2017). A different approach is chosen by Zhang and Zong (2016), who use the dictionary as additional training sentences or generate synthetic sentences. In contrast to this work, these approaches do not allow the integration of new words after training the NMT system.

Several authors investigate the integration of the dictionary as an additional constraint during the decoding process (Chatterjee et al., 2017; Hokamp and Liu, 2017; Hasler et al., 2018). This leads to a larger complexity in decoding, which has been addressed by Post and Vilar (2018). However, the dictionary is typically a hard constraint, which makes it difficult to generate word forms that do not occur in the dictionary.

Most similar to this work is the approach by Dinu et al. (2019), which, like this work and Pham et al. (2018), annotates the source sentence with possible translations.
They showed that state-of-the-artmodels no longer need architecture changes, butcan directly learn to copy form the source sentences.In this work, we additionally focus on generatingnew word morphological forms not occurring inthe dictionary. We investigated different word rep-resentations and analysed their influence on theability to copy the dictionary entries.
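The shared idea of these annotation-based approaches can be sketched as follows. This is a minimal illustration, not the paper's exact format: the marker tokens, the character-level rendering of the suggested translation, and the example dictionary entry are all assumptions.

```python
# Sketch of dictionary-based source annotation: each matched source phrase
# is followed by its suggested translation, rendered at the character level
# so the model can also generate unseen morphological variants of it.

def annotate(tokens, dictionary):
    """Append the dictionary translation after each matched source token."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok in dictionary:
            target = dictionary[tok]
            # character-level rendering of the suggested translation,
            # wrapped in (hypothetical) marker tokens
            out.extend(["<dict>"] + list(target) + ["</dict>"])
    return out

src = "the aardwolf sleeps".split()
entries = {"aardwolf": "Erdwolf"}
print(annotate(src, entries))
# ['the', 'aardwolf', '<dict>', 'E', 'r', 'd', 'w', 'o', 'l', 'f', '</dict>', 'sleeps']
```

At decoding time, the model can then learn to copy the characters between the markers into its output, or to inflect them as the target context requires.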
Conclusion

By introducing the new continuous learning test set, using a different train-test split for existing corpora, we could highlight the challenges of state-of-the-art neural machine translation systems. While they achieve very good overall performance, they are still challenged by newly emerging terms: the baseline system was only able to correctly translate 20 to 30 percent of these phrases.

Our integration of bilingual dictionaries into the systems improves the translation performance, correctly translating up to 70% of these words. In 90% of the cases, at least the lemma of the word is predicted correctly. Furthermore, in this case, we see no difference in accuracy between words seen only in the dictionary and words also seen a few times in the parallel data. However, this is only possible by addressing both challenges: enabling the model to perform one-shot learning, and modelling the different morphological forms of the rare phrases. The first is addressed by annotating the source sentence with the dictionary translation, the second by using character-based models. By combining character-based and word-based representations, we are able to model the different morphological variants of a word while still enabling fast training of the system.

As mentioned before, this work concentrates on the morphological variants of the dictionary entries and ignores ambiguities due to different possible translations. In the future, we intend to address this by including word sense disambiguation in the translation process.
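The two accuracy levels reported above (exact surface form vs. lemma) can be illustrated with a toy evaluation. This is a sketch only: the hand-written lemma table stands in for a real lemmatizer, and the example words are illustrative.

```python
# Toy sketch of the two evaluation levels: exact-form accuracy counts only
# identical surface forms as correct, while lemma accuracy also accepts a
# different inflection of the correct word.

# Hand-written lemma table; a real evaluation would use a lemmatizer instead.
LEMMAS = {"Erdwolf": "Erdwolf", "Erdwolfs": "Erdwolf"}

def accuracies(hypotheses, references):
    """Return (exact-form accuracy, lemma accuracy) over aligned word pairs."""
    exact = sum(h == r for h, r in zip(hypotheses, references))
    lemma = sum(LEMMAS.get(h, h) == LEMMAS.get(r, r)
                for h, r in zip(hypotheses, references))
    n = len(references)
    return exact / n, lemma / n

# "Erdwolfs" is the wrong surface form but the right lemma.
print(accuracies(["Erdwolf", "Erdwolfs"], ["Erdwolf", "Erdwolf"]))
# (0.5, 1.0)
```

In this example, exact-form accuracy is 50% while lemma accuracy is 100%, mirroring the gap between the roughly 70% word-level and 90% lemma-level results reported above.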
References
Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating Discrete Translation Lexicons into Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567, Austin, Texas. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th EAMT Conference, pages 261–268, Trento, Italy.

Rajen Chatterjee, Matteo Negri, Marco Turchi, Marcello Federico, Lucia Specia, and Frédéric Blain. 2017. Guiding Neural Machine Translation Decoding with External Knowledge. In Proceedings of the Second Conference on Machine Translation, pages 157–168, Copenhagen, Denmark. Association for Computational Linguistics.

Georgiana Dinu, Prashant Mathur, Marcello Federico, and Yaser Al-Onaizan. 2019. Training Neural Machine Translation to Apply Terminology Constraints. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3063–3068, Florence, Italy. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 1027–1035, Barcelona, Spain. Curran Associates Inc.

Eva Hasler, Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2018. Neural Machine Translation Decoding with Terminology Constraints. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 506–512, New Orleans, Louisiana. Association for Computational Linguistics.

Chris Hokamp and Qun Liu. 2017. Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain Control for Neural Machine Translation. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2017), pages 372–378, Varna, Bulgaria.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), Phuket, Thailand.

Thomas Lavergne, Hai-Son Le, Alexandre Allauzen, and François Yvon. 2011. LIMSI's experiments in domain adaptation for IWSLT11. In Proceedings of the International Workshop on Spoken Language Translation, pages 62–67, San Francisco, CA.

Ya Li, Xinyu Liu, Dan Liu, Xueqiang Zhang, and J. Liu. 2019. Learning Efficient Lexically-Constrained Neural Machine Translation with External Memory. ArXiv, abs/1901.11344.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam.

Jan Niehues, Eunah Cho, Thanh Le Ha, and Alex Waibel. 2016. Pre-translation for neural machine translation. In COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers, pages 1828–1836, Osaka, Japan.

Jan Niehues and Alex Waibel. 2011. Using Wikipedia to translate domain-specific terms in SMT. In Proceedings of the 8th International Workshop on Spoken Language Translation (IWSLT 2011), pages 230–237.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, and Alexander Waibel. 2019. Very Deep Self-Attention Networks for End-to-End Speech Recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria.

Ngoc-Quan Pham, Jan Niehues, and Alex Waibel. 2018. Towards one-shot learning for rare-word translation with external experts. In Proceedings of the Second Workshop on Neural Machine Translation, Melbourne, Australia.

Matt Post and David Vilar. 2018. Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans, Louisiana. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Rico Sennrich. 2017. How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich, Alexandra Birch, and Barry Haddow. 2016a. Controlling Politeness in Neural Machine Translation via Side Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pages 35–40, San Diego, California, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Marco Turchi, Matteo Negri, Amin Farajian, and Marcello Federico. 2017. Continuous Learning from Human Post-Edits for Neural Machine Translation. The Prague Bulletin of Mathematical Linguistics, 108.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTer: Translation Edit Rate on Character Level. In Proceedings of the First Conference on Statistical Machine Translation (WMT 2016), pages 505–510, Berlin, Germany.

Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior Knowledge Integration for Neural Machine Translation using Posterior Regularization. In