Facilitating Terminology Translation with Target Lemma Annotations
Toms Bergmanis†‡ and Mārcis Pinnis†‡
† Tilde / Vienības gatve 75A, Riga, Latvia
‡ Faculty of Computing, University of Latvia / Raiņa bulv. 19, Riga, Latvia
{firstname.lastname}@tilde.lv

Abstract
Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In the day-to-day work of professional translators, however, this is seldom the case, as translators work with bilingual glossaries where terms are given in their dictionary forms; finding the right target language form is part of the translation process. We argue that the requirement for a priori specified target language forms is unrealistic and impedes the practical applicability of previous work. In this work, we propose to train machine translation systems using a source-side data augmentation method that annotates randomly selected source language words with their target language lemmas. We show that systems trained on such augmented data are readily usable for terminology integration in real-life translation scenarios. Our experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems with no means for terminology integration and an average improvement of 4 BLEU points over the previous work. Results of the human evaluation indicate a 47.7% absolute improvement over the previous work in term translation accuracy when translating into Latvian.

Translation into morphologically complex languages involves 1) making a lexical choice for a word in the target language and 2) finding its morphological form that is suitable for the morpho-syntactic context of the target sentence. Most of the recent work on terminology translation, however, has assumed that the correct morphological forms are a priori known (Hokamp and Liu, 2017; Post and Vilar, 2018; Hasler et al., 2018; Dinu et al., 2019; Song et al., 2020; Susanto et al., 2020; Dougal and Lonsdale, 2020). Thus, previous work has approached terminology translation predominantly as a problem of making sure that the decoder's output contains lexically and morphologically pre-specified target language terms. While useful in some cases and for some languages, such approaches fall short of addressing terminology translation into morphologically complex languages, where each word can have many morphological surface forms.

Relevant materials and code: https://github.com/tilde-nlp/terminology_translation

For terminology translation to be viable for translation into morphologically complex languages, terminology constraints have to be soft. That is, terminology translation has to account for various natural language phenomena which cause words to have more than one manifestation of their root morphemes. Multiple root morphemes complicate the application of hard constraint methods, such as constrained decoding (Hokamp and Liu, 2017). That is because, even after the terminology constraint is stripped of the morphemes that encode all grammatical information, the remaining root morphemes can still be too restrictive to be used as hard constraints, since for many words there can be more than one possible root morpheme. An illustrative example is the consonant mutation in the Latvian noun vācietis ("the German"), which undergoes the mutation t → š, thus yielding two variants of its root morpheme: vācieš- and vāciet- (Bergmanis, 2020). If either of the forms is used as a hard constraint for constrained decoding, the other one is excluded from appearing in the sentence's translation.

We propose a necessary modification for the method introduced by Dinu et al.
(2019), which allows training neural machine translation (NMT) systems that are capable of applying terminology constraints: instead of annotating source-side terminology with their exact target language translations, we annotate randomly selected source language words with their target language lemmas. First of all, preparing training data in such a way relaxes the requirement for access to bilingual terminology resources at training time. Second, we show that the model trained on such data does not learn to simply copy inline annotations as in the case of Dinu et al. (2019), but learns copy-and-inflect behaviour instead, thus addressing the need for soft terminology constraints.

Our results show that the proposed approach not only relaxes the requirement for a priori specified target language forms but also yields substantial improvements over the previous work (Dinu et al., 2019) when tested on the morphologically complex Baltic and Uralic languages.

Src.:    faulty engine or in transmission [..]
LV Trg.: atteice dzinējā vai transmisijas [..]
ETA:     faulty|w engine|s dzinējā|t or|w transmission|s transmisijas|t [..]
TLA:     faulty|w engine|s dzinējs|t or|w transmission|s transmisija|t [..]

Table 1: Examples of differences in input data in ETA (Dinu et al., 2019) and TLA (this work). Differences of inline annotations are marked in bold. |w, |s, and |t denote the values of the additional input stream and stand for regular words, source language annotated words, and target language annotations respectively.

To train NMT systems that allow applying terminology constraints, Dinu et al. (2019) prepare training data by amending source language terms with their exact target annotations (ETA).
To inform the NMT model about the nature of each token (i.e., whether it is a source language term, its target language translation, or a regular source language word), the authors use an additional input stream: source-side factors (Sennrich and Haddow, 2016). Their method, however, is limited to cases in which the provided annotation matches the required target form and can be copied verbatim, thus performing poorly in cases where the surface forms of terms in the target language differ from those used to annotate source language sentences (Dinu et al., 2019). This constitutes a problem for the method's practical applicability in real-life scenarios.

         Train    Test
                  ATS    WMT17+IATE
EN-DE    27.6M    768    581
EN-ET     2.4M    768    -
EN-LV    22.6M    768    -
EN-LT    22.1M    768    -

Table 2: Training and evaluation data sizes in numbers of sentences. WMT2017+IATE stands for the English-German test set from the news translation task of WMT2017, which is annotated with terminology from the IATE terminology database.

In this work, we propose two changes to the approach of Dinu et al. (2019).
First, when preparing training data, instead of using terms found in either IATE or Wiktionary as done by Dinu et al. (2019), we annotate random source language words. This relaxes the requirement for curated bilingual dictionaries for training data preparation. Second, rather than providing exactly those target language forms that are used in the target sentence, we use target lemma annotations (TLA; see Table 1 for examples).
We hypothesise that, in order to benefit from such annotations, the NMT model will have to learn copy-and-inflect behaviour instead of the simple copying proposed by Dinu et al. (2019). Our work is similar to that of Exel et al. (2020), in which the authors also aim to achieve copy-and-inflect behaviour. However, they limit their annotations to only those terms whose base forms differ by no more than two characters from the forms required in the target language sentence. Thus, wordforms undergoing longer affix changes, or inflections accompanied by such linguistic phenomena as consonant mutation, consonant gradation, or other stem changes, are never included in training data.
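To make the input format in Table 1 concrete, the following minimal sketch (an illustration, not the released preprocessing code of this work; all names are ours) serialises a source sentence and a set of word-to-target-lemma annotations into the two parallel streams: the token stream with inline target lemmas and the factor stream carrying the w/s/t labels.

```python
def serialise_tla(tokens, annotations):
    """tokens: source language words; annotations: {token index: target lemma}.
    Returns the token stream (with inline target lemmas) and the parallel
    factor stream with values w / s / t, as in Table 1."""
    out_tokens, out_factors = [], []
    for i, tok in enumerate(tokens):
        if i in annotations:
            out_tokens += [tok, annotations[i]]
            out_factors += ["s", "t"]
        else:
            out_tokens.append(tok)
            out_factors.append("w")
    return " ".join(out_tokens), " ".join(out_factors)


src = "faulty engine or in transmission".split()
lemma_annotations = {1: "dzinējs", 4: "transmisija"}  # lemmas, not inflected forms
print(serialise_tla(src, lemma_annotations))
# ('faulty engine dzinējs or in transmission transmisija', 'w s t w w s t')
```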
Languages and Data.
As our focus is on morphologically complex languages, in our experiments we translate from English into Latvian and Lithuanian (Baltic branch of the Indo-European language family) as well as Estonian (Finnic branch of the Uralic language family). For comparability with the previous work, we also use English-German (Germanic branch of the Indo-European language family). For all language pairs, we use all data that is available in the Tilde Data Library, with the exception of English-Estonian, for which we use data from WMT 2018. The size of the parallel corpora after pre-processing using the Tilde MT platform (Pinnis et al., 2018) and filtering tools (Pinnis, 2018) is given in Table 2.

To prepare data with TLA, we first lemmatise and part-of-speech (POS) tag the target language side of the parallel corpora. For lemmatisation and POS tagging, we use pre-trained Stanza (Qi et al., 2020) models (https://github.com/stanfordnlp/stanza). We then use fast_align (Dyer et al., 2013) (https://github.com/clab/fast_align) to learn word alignments between the target language lemmas and source language inflected words. We only annotate verbs or nouns. To generate sentences with varying proportions of annotated and unannotated words, we first generate a sentence-level annotation threshold uniformly at random from the interval [0.0, 1.0]. Similarly, for each word in the source language sentence, we generate another number uniformly at random from the interval [0.0, 1.0]. If the latter is larger than the sentence-level annotation threshold, we annotate the respective word with its target language lemma. We mix the original training data and the annotated data in a proportion of 1:1. We follow Dinu et al. (2019) to prepare ETA and replicate their results.

For validation during training, we use development sets from the WMT news translation shared tasks. For EN-ET and EN-DE, we used the data from WMT 2018, for EN-LV – WMT 2017, and for EN-LT – WMT 2019.
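The stochastic annotation procedure described above can be sketched as follows. The sketch assumes that target-side lemmas, POS tags, and source-to-target word alignments have already been produced (e.g., with Stanza and fast_align); the function and variable names are illustrative, not taken from the released code.

```python
import random

ANNOTATABLE_POS = {"NOUN", "VERB"}  # only nouns and verbs are annotated


def choose_annotations(src_tokens, tgt_lemmas, tgt_pos, alignment):
    """alignment: iterable of (src_idx, tgt_idx) pairs.
    Returns {src_idx: target language lemma} for the words to annotate."""
    # Sentence-level annotation threshold, drawn uniformly from [0.0, 1.0].
    sent_threshold = random.uniform(0.0, 1.0)
    src2tgt = dict(alignment)  # simplification: one aligned target word per source word
    annotations = {}
    for i in range(len(src_tokens)):
        j = src2tgt.get(i)
        if j is None or tgt_pos[j] not in ANNOTATABLE_POS:
            continue
        # Per-word draw: annotate only if it exceeds the sentence-level threshold.
        if random.uniform(0.0, 1.0) > sent_threshold:
            annotations[i] = tgt_lemmas[j]
    return annotations
```

Sentences annotated this way are then mixed with the original, unannotated training data at the 1:1 proportion mentioned above.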
MT Model and Training.

For the most part, we use the default configuration of the Transformer (Vaswani et al., 2017) NMT model implementation of the Sockeye NMT toolkit (Hieber et al.). The exception is the use of source-side factors (Sennrich and Haddow, 2016) with a dimensionality of 8 for the systems using inline target lemma annotations. We train all models using early stopping with a patience of 10 based on their development set perplexity (Prechelt, 1998).
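Source-side factors are combined with the word embeddings before they enter the encoder; in the concatenation variant, each factor value receives its own small embedding (8 dimensions in our systems) that is appended to the token embedding. The numpy sketch below is only a schematic illustration of this idea, not the Sockeye implementation, and the vocabularies and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

TOKEN_DIM, FACTOR_DIM = 512, 8  # factor embedding dimensionality of 8
token_vocab = {"faulty": 0, "engine": 1, "dzinējs": 2, "or": 3}
factor_vocab = {"w": 0, "s": 1, "t": 2}

token_emb = rng.normal(size=(len(token_vocab), TOKEN_DIM))
factor_emb = rng.normal(size=(len(factor_vocab), FACTOR_DIM))


def embed(tokens, factors):
    """Concatenate the factor embedding to each token embedding, giving
    encoder inputs of TOKEN_DIM + FACTOR_DIM dimensions per position."""
    return np.stack([
        np.concatenate([token_emb[token_vocab[t]], factor_emb[factor_vocab[f]]])
        for t, f in zip(tokens, factors)
    ])


x = embed(["faulty", "engine", "dzinējs", "or"], ["w", "s", "t", "w"])
print(x.shape)  # (4, 520)
```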
Evaluation Methods and Data.

In previous work, methods were tested on general domain data annotated with exact surface forms of general-domain words from IATE (https://iate.europa.eu) and Wiktionary. Although data constructed in such a way is not only artificial but also gives an oversimplified view of terminology translation, we do use the data from IATE (https://github.com/mtresearcher/terminology_dataset) to validate our re-implementation of the method of Dinu et al. (2019). Other than that, we test on the Automotive Test Suite (ATS; https://github.com/tilde-nlp/terminology_translation): a data set containing translations of the same 768 sentences in English, Estonian, German, Latvian, and Lithuanian. ATS contains about 1.1k term occurrences from a glossary prepared by professional translators. When annotating terms in the source text, we use only the dictionary forms of term translations, since in practical applications having access to the correct inflections (surface forms) is unrealistic.
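At test time, the only terminology resource we assume is a bilingual glossary whose target entries are in dictionary (lemma) form. The sketch below shows how single-word term occurrences could be matched and annotated before serialisation into the factored input; the real matching of multi-word terms and inflected source forms is more involved, and the glossary entry and names are illustrative.

```python
def annotate_with_glossary(src_tokens, glossary):
    """glossary: {lowercased source term: target language lemma}.
    Returns {src_idx: target lemma} for every matched term occurrence,
    ready to be serialised into the factored input format."""
    return {
        i: glossary[tok.lower()]
        for i, tok in enumerate(src_tokens)
        if tok.lower() in glossary
    }


glossary = {"transmission": "transmisija"}  # dictionary form only
print(annotate_with_glossary("faulty engine or in transmission".split(), glossary))
# {4: 'transmisija'}
```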
We compare our work with an NMT system without means for terminology integration (Baseline) and with the previous work by Dinu et al. (2019) (ETA). Although our preliminary experiments with constrained decoding (Post and Vilar, 2018) (CD) confirmed the finding by Dinu et al. (2019) that strict enforcement of constraints leads to lower-than-baseline quality, we nevertheless include them for completeness' sake.

Figure 1: Example of forms used in human evaluation.

Table 3: Results of the automatic evaluation metrics BLEU and term translation accuracy (Acc.) for the Baseline, CD, ETA, and TLA systems on the WMT2017+IATE (EN-DE) and ATS (EN-DE, EN-ET, EN-LV, EN-LT) test sets. The numerically highest score in each column is given in bold; † and ‡ indicate statistically significant improvements of BLEU over Baseline and ETA respectively (all p < 0.05).

Table 4: Results of human evaluation: term (on the left) and sentence (on the right) translation quality judgements in %. Sentence comparison is pairwise, contrasting TLA vs Baseline and TLA vs ETA. κ free: inter-annotator agreement according to free marginal kappa (Randolph, 2005).

Similarly to the previous work, we use two automatic means for evaluation: BLEU (Papineni et al., 2002) and lemmatised term exact match accuracy. We use BLEU as an extrinsic evaluation metric, as we expect that, when successful, methods for terminology translation should yield substantial overall translation quality improvements due to correctly translated domain-specific terms. For significance testing, we use pairwise bootstrap resampling (Koehn, 2004). We use lemmatised term exact match accuracy as an intrinsic metric because it directly measures the adequacy of terminology translation (i.e., whether or not the correct lexeme appears in the target sentence).

We are aware that automatic evaluation methods are merely an approximation of translation quality. For example, we use lemmatised term exact match accuracy to measure term use in target language translations; however, it does not capture whether the term is inflected correctly. Thus, human evaluation is in order. We use the EN-LV language pair to compare TLA against the Baseline and ETA. We use a randomly selected 100-sentence ATS subset that contains 147 terms of the original test suite. We employ four professional translators and Latvian native speakers to compare each system's translations according to their overall translation quality and to judge individual term translation quality. Specifically, given the original sentence and its two translations (in a randomised order), raters are asked to answer "which system's translation is better overall?". Raters are also given a list of the terms being evaluated and their reference translations (from the term collection) and are asked to classify translations as either "Correct", "Wrong lexeme", "Wrong inflection", or "Other". Figure 1 gives an example of the forms presented to raters during the human evaluation of term and overall translation quality. We report inter-annotator agreement using free marginal kappa, κ free (Randolph, 2005).
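For concreteness, below are minimal sketches of the two measures described above: lemmatised term exact match accuracy and Randolph's free-marginal kappa. Both are illustrations under simplifying assumptions (e.g., the accuracy check only tests whether all lemmas of a term appear somewhere in the lemmatised hypothesis), not the exact evaluation scripts used here; all names are ours.

```python
def term_exact_match_accuracy(hyp_lemma_sents, term_lemma_lists):
    """hyp_lemma_sents: lemmatised MT outputs, one list of lemmas per sentence;
    term_lemma_lists: expected term lemmas (strings) for each sentence.
    A term counts as correct if all of its lemmas occur in the hypothesis."""
    total = correct = 0
    for hyp, terms in zip(hyp_lemma_sents, term_lemma_lists):
        for term in terms:
            total += 1
            correct += all(lemma in hyp for lemma in term.split())
    return correct / total if total else 0.0


def free_marginal_kappa(labels_per_item, num_categories):
    """Randolph's free-marginal multirater kappa: observed pairwise rater
    agreement corrected by a chance agreement of 1/k (k = number of categories)."""
    per_item = []
    for labels in labels_per_item:  # the labels assigned by all raters to one item
        pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    p_obs = sum(per_item) / len(per_item)
    p_chance = 1.0 / num_categories
    return (p_obs - p_chance) / (1.0 - p_chance)
```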
Automatic Evaluation.

We first validate our re-implementation of ETA by testing on the English-German WMT 2017 test set annotated with terms from IATE, as used by Dinu et al. (2019). Results (see columns 2 and 3 of Table 3) are similar to those of the previous work: on this data set, ETA yields minor translation quality improvements over the baseline (+0.2 BLEU) and a considerable improvement (+14.5%) in term translation accuracy.

When evaluated on the ATS, systems using TLA always yield results that are better than the baseline, both in terms of BLEU scores (+1.4–7 BLEU) and term translation accuracy (29.8%–47.8%) (see columns 4–11 of Table 3). Results also show that, when compared to ETA, systems integrating terminology using TLA achieve statistically significant improvements in terms of BLEU scores for three out of four language pairs. An exception is EN-DE, for which both systems, ETA and TLA, perform similarly. Analysing reference translations of the EN-DE language pair, we find that as many as 87% of the German terms are used in their dictionary forms, which explains the comparable performance of systems trained using ETA and TLA on EN-DE.

Results also confirm the finding of the previous work by Dinu et al. (2019) and Exel et al. (2020) that the strict enforcement of constraints by constrained decoding leads to lower-than-baseline BLEU scores on all data sets for all languages. BLEU scores are abysmal when translating into the morphologically complex languages, as for these languages the citation form seldom happens to be the form required in the target language sentence. This result further illustrates why terminology constraints have to be soft when translating into morphologically complex languages.
Human Evaluation.
Results of the human evaluation of the EN-LV systems are summarised in Table 4. First, we note that on this dataset the baseline system translates terms correctly 55% of the time, yet for most of the remaining cases it makes mistakes by choosing the wrong lexeme (Table 4, left). The system using ETA, on the other hand, has a much lower rate of correctly translated terms, 45%, which roughly corresponds to the proportion of Latvian terms in the reference translations that are used in their dictionary forms (47%). The remaining cases are mistranslated by choosing the wrong inflected form. The system using TLA, in comparison, does very well, as it gets terminology translations right 93% of the time. Examining the cases where terms had been mistranslated by choosing the wrong lexeme, we find that most of these cases are multi-word terms with some other word inserted between their constituent parts. The high κ free values indicate almost perfect inter-annotator agreement, suggesting that the task of term translation quality evaluation has been easy and the results are reliable.

The overall sentence translation quality judgements (Table 4, right) also favour translations produced by the system using TLA, deeming it better than or on par with the baseline system and the system using ETA 97% of the time. The system using TLA is strictly favoured over its ETA counterpart for 61% of the translations. Again, annotators have reached an almost perfect agreement when comparing the systems using TLA and ETA, suggesting that the task has been easy. These results clearly show that, at least for the EN-LV language pair and the test set considered here, systems using TLA improve term translation quality by correctly choosing adequate translations and morpho-syntactically appropriate inflections.
Productivity of NMT models.

Terminology translation frequently involves the translation of niche lexemes with rare or even unseen inflections. Thus, the model's ability to generate novel wordforms is critical for high-quality translations. To verify whether our NMT models are lexically and morphologically productive, we analysed Latvian translations of ATS produced by the system using TLA and looked for wordforms that are not present in either the source or the target language side of the training data. We found 72 such wordforms. Of those, 45 (62.5%) were valid wordforms that were not present in the training data, of which 28 were novel inflections related to ATS terminology use, while the remaining 17 were novel forms of general words. We interpret this as some evidence that the NMT model, when needed, generates novel wordforms. The remaining 27 (37.5%) were not valid, albeit sometimes plausible, Latvian words, common types of errors being literal translations and transliterations of English words, as well as words that would have been correct if not for errors with consonant mutation.
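The productivity check described above can be implemented along the following lines; this is a sketch with illustrative names, and the flagged wordforms were then inspected manually.

```python
def novel_wordforms(mt_output_sents, training_files):
    """Return the word types that occur in the MT output but on neither the
    source nor the target side of the training data."""
    seen = set()
    for path in training_files:  # both source and target training files
        with open(path, encoding="utf-8") as corpus:
            for line in corpus:
                seen.update(line.split())
    produced = set()
    for sent in mt_output_sents:
        produced.update(sent.split())
    return produced - seen
```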
We proposed TLA: a flexible and easy-to-implement method for terminology integration in NMT. Using TLA does not require access to bilingual terminology resources at system training time, as it annotates ordinary words with the lemmas of their target language translations. This greatly simplifies data preparation and also relaxes the requirement for a priori specified target language forms during translation, making our method practically viable for terminology translation in real-life scenarios. Results from experiments on three morphologically complex languages demonstrated substantial and systematic improvements, both in automatic evaluation and in human evaluation of term and overall translation quality, over baseline NMT systems without means for terminology integration and over the previous work.
Acknowledgements

References
Toms Bergmanis. 2020. Methods for morphology learning in low(er)-resource scenarios. Ph.D. thesis, The University of Edinburgh.

Georgiana Dinu, Prashant Mathur, Marcello Federico, and Yaser Al-Onaizan. 2019. Training neural machine translation to apply terminology constraints. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3063–3068, Florence, Italy. Association for Computational Linguistics.

Duane K. Dougal and Deryle Lonsdale. 2020. Improving NMT quality using terminology injection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4820–4827, Marseille, France. European Language Resources Association.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Miriam Exel, Bianka Buschbeck, Lauritz Brandt, and Simona Doneva. 2020. Terminology-constrained neural machine translation at SAP. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 271–280, Lisboa, Portugal. European Association for Machine Translation.

Eva Hasler, Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2018. Neural machine translation decoding with terminology constraints. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 506–512, New Orleans, Louisiana. Association for Computational Linguistics.

Felix Hieber, Tobias Domhan, Michael Denkowski, and David Vilar. Sockeye 2: A toolkit for neural machine translation, page 457.

Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Mārcis Pinnis. 2018. Tilde's parallel corpus filtering methods for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 939–945.

Mārcis Pinnis, Andrejs Vasiļjevs, Rihards Kalniņš, Roberts Rozis, Raivis Skadiņš, and Valters Šics. 2018. Tilde MT platform for developing client specific MT solutions. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324.

Lutz Prechelt. 1998. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.

Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater κ free): An alternative to Fleiss' fixed-marginal multirater kappa. In Presented at the Joensuu Learning and Instruction Symposium, volume 2005.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83–91, Berlin, Germany. Association for Computational Linguistics.

Kai Song, Kun Wang, Heng Yu, Yue Zhang, Zhongqiang Huang, Weihua Luo, Xiangyu Duan, and Min Zhang. 2020. Alignment-enhanced transformer for constraining NMT with pre-specified translations. AAAI.

Raymond Hendy Susanto, Shamil Chollampatt, and Liling Tan. 2020. Lexically constrained neural machine translation with Levenshtein transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3536–3543.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.