On Romanization for Model Transfer Between Scripts in Neural Machine Translation
Chantal Amrhein and Rico Sennrich
Department of Computational Linguistics, University of Zurich
School of Informatics, University of Edinburgh
{amrhein,sennrich}@cl.uzh.ch

Abstract
Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the transfer between related languages with different scripts. We compare two romanization tools and find that they exhibit different degrees of information loss, which affects translation quality. Finally, we extend romanization to the target side, showing that this can be a successful strategy when coupled with a simple deromanization model.
Introduction

Neural Machine Translation (NMT) has opened up new opportunities in transfer learning from high-resource to low-resource language pairs (Zoph et al., 2016; Kocmi and Bojar, 2018; Lakew et al., 2018). While transfer learning has shown great promise, the transfer between languages with different scripts brings additional challenges. For a successful transfer of the embedding layer, both the parent and the child model should use the same or a partially overlapping vocabulary (Aji et al., 2020). It is common to merge the two vocabularies by aligning identical subwords and randomly assigning the remaining subwords from the child vocabulary to positions in the parent vocabulary (Lakew et al., 2018, 2019; Kocmi and Bojar, 2020). This works well for transfer between languages that use the same script, but if the child language is written in an unseen script, most vocabulary positions are replaced by random subwords. This significantly reduces the transfer from the embedding layer. Gheini and May (2019) argue that romanization can improve transfer to languages with unseen scripts. However, romanization can also introduce information loss that might hurt translation quality. In our work, we study the usefulness of romanization for transfer from many-to-many multilingual MT models to low-resource languages with different scripts. Our contributions are the following:

- We show that romanized MT is not generally optimal, but can improve transfer between related languages that use different scripts.
- We study information loss from different romanization tools and its effect on MT quality.
- We demonstrate that romanization on the target side can also be effective when combined with a learned deromanization model.
Related Work

Initial work on transfer learning for NMT has assumed that the child language is known in advance and that the parent and child model can use a shared vocabulary (Nguyen and Chiang, 2017; Kocmi and Bojar, 2018). Lakew et al. (2018) argue that this is not feasible in most real-life scenarios and propose using a dynamic vocabulary. Most studies have since opted to replace unused parts of the parent vocabulary with unseen subwords from the child vocabulary (Lakew et al., 2019; Kocmi and Bojar, 2020); others use various methods to align embedding spaces (Gu et al., 2018; Kim et al., 2019). Recently, Aji et al. (2020) showed that transfer of the embedding layer is only beneficial if there is an overlap between the parent and child vocabulary such that embeddings for identical subwords can be aligned. Such alignments are very rare if the child language uses an unseen script.

Gheini and May (2019) train a universal vocabulary on multiple languages by romanizing languages written in a non-Latin script. Their many-to-one parent model can be transferred to new source languages without exchanging the vocabulary. In our work, we extend this idea to many-to-many translation settings using subsequent deromanization of the output. We study the trade-off between a greater vocabulary overlap and information loss as a result of romanization. Based on experiments on a diverse set of low-resource languages, we show that romanization is helpful for model transfer to related languages with different scripts.

Romanization

Romanization describes the process of mapping characters in various scripts to Latin script. This mapping is not always reversible. The goal is to approximate the pronunciation of the text in the original script. However, depending on the romanization tool, more or less information encoded in the original script is lost. We compare two tools for mapping our translation input to Latin script:

uroman (Hermjakob et al., 2018) is a tool for universal romanization that can romanize almost all character sets (https://github.com/isi-nlp/uroman). It is unidirectional; mappings from Latin script back to other scripts are not available.

uconv is a command-line tool similar to iconv that can be used for transliteration (https://linux.die.net/man/1/uconv). It preserves more information from the original script, which is expressed with diacritics. uconv is bi-directional for a limited number of script pairs.

Below is an example of the same Chinese sentence romanized with uroman and uconv:

她 到 塔 皓 湖 去 了
uroman: ta dao ta hao hu qu le
uconv: tā dào tǎ hào hú qù le
"She went to Lake Tahoe."

The two tools exhibit different degrees of information loss. uroman ignores tonal information and consequently collapses the representations of 塔 (Pinyin tǎ; 'tower') and 她 (Pinyin tā; 'she'). Romanization with uconv retains this distinction, but it still adds ambiguity and loses the distinction between 她 (Pinyin tā; 'she') and 他 (Pinyin tā; 'he'), among others. While uconv exhibits less information loss, its use of diacritics limits subword sharing between languages. We measure character-level overlap between English and romanized Arabic, Russian and Chinese with chrF scores (Popović, 2015) and find that they are much higher for uroman (9.6, 18.8 and 13.3) than for uconv (6.8, 18.1 and 7.2, respectively).
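To make this measurement concrete, the following minimal sketch romanizes text by calling the two tools and scores character-level overlap with chrF. The uroman script path, the Any-Latin transform for uconv and the sacrebleu interface are assumptions for illustration and may differ from the exact setup used in our experiments.

import subprocess
from sacrebleu.metrics import CHRF

def romanize_uroman(lines, uroman_path="uroman/bin/uroman.pl"):
    """Romanize with uroman by piping text through uroman.pl (path is an assumption)."""
    proc = subprocess.run(["perl", uroman_path], input="\n".join(lines),
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

def romanize_uconv(lines):
    """Romanize with uconv using ICU's Any-Latin transform (keeps diacritics such as tone marks)."""
    proc = subprocess.run(["uconv", "-f", "utf-8", "-t", "utf-8", "-x", "Any-Latin"],
                          input="\n".join(lines), capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

def char_overlap(english_lines, romanized_lines):
    """Character-level overlap between English text and romanized text, measured as chrF."""
    return CHRF().corpus_score(romanized_lines, [english_lines]).score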
Romanization is not necessarily reversible with simple rules due to information loss. Therefore, previous work on romanized machine translation has focused on source-side romanization only (Du and Way, 2017; Wang et al., 2018; Aqlan et al., 2019; Briakou and Carpuat, 2019; Gheini and May, 2019). We argue that romanization can also be applied on the target side, followed by an additional deromanization step. This step can be performed by a character-based Transformer (Vaswani et al., 2017) that takes data romanized with uroman or uconv as input and is trained to map it back to the original script. We provide more details on our deromanization systems in Appendix A.2.

Experimental Setup

We use OPUS-100 (Zhang et al., 2020), an English-centric dataset that includes parallel data for 100 languages (https://github.com/EdinburghNLP/opus-100-corpus). It provides up to 1 million sentence pairs for every X-EN language pair as well as 2,000 sentence pairs each for development and testing. There is no overlap between any of the data splits across any of the languages, i.e. every English sentence occurs only once.

We pretrain our multilingual models on 5 high-resource languages that cover a range of different scripts: {AR, DE, FR, RU, ZH} ↔ EN. For our transfer learning experiments, we choose 7 additional languages that are either:

(a) not closely related to any of the pretraining languages and written in an unseen script, e.g. Marathi is not related to any of our pretraining languages and written in Devanagari script;

(b) closely related to a pretraining language and written in an unseen script, e.g. Yiddish is related to German and written in Hebrew script;

(c) written in Latin script but closely related to a pretraining language in non-Latin script, e.g. Maltese is related to Arabic and written in Latin script.

Our selection of low-resource languages covers a wide range of language families and training data sizes. Table 1 gives an overview of the selected languages.

Table 1: Overview of all languages, the script they are written in, other languages in this set they are closely related to (considering lexical similarity) and the number of X ↔ EN sentence pairs. (*) marks artificial low-resource settings.
We use nematus (Sennrich et al., 2017) to train our models (https://github.com/EdinburghNLP/nematus) and SacreBLEU (Post, 2018) to evaluate them (signature: BLEU+case.mixed+lang.XX-XX+numrefs.1+smooth.exp+tok.13a+version.1.4.2). We compute statistical significance with paired bootstrap resampling (Koehn, 2004) using a significance level of 0.05 (sampling 1,000 times with replacement from our 2,000 test sentences). Our subword vocabularies are computed with byte pair encoding (Sennrich et al., 2016) using the SentencePiece implementation (Kudo and Richardson, 2018). We use a character coverage of 0.9995 to ensure the resulting models do not consist of mostly single characters.
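The paired bootstrap resampling test mentioned above can be sketched as follows; this is a minimal illustration that assumes sacrebleu's corpus-level BLEU as the metric implementation and is not necessarily the exact code used in our experiments.

import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, alpha=0.05, seed=1):
    """Paired bootstrap resampling (Koehn, 2004): resample test sentences with
    replacement and count how often system A beats system B on corpus BLEU."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample sentence indices with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        bleu_a = sacrebleu.corpus_bleu(sample_a, [sample_r]).score
        bleu_b = sacrebleu.corpus_bleu(sample_b, [sample_r]).score
        if bleu_a > bleu_b:
            wins_a += 1
    p_value = 1.0 - wins_a / n_samples  # one-sided: fraction of resamples where A does not win
    return p_value < alpha, p_value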
Bilingual Baselines: We follow the recommended setup for low-resource translation in Sennrich and Zhang (2019) to train our bilingual baselines for the low-resource pairs (original script). For our bilingual low-resource models, we use language-specific vocabularies of size 2,000.
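Such a language-specific vocabulary can be trained with the SentencePiece BPE implementation roughly as follows; the file names are placeholders and the call is a sketch rather than our exact configuration.

import sentencepiece as spm

# Language-specific BPE vocabulary for a bilingual low-resource baseline;
# character_coverage=0.9995 avoids a vocabulary that consists of mostly single characters.
spm.SentencePieceTrainer.train(
    input="train.x-en.txt",      # placeholder: concatenated training text
    model_prefix="bpe_x_en",
    model_type="bpe",
    vocab_size=2000,
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="bpe_x_en.model")
print(sp.encode("An example sentence.", out_type=str))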
Pretrained multilingual models:
We pretrain three multilingual standard Transformer Base machine translation models (Vaswani et al., 2017): One keeps the original, non-Latin script for Arabic, Russian and Chinese (orig). The others (uroman and uconv) apply the respective romanization to these parent languages. We follow Johnson et al. (2017) for multilingual training by prepending a target language indicator token to the source input. For our pretrained models, we use a shared vocabulary of size 32,000. An overview of our model hyperparameters is given in Appendix A.1.
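Conceptually, this multilingual preprocessing amounts to the following step; the concrete token format (e.g. <2en>) is an assumption and not necessarily the one used in our experiments.

def add_target_tag(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language indicator token (Johnson et al., 2017) so that one
    model can translate into several languages. The token format is hypothetical."""
    return f"<2{target_lang}> {source_sentence}"

# e.g. add_target_tag("Das ist ein Test.", "en") -> "<2en> Das ist ein Test."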
Finetuning:
We finetune our pretrained models independently for every low-resource language X. For finetuning on a child X ↔ EN pair, we use the same preprocessing as for the respective parent, i.e. we keep the original script, use uroman, or use uconv for romanization. We reuse 250,000 sentence pairs from the original pretraining data and oversample the X ↔ EN data for a total of around 650,000 parallel sentences for finetuning. This corresponds roughly to a 3:2 ratio, which helps to prevent overfitting. We early stop on the respective X ↔ EN development set. For finetuning, we use a constant learning rate of 0.001. The remaining hyperparameters are identical to pretraining.
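A sketch of how such a finetuning mixture could be assembled; the sampling details are assumptions, only the sizes follow the description above.

import random

def build_finetuning_data(parent_pairs, child_pairs, n_parent=250_000, total=650_000, seed=1):
    """Mix a sample of the parent (pretraining) data with oversampled child data,
    aiming at roughly a 3:2 child-to-parent ratio to limit overfitting."""
    rng = random.Random(seed)
    parent_sample = rng.sample(parent_pairs, min(n_parent, len(parent_pairs)))
    n_child = total - len(parent_sample)
    # Oversample the low-resource child pairs with replacement up to the target size.
    child_sample = [rng.choice(child_pairs) for _ in range(n_child)]
    mixture = parent_sample + child_sample
    rng.shuffle(mixture)
    return mixture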
For our transfer baseline without romanization, we merge our bilingual baseline vocabulary with that of the parent model following previous work (Aji et al., 2020; Kocmi and Bojar, 2020). First, we align subwords that occur in both vocabularies. Next, we assign the remaining subwords from the bilingual baseline vocabulary to random unused positions in the parent vocabulary. With uroman, we can reuse the parent vocabulary as is. uconv, however, may produce unseen diacritics, which can result in a small number of unseen subwords. If that is the case, we perform the same vocabulary replacement for these subwords as for the vocabulary with the original script.
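A simplified sketch of this vocabulary merging step; the data structures are simplified for illustration and the real implementation operates on the actual vocabulary files.

import random

def merge_vocabularies(parent_vocab, child_vocab, seed=1):
    """Map child subwords onto parent embedding positions (simplified sketch):
    identical subwords keep their parent position, the remaining child subwords
    are assigned to random parent positions that are not shared.

    parent_vocab / child_vocab: dict mapping subword -> embedding index
    """
    rng = random.Random(seed)
    # 1) Align subwords that occur in both vocabularies.
    mapping = {sw: idx for sw, idx in parent_vocab.items() if sw in child_vocab}
    # 2) Collect positions of parent subwords that are not shared with the child vocabulary.
    free_positions = [idx for sw, idx in parent_vocab.items() if sw not in child_vocab]
    rng.shuffle(free_positions)
    # 3) Assign the remaining child subwords to random free positions
    #    (the child vocabulary is much smaller than the parent vocabulary here).
    remaining = [sw for sw in child_vocab if sw not in parent_vocab]
    for subword, idx in zip(remaining, free_positions):
        mapping[subword] = idx
    return mapping  # child subword -> row of the parent embedding matrix to reuse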
Results

To study the effects of information loss from romanization, we compare the translation quality of our three pretrained multilingual models. To minimize the impact of deromanization, we only discuss X → EN directions for languages with non-Latin scripts. The results are presented in Table 2. Whether romanization hurts the translation quality depends largely on the language pair. For example, for ZH → EN, both romanization tools perform worse than the model trained on original scripts. This is in line with our previous discussion: Even though uconv keeps tonal information, there is still more ambiguity compared to using Chinese characters. The model trained with uconv romanization consistently outperforms uroman. This indicates that it is more important to minimize information loss than to maximize subword sharing.

Table 2: X → EN BLEU scores (Papineni et al., 2002) of the multilingual pretrained models trained on original scripts (orig), romanized with uroman and uconv. Best systems (no other being statistically significantly better) marked in bold.

An additional effect of using romanization, and thus being able to reuse the subword segmentation model during transfer, is that compression rates are worse than for dedicated segmentation models (see Table 3). The resulting longer sequences with potentially suboptimal subword splits may also have a negative influence on translation quality.

Table 3: Average number of subwords per sentence with original script data (orig) and relative change in % after romanization (uroman and uconv). Original script data is segmented with a shared subword segmentation model for {AR, DE, EN, FR, RU, ZH} and language-specific models for low-resource languages. For uroman and uconv, all languages are segmented using a shared model for {AR, DE, EN, FR, RU, ZH}, romanized with the respective tool.

      orig   uroman (%)   uconv (%)
ar    67.7    +2.2          +9.9
de    97.8    -0.5          -0.8
fr   131.7    -0.4          -0.6
ru    91.5    +3.3          -0.2
zh    54.1   +98.9        +156.6
am   113.0   +70.4         +83.1
he    40.3   +17.6         +20.1
mr    42.4   +36.8         +35.4
mt   176.5    -1.9          -1.4
sh   168.0    -4.7          -5.5
ta   138.3   +20.1         +22.3
yi    54.2   +12.4         +39.5

Table 4 compares our character-based Transformers to uconv's built-in, rule-based deromanization. Relying on uconv's built-in deromanization is not optimal. First, it does not support mappings back into all scripts. Second, the performance of built-in uconv deromanization varies with the amount of "script code-switching", e.g. due to hyperlinks or email addresses. Character-based Transformers can learn to handle mixed script and outperform uconv's built-in deromanization.

Table 4: chrF scores of the deromanization to the original script. Best systems marked in bold.

Our models can reconstruct the original script much better from uconv data than from uroman. This is not surprising considering that uroman causes more information loss and ambiguity. As a shallow measure of the ambiguity introduced, we can compare the vocabulary size (before subword segmentation): With romanization, the total number of types in our training sets decreases on average by 10% for uconv and by 14% for uroman. Preliminary experiments with artificial low-resource settings (Appendix B.1) showed that additional training data can improve deromanization, but it performs well even with very small amounts of training data (10,000 sentences). This shows that our proposed character-based Transformer models are powerful enough to learn a mapping back to the original script as much as this is possible, given the increased ambiguity. This finding is supported by concurrent work showing that character-based Transformers are well-suited to a range of string transduction tasks (Wu et al., 2020).
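The type-based ambiguity measure described above can be computed along the following lines; this is a rough sketch, and simple whitespace tokenization is an assumption that only approximates type counts for languages such as Chinese.

def type_reduction(original_lines, romanized_lines):
    """Relative reduction in the number of word types (before subword segmentation)
    after romanization, as a shallow proxy for the introduced ambiguity."""
    orig_types = {tok for line in original_lines for tok in line.split()}
    rom_types = {tok for line in romanized_lines for tok in line.split()}
    return 1.0 - len(rom_types) / len(orig_types)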
Table 5 shows the results from our experiments on transfer learning with romanization. Romanizing non-Latin scripts is not always useful. For low-resource languages that use an unseen script but are not related to any of the pretraining languages (a), the performance degrades for uroman and is not statistically significantly different for uconv. The extremely low BLEU score for EN → AM shows another problem with uroman romanization: uroman ignores the Ethiopic word space character, which increases the distance between translation and reference. However, for languages that are related to a pretraining language with a different script (groups (b) and (c)), there is an added benefit of using romanization. The statistically significant improvement of uconv over uroman strengthens our claim that it is important to keep as much information as possible from the original script when mapping to Latin script. Despite potential information loss from romanization and error propagation from deromanization, our results show that romanization has merit when applied to related languages that can profit from a greater vocabulary overlap.
Table 5: BLEU scores of the bilingual baselines (no transfer learning) and finetuned models using original scripts (orig), romanized with uroman and uconv. Average improvement over the bilingual baseline is shown per group of languages. Best systems (no other being statistically significantly better) marked in bold.

Conclusion

We analyzed the value of romanization for transferring multilingual models to low-resource languages with different scripts. While we cannot recommend romanization as the default strategy for multilingual models and transfer learning across scripts because of the information loss inherent to it, we find that it benefits transfer between related languages that use different scripts. The uconv romanization tool outperforms uroman because it preserves more information encoded in the original script and consequently causes less information loss. Furthermore, we demonstrated that romanization can also be successful on the target side if followed by an additional, learned deromanization step. We hope that our results provide valuable insights for future work in transfer learning and practical applications for low-resource languages with unseen scripts.
Acknowledgements
We thank our colleagues Anne, Annette, Duygu, Jannis, Mathias, Noëmi and the anonymous reviewers for their helpful feedback. This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727).

References
Alham Fikri Aji, Nikolay Bogoychev, Kenneth Heafield, and Rico Sennrich. 2020. In neural machine translation, what does transfer learning transfer? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7701–7710, Online. Association for Computational Linguistics.
Fares Aqlan, Xiaoping Fan, Abdullah Alqwbani, and Akram Al-Mansoub. 2019. Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation. IEEE Access, 7:133122–133135.
Eleftheria Briakou and Marine Carpuat. 2019. The University of Maryland's Kazakh-English neural machine translation system at WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 134–140, Florence, Italy. Association for Computational Linguistics.
Jinhua Du and Andy Way. 2017. Pinyin as Subword Unit for Chinese-Sourced Neural Machine Translation. In Proceedings of the 25th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, Ireland, December 7-8, 2017, volume 2086 of CEUR Workshop Proceedings, pages 89–101. CEUR-WS.org.
Mozhdeh Gheini and Jonathan May. 2019. A universal parent model for low-resource neural machine translation transfer.
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.
Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal Romanization tool uroman. In Proceedings of ACL 2018, System Demonstrations, pages 13–18, Melbourne, Australia. Association for Computational Linguistics.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Yunsu Kim, Yingbo Gao, and Hermann Ney. 2019. Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1246–1257, Florence, Italy. Association for Computational Linguistics.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization.
Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244–252, Brussels, Belgium. Association for Computational Linguistics.
Tom Kocmi and Ondřej Bojar. 2020. Efficiently reusing old models across languages via transfer learning. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 19–28, Lisboa, Portugal. European Association for Machine Translation.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Surafel M. Lakew, Aliia Erofeeva, Matteo Negri, Marcello Federico, and Marco Turchi. 2018. Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary.
Surafel M. Lakew, Alina Karakanta, Marcello Federico, Matteo Negri, and Marco Turchi. 2019. Adapting Multilingual Neural Machine Translation to Unseen Languages.
Toan Q. Nguyen and David Chiang. 2017. Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 296–301, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 211–221, Florence, Italy. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Boli Wang, Jinming Hu, Yidong Chen, and Xiaodong Shi. 2018. XMU neural machine translation systems for WAT2018 Myanmar-English translation task. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation, Hong Kong. Association for Computational Linguistics.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2020. Applying the Transformer to Character-level Transduction.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas. Association for Computational Linguistics.
A Model Details
A.1 Multilingual Pretrained Models
We train multilingual Transformer Base machine translation models (Vaswani et al., 2017) with 6 encoder layers, 6 decoder layers, 8 heads, an embedding and hidden state dimension of 512 and a feed-forward network dimension of 2048. We regularize our models with a dropout of 0.1 for the embeddings, the residual connections, in the feed-forward sub-layers and for the attention weights. Furthermore, we apply exponential smoothing of 0.0001 and label smoothing of 0.1. We tie both our encoder and decoder input embeddings as well as the decoder input and output embeddings (Press and Wolf, 2017). All of our multilingual machine translation models are trained with a maximum token length of 200 and a vocabulary of size 32,000. For optimization, we use Adam (Kingma and Ba, 2015) with standard hyperparameters and a learning rate of 0.0001. We follow the Transformer learning schedule described in Vaswani et al. (2017) with a linear warmup over 4,000 steps. Our token batch size is set to 16,348 and we train on 4 NVIDIA Tesla V100 GPUs. All models were trained using the implementation provided in nematus (Sennrich et al., 2017) using early stopping on a development set with patience 5.
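For reference, the learning rate schedule of Vaswani et al. (2017) with linear warmup can be written as follows; this is a generic sketch rather than the exact nematus implementation.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from Vaswani et al. (2017): linear warmup over
    `warmup_steps`, then decay proportional to the inverse square root of the step."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)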
A.2 Character-Based Deromanization
We train character-based Transformer Base machine translation models (Vaswani et al., 2017). To achieve character-level deromanization, we do not make any changes to the architecture. We simply change the input format such that every character is separated by spaces. The original space characters are replaced by another character that does not occur in the training data (rendered here as ␣). The following example shows the parallel training data for learned deromanization:

uroman source: C h t o ␣ t a m ␣ d a l s h e ?
uconv source: Č t o ␣ t a m ␣ d a l ' š e ?
target: Ч т о ␣ т а м ␣ д а л ь ш е ?
"What's next?"

We use a maximum sequence length of 1,200 since character-level sequences are much longer than subword-level sequences. Our vocabularies are made up of all characters that occur in the respective training data. All other parameters are set as for multilingual pretraining described in Appendix A.1.
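This input format corresponds to the following preprocessing; a sketch in which ␣ again stands in for the reserved space-replacement character.

SPACE_SYMBOL = "␣"  # stand-in for a character that does not occur in the training data

def to_char_level(text: str) -> str:
    """Prepare text for the character-based deromanization Transformer:
    replace spaces by a reserved symbol, then separate all characters by spaces."""
    return " ".join(text.replace(" ", SPACE_SYMBOL))

def from_char_level(chars: str) -> str:
    """Invert the preprocessing on the model output."""
    return chars.replace(" ", "").replace(SPACE_SYMBOL, " ")

# to_char_level("Что там дальше?") -> "Ч т о ␣ т а м ␣ д а л ь ш е ?"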
B Supplementary Results
B.1 Effect of Data Size on Deromanization

Figure 1: chrF scores of deromanization models trained on 1%, 10% and 100% of the total data (corresponding to 10,000, 100,000 and 1,000,000 parallel sentences). Results compare romanization with uroman and uconv for Arabic, Russian and Chinese.
Figure 1 shows the influence of the training data size on the chrF score between the deromanized test set and the original script test set. Additional data can improve deromanization models, especially for languages such as Chinese, where a mapping back to the original script is difficult to learn due to the information loss from romanization.

We analyze how deromanization quality affects the BLEU score of deromanized translations. This is shown in Table 6. We find that the deromanization models for uroman are more affected by an extreme low-resource setting. For uconv, deromanization models trained on smaller data sets show less performance loss compared to using full data. It is notable that training uconv deromanization models only on 100,000 sentences has almost no effect on the BLEU score for EN → AR and EN → ZH. For EN → RU, there is a loss of 1.1 BLEU points compared to training on 100% of the data. Looking at the deromanization outputs for EN → RU, we found that deromanization models trained on less data could not handle "script code-switching".

Table 6: BLEU scores of deromanized EN → AR, EN → RU and EN → ZH translations, with deromanization models for uroman and uconv trained on 1%, 10% and 100% of the data.