Trivial Transfer Learning for Low-Resource Neural Machine Translation
Tom Kocmi and Ondřej Bojar
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague, Czech Republic
Abstract
Transfer learning has been proven as an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a “parent” model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This “child” model performs significantly better than the baseline trained for the low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.
Introduction

Neural machine translation (NMT) has made a big leap in performance and became the unquestionable winning approach in the past few years (Bahdanau et al., 2014; Sutskever et al., 2014; Sennrich et al., 2017; Vaswani et al., 2017). The main reason behind the success of NMT in realistic conditions was the ability to handle a large vocabulary (Sennrich et al., 2016b) and to utilize large monolingual data (Sennrich et al., 2016a). However, NMT still struggles if the parallel data is insufficient (e.g. fewer than 1M parallel sentences), producing fluent output unrelated to the source and performing much worse than phrase-based machine translation (Koehn and Knowles, 2017).

Many strategies have been used in MT in the past for employing resources from additional languages, see e.g. Wu and Wang (2007), Nakov and Ng (2012), El Kholy et al. (2013), or Hoang and Bojar (2016). For NMT, a particularly promising approach is transfer learning or “domain adaptation” where the “domains” are the different languages.

For example, Zoph et al. (2016) train a “parent” model in a high-resource language pair, then use some of the trained weights as the initialization for a “child” model and further train it on the low-resource language pair. In Zoph et al. (2016), the parent and child pairs shared the target language (English) and a number of modifications of the training process were needed to achieve an improvement in translation from Hausa, Turkish, and Uzbek into English with the help of French-English data.

Nguyen and Chiang (2017) explore a related scenario where the parent language pair is also low-resource but it is related to the child language pair. They improved the previous approach by using a shared vocabulary of subword units (BPE, Sennrich et al., 2016b). Additionally, they used transliteration to improve their results.

In this paper, we contribute empirical evidence that transfer learning for NMT can be simplified even further. We leave out the restriction on relatedness of the languages and extend the experiments to parent-child pairs where the target language changes. Moreover, we do not utilize any special modifications to the training regime or data pre-processing.

In contrast to previous work, we test the method with the Transformer model (Vaswani et al., 2017) instead of the recurrent approaches (Bahdanau et al., 2014). As documented e.g. in Popel and Bojar (2018) and anticipated in WMT18, the Transformer model seems superior to other NMT approaches.

Method Description
The proposed method is extremely simple: we train the parent language pair for a number of iterations and then switch the training corpus to the child language pair for the rest of the training, without resetting any of the training (hyper)parameters. As such, this method is similar to the transfer learning proposed by Zoph et al. (2016) but uses the shared vocabulary as in Nguyen and Chiang (2017). The novelty is that we remove the restriction on relatedness of the language pairs, and in contrast to the previous papers, we show that this simple style of transfer learning can be used on both sides (i.e. either the source or the target language), not only with the target language common to both parent and child model. In fact, the method is effective also for fully unrelated language pairs.

Our method does not need any modification of existing NMT frameworks. The only requirement is to use a shared vocabulary of subword units (we use wordpieces, Johnson et al., 2017) across both language pairs. This is achieved by learning the wordpiece segmentation from the concatenated source and target sides of both the parent and child language pairs. All other parameters of the model stay the same as for standard NMT training.

During the training, we first train the NMT model for the high-resource language pair until convergence. This model is called the “parent”. After that, we train the “child” model without any restart, i.e. only by changing the training corpora to the low-resource language pair.
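To make the procedure concrete, the following is a minimal sketch of the corpus-switching idea, assuming a generic trainer interface. The NMTTrainer class, file names, and step counts are hypothetical stand-ins (the paper uses Tensor2Tensor); only the overall two-phase procedure follows the text above.

```python
class NMTTrainer:
    """Hypothetical stand-in for an NMT framework: it keeps model weights,
    optimizer state and the learning-rate schedule across calls to train()."""

    def __init__(self, shared_vocab):
        self.vocab = shared_vocab   # one subword vocabulary for parent and child
        self.steps_done = 0

    def train(self, corpus, num_steps):
        # A real framework would run `num_steps` updates on batches drawn from
        # `corpus`, continuing the optimizer state and learning-rate schedule.
        self.steps_done += num_steps


def trivial_transfer(trainer, parent_corpus, child_corpus, parent_steps, child_steps):
    # Phase 1: train the "parent" model on the high-resource pair until convergence.
    trainer.train(parent_corpus, parent_steps)
    # Phase 2: switch ONLY the training corpus; nothing is reset, so the child
    # continues from the parent's weights, optimizer state and learning rate.
    trainer.train(child_corpus, child_steps)
    return trainer


trainer = NMTTrainer(shared_vocab="shared_subwords.vocab")
trivial_transfer(trainer, "fi-en.train", "et-en.train",
                 parent_steps=800_000, child_steps=200_000)
```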
Current NMT systems use vocabularies of subword units instead of whole words. Using subword units gives a balance between the flexibility of separate characters and the efficiency of whole words. It solves the problem of out-of-vocabulary words and reduces the vocabulary size. The majority of NMT systems use either byte pair encoding (Sennrich et al., 2016b) or wordpieces (Wu et al., 2016). Given a training corpus and the desired maximal vocabulary size, either method produces deterministic rules for word segmentation that achieve the fewest possible splits.

Our method requires the vocabulary to be shared across both the parent model (translating from language XX to YY) and the child model (translating from AA to BB). This is obtained by concatenating both training corpora into one corpus of sentences in languages AA, BB, XX and YY.

Due to our focus on low-resource language pairs, we decided to generate the vocabulary in a balanced way by selecting the same amount of sentences from both language pairs. We thus use the same number of sentence pairs of the parent corpus as there are in the child corpus. We did not experiment with any other balancing of the vocabulary. Future research could also investigate the impact of using only the child corpus for vocabulary generation or of various amounts of used sentences.

We generated vocabularies aiming at 32k subword types. The exact size of the vocabulary varies from 26.1k to 34.8k. All experiments of a given language set use the same vocabulary. Vocabulary overlap in each language set is further studied in Section 6.1.
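As an illustration, the balanced shared vocabulary could be built roughly as follows. The sketch uses SentencePiece as a stand-in for the wordpiece segmentation actually used in the paper, and the file names are hypothetical.

```python
import random
import sentencepiece as spm


def read_pairs(src_path, tgt_path):
    """Read a parallel corpus as a list of (source line, target line) pairs."""
    with open(src_path, encoding="utf-8") as s, open(tgt_path, encoding="utf-8") as t:
        return list(zip(s, t))


child = read_pairs("et-en.et", "et-en.en")     # low-resource child pair (AA-BB)
parent = read_pairs("fi-en.fi", "fi-en.en")    # high-resource parent pair (XX-YY)

# Balance: use only as many parent sentence pairs as the child corpus has.
parent_sample = random.sample(parent, k=min(len(child), len(parent)))

# Concatenate all four language sides into one plain-text training file.
with open("vocab_input.txt", "w", encoding="utf-8") as out:
    for src, tgt in child + parent_sample:
        out.write(src)
        out.write(tgt)

# One vocabulary of roughly 32k subwords, shared by all four languages and by
# both the encoder and the decoder.
spm.SentencePieceTrainer.train(
    input="vocab_input.txt",
    model_prefix="shared_subwords",
    vocab_size=32000,
)
```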
We use the Transformer sequence-to-sequence model (Vaswani et al., 2017) as implemented in Tensor2Tensor (Vaswani et al., 2018) version 1.4.2. Our models are based on the “big single GPU” configuration as defined in the paper. To fit the model to our GPUs (NVIDIA GeForce GTX 1080 Ti with 11 GB RAM), we set the batch size to 2300 tokens and limit the sentence length to 100 wordpieces.

We use exponential learning rate decay with a starting learning rate of 0.2, 32000 warm-up steps, and the Adam optimizer. In our experiments, we find that it is undesirable to reset the learning rate, as it leads to the loss of the performance obtained from the parent model. Therefore, the transfer learning is handled only by changing the training corpora and nothing else. Note that having separate vocabularies for the parent and child and switching from the XX-YY to the AA-BB vocabulary when we switch the training corpus leads to an expected drop in performance: independent vocabularies use different IDs even for identical subwords and the network cannot rely on any of its weights from the parent training.

Decoding uses a beam size of 8 and the length normalization penalty is set to 1.0.

The models were trained for 1M steps (approx. 140 hours), which was sufficient for the models to converge to their best performance. We selected the model with the best performance on the development set for the final evaluation on the test set.
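For concreteness, the training and decoding setup just described can be collected in one place. The dictionary keys below are descriptive and not necessarily the literal Tensor2Tensor hyperparameter names.

```python
# Summary of the training and decoding configuration described above.
# Key names are descriptive, not literal Tensor2Tensor flags.
TRAINING_CONFIG = {
    "model": "transformer",
    "base_configuration": "big single GPU",   # Vaswani et al. (2017)
    "batch_size_tokens": 2300,                # fits an 11 GB GTX 1080 Ti
    "max_sentence_length": 100,               # in wordpieces
    "optimizer": "Adam",
    "learning_rate": 0.2,                     # with exponential decay
    "warmup_steps": 32000,
    "reset_lr_on_transfer": False,            # resetting loses the parent's performance
    "train_steps": 1_000_000,                 # roughly 140 hours in total
    "beam_size": 8,
    "length_normalization_penalty": 1.0,
}
```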
Lang. pair   Sent. pairs   Words (first)   Words (second)   Vocab. (first)   Vocab. (second)
ET,EN        0.8 M         14 M            20 M             631 k            220 k
FI,EN        2.8 M         44 M            64 M             1697 k           545 k
SK,EN        4.3 M         82 M            95 M             1059 k           610 k
RU,EN        12.6 M        297 M           321 M            2202 k           3161 k
CS,EN        40.1 M        491 M           563 M            6253 k           4130 k
AR,RU        10.2 M        243 M           252 M            2299 k           2099 k
FR,RU        10.0 M        295 M           238 M            1339 k           2045 k
ES,FR        10.0 M        297 M           288 M            1426 k           1323 k
ES,RU        10.0 M        300 M           235 M            1433 k           2032 k

Table 1: Dataset size overview. We consider Estonian and Slovak low-resource languages in our paper. Word counts and vocabulary sizes are from the original corpus, tokenizing only at whitespace and preserving the case.
In our experiments, we compare low-resource and high-resource language pairs spanning two orders of magnitude of training data sizes. We consider Estonian (ET) and Slovak (SK) as low-resource languages compared to the Finnish (FI) and Czech (CS) counterparts.

The choice of languages was closely related to the languages in the WMT 2018 shared tasks. In particular, Estonian and Finnish (paired with English) were suggested as the main focus for their relatedness. We added Czech and Slovak as another closely related language pair. Russian (RU) for the parent model was chosen for two reasons: (1) written in Cyrillic, there will be hardly any intersection in the shared vocabulary with the child language pairs, and (2) previous work uses transliteration to handle Russian, which is a nice contrast to our work. Finally, we added Arabic (AR), French (FR) and Spanish (ES) for experiments with unrelated languages.

The sizes of the training datasets are in Table 1. If not specified otherwise, we use training, development and test sets from WMT. Training sentence pairs with less than 4 words or more than 75 words on either the source or the target side are removed to allow for a speedup of the Transformer by capping the maximal length and allowing a bigger batch size (a sketch of this filtering is given at the end of this section). The reduction of training data is small and, based on our experiments, it does not change the performance of the translation model.

We use the Europarl and Rapid corpora for Estonian-English. We disregard Paracrawl due to its noisiness. The development and test sets are from WMT news 2018.

The Finnish-English data was prepared as in Östling et al. (2017), removing Wikipedia headlines. The dev and test sets are from WMT news 2015.

For English-Czech, we use all parallel data allowed in WMT 2018 except Paracrawl. The main resource is CzEng 1.7 (the filtered version, Bojar et al., 2016). The devset is WMT newstest2011 and the testset is WMT newstest2017.

Slovak-English uses corpora from Galuščáková and Bojar (2012), detokenized by Moses (https://github.com/moses-smt/mosesdecoder). WMT newstest2011 serves as the devset and testset.

The Russian-English training set was created from News Commentary, Yandex and the UN Corpus. As the devset, we use WMT newstest 2012.

The language pairs Arabic-Russian, French-Russian, Spanish-French and Spanish-Russian were selected from the UN corpus (Ziemski et al., 2016), which provides over 10 million multi-parallel sentences in 6 languages.
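The length filtering mentioned above could be implemented along the following lines; the file names are hypothetical and whitespace tokenization is assumed.

```python
def length_ok(sentence, lo=4, hi=75):
    """True if the sentence has between `lo` and `hi` whitespace-separated words."""
    n = len(sentence.split())
    return lo <= n <= hi


def filter_corpus(src_in, tgt_in, src_out, tgt_out):
    # Drop a sentence pair if either side is shorter than 4 or longer than 75 words.
    with open(src_in, encoding="utf-8") as fs, open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as gs, open(tgt_out, "w", encoding="utf-8") as gt:
        for src, tgt in zip(fs, ft):
            if length_ok(src) and length_ok(tgt):
                gs.write(src)
                gt.write(tgt)


filter_corpus("train.et", "train.en", "train.filtered.et", "train.filtered.en")
```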
Results

In this section, we present the results of our approach. Statistical significance of the winner (marked with ‡) is tested by paired bootstrap resampling against the baseline (child-only) setup (1000 samples, conf. level 0.05; Koehn, 2004).

As customary, we label the models with the pair of the source and target language codes; for example, the English-to-Estonian translation model is denoted by ENET. The vocabularies are generated as described in Section 2.1 separately for each experimented combination of parent and child. The same vocabulary is used whenever the parent and child use the same set of languages, i.e. disregarding the translation direction and model stage (parent or child).

Table 2 summarizes our results for various combinations of high-resource parent and low-resource child language pairs when English is shared between the child and parent either in the encoder or in the decoder.

Table 2: Transfer learning with English reused either in source (encoder) or target (decoder). The column “Transfer” is our method, the baselines correspond to training on one of the corpora only. Scores (BLEU) are always for the child language pair and they are comparable only within lines or when the child language pair is the same. “Unrelated” language pairs in bold. Upper part: parent larger, lower part: child larger. (“EN” lowercased just to stand out.)

We confirm that sharing the target language improves performance as previously shown (Zoph et al., 2016; Nguyen and Chiang, 2017). This yields gains of up to 2.44 BLEU absolute for ETEN with the FIEN parent. Using only the parent (FIEN) model to translate the child (ETEN) test set gives a miserable performance, confirming the need for transfer learning or “finetuning”.

A novel result is that the method works also for sharing the source language, improving ENET by up to 2.71 BLEU thanks to the ENFI parent.

Furthermore, the improvement is not restricted only to related languages such as Estonian and Finnish as shown in previous works. Unrelated language pairs (shown in bold in Table 2) like Czech and Estonian work too, and in some cases even better than the related datasets. We reach an improvement of 3.38 BLEU for ENET when the parent model was ENCS, compared to an improvement of 2.71 from the ENFI parent. This statistically significant improvement contradicts Dabre et al. (2017), who concluded that the more related the languages are, the better transfer learning works. We see it as an indication that the size of the parent training set is more important than the relatedness of languages.

The results with the Russian parent for the Estonian child (both directions) show that transliteration is also not necessary. Because there is no vocabulary sharing between Russian Cyrillic and Estonian Latin (except numbers and punctuation, see Section 6.1 for further details), the improvement could be attributed to a better coverage of English; an effect similar to domain adaptation.
Child training sents   Transfer BLEU   Baseline BLEU
800 k                  19.74           17.03
400 k                  19.04           14.94
200 k                  17.95           11.96
100 k                  17.61            9.39
 50 k                  15.95            5.74
 10 k                  12.46            1.95
Table 3: Maximal score reached by the ENET child for decreasing sizes of child training data, trained off an ENFI parent (all ENFI data are used and models are trained for 800k steps). The baselines use only the reduced ENET data.

On the other hand, this transfer learning works well only when the parent has more training data than the child. As presented in the bottom part of Table 2, low-resource parents do not generally improve the performance of better-resourced children and sometimes they even (significantly) decrease it. This is another indication that the most important factor is the size of the parent corpus compared to the child one.

The baselines are either models trained purely on the child parallel data or only on the parent data. The second baseline only indicates the relatedness of languages because it is only tested but never trained on the child language pair. Also, we do not add any language tag as in Johnson et al. (2017). This also highlights that the improvement of our method cannot be directly attributed to the relatedness of languages: e.g. Czech and Slovak are much more similar than Czech and Estonian (the parent-only BLEU of translation out of English is 6.51 compared to 1.42), and yet the gain from transfer learning is larger for Estonian (+3.38) than for Slovak (+1.62).
In Table 3, we simulate very low-resource settings by downscaling the data for the child model. It is common knowledge that gains from transfer learning are more pronounced for smaller children. The point of Table 3 is to illustrate that our approach is applicable even to extremely small child setups, with as few as 10k sentence pairs. Our transfer learning (“start with a model for whatever parent pair”) may thus resolve the issue of applicability of NMT for low-resource languages as pointed out by Koehn and Knowles (2017).
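As an aside, the paired bootstrap resampling used for the significance marks throughout this section could be implemented along the following lines; this is only a sketch, with sacrebleu standing in for the BLEU implementation, which the paper does not specify.

```python
import random
import sacrebleu


def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=1):
    """Paired bootstrap resampling (Koehn, 2004): fraction of resampled
    test sets on which system A beats system B in BLEU."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = rng.choices(range(n), k=n)       # resample sentences with replacement
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        r = [refs[i] for i in idx]
        if sacrebleu.corpus_bleu(a, [r]).score > sacrebleu.corpus_bleu(b, [r]).score:
            wins_a += 1
    return wins_a / n_samples

# System A is significantly better than B at the 0.05 level if it wins on at
# least 95% of the resampled test sets.
```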
Figure 1 compares the performance of the child model when trained from various training stages of the parent model. The performance of the child clearly correlates with the performance of the parent. Therefore, it is better to use a parent model that has already converged and reached its best performance.
Figure 1: Learning curves on the dev set for the ENFI parent and the ENET child, where the child model started training after various numbers of the parent’s training steps (50k to 800k). The ENET-only baseline and the (non-comparable) ENFI curve are also shown.
Table 4: Results of the child following a parent with swapped translation direction. “Baseline” is child-only training. “Aligned” is the more natural setup with English appearing on the “correct” side of the parent; the numbers in this column thus correspond to those in Table 2.
Relaxing the setup in Section 5.1, we now allow a mismatch in the translation direction of the parent and child. The parent XX-EN is thus followed by an EN-YY child or vice versa. It is important to note that the Transformer shares word embeddings for the source and target side. The gain can thus be due to better English word embeddings, but definitely not due to a better English language model. It would be interesting to study the effect of not sharing the embeddings, but we leave it for future work.

The results in Table 4 document that an improvement can be reached even when none of the involved languages is reused on the same side.
This interesting result should be studied in more detail. Firat et al. (2016) hinted at possible gains even when both languages are distinct from the low-resource languages, but in a multilingual setting. Not surprisingly, the improvements are better when the common language is aligned.

The bottom part of Table 4 shows a particularly interesting trick: the parent is not any high-resource pair but the very same EN-ET corpus with source and target swapped. We see gains in both directions, although not always statistically significant. Future work should investigate whether this performance boost is possible even for high-resource languages. Similar behavior has been shown by Niu et al. (2018), where, in contrast to our work, they mixed the data together and added an artificial token indicating the target language.

Table 5: Transfer learning with parent and child not sharing any language.
Our final set of experiments examines the performance of the ETEN child trained off parents in totally unrelated language pairs. Without any common language, the gains cannot be attributed, e.g., to the shared English word embeddings. The vocabulary overlap is mostly due to short n-grams, numbers and punctuation.

We see gains from transfer learning in all cases, mostly significant. The only non-significant gain is from Arabic-Russian, which does not share the script with the child’s Latin alphabet at all. (Sharing of punctuation and numbers is possible across all the tested scripts.) The gains are quite similar (+0.49 to +0.78 BLEU), supporting our assumption that the main factor is the size of the parent (here, all have 10M sentence pairs) rather than language relatedness.
Analysis

Here we provide an initial analysis of the sources of the gains.
ET   EN   RU   % Subwords
X    -    -    29.93
-    X    -    20.69
-    -    X    29.04
X    X    -    10.06
X    -    X     0.00
-    X    X     1.39
X    X    X     8.89
From parent    41.03

Table 6: Breakdown of the subword vocabulary of experiments involving ET, EN and RU.
Our method relies on the vocabulary estimated jointly from the child and parent corpora. In the Transformer, the vocabulary is even shared across the encoder and decoder. With a large overlap, we could expect a lot of “information reuse” between the parent and the child.

Since the subword vocabulary depends on the training corpora, a little clarification is needed. We take the vocabulary of subword units as created e.g. for the ENRU-ENET experiments, see Section 2.1. This vocabulary contains 28.2k subwords in total. We then process the training corpora for each of the languages with this shared vocabulary, ignore all subwords that appear less than 10 times in each of the languages (these subwords will have little to no impact on the result of the training), and break down the total 28.2k subwords into classes depending on the languages in which the particular subword was observed, see Table 6.

We see that the vocabulary is reasonably balanced, with each language having 20-30% of subwords unique to it. English and Estonian share 10% of subwords not seen in Russian, while Russian shares only 0-1.39% of subwords with each of the other languages. Overall, 8.89% of subwords are seen in all three languages.

A particularly interesting subset is the one where the parent languages help the child model, in other words subwords appearing anywhere in English and also tokens common to Estonian and Russian. For this set of languages, this amounts to 20.69+10.06+1.39+0.0+8.89 = 41.03%. We list this number on a separate line in Table 6, “From parent”. These subwords get their embeddings trained better thanks to the parent model.

Table 7 summarizes this analysis for several language sets, listing what portion of subwords is unique to individual languages in the set, what portion is shared by all the languages, and what portion of subwords benefits from the parent training.
Languages      Unique in a Lang.     In All   From Parent
ET-EN-FI       24.4-18.2-26.2        19.5     49.4
ET-EN-RU       29.9-20.7-29.0         8.9     41.0
ET-EN-CS       29.6-17.5-21.2        20.3     49.2
AR-RU-ET-EN    28.6-27.7-21.2-9.1     4.6      6.2
ES-FR-ET-EN    15.7-13.0-24.8-8.8    18.4     34.1
ES-RU-ET-EN    14.7-31.1-21.3-9.3     6.0     21.4
FR-RU-ET-EN    12.3-32.0-22.3-8.1     6.3     23.1
Table 7: Summary of vocabulary overlaps for the various language sets. All figures in % of the shared vocabulary.

We see a similar picture across the board; only AR-RU-ET-EN stands out with the very low number of subwords (6.2%) available already in the parent. The parent AR-RU thus offered very little word knowledge to the child and yet led to a gain in BLEU.
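For illustration, the overlap statistics above could be computed roughly as follows. The sketch reuses the SentencePiece model from the earlier sketch as a stand-in for the shared wordpiece vocabulary; the monolingual file names are hypothetical, and applying the 10-occurrence threshold per language is an assumption.

```python
from collections import Counter

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="shared_subwords.model")


def observed_subwords(path, min_count=10):
    """Subwords observed at least `min_count` times in the given corpus."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(sp.encode(line, out_type=str))
    return {sw for sw, c in counts.items() if c >= min_count}


langs = {"ET": "mono.et", "EN": "mono.en", "RU": "mono.ru"}
seen = {lang: observed_subwords(path) for lang, path in langs.items()}

# Break the vocabulary down by the exact set of languages a subword occurs in.
breakdown = Counter()
for sw in set().union(*seen.values()):
    key = tuple(sorted(lang for lang in langs if sw in seen[lang]))
    breakdown[key] += 1

total = sum(breakdown.values())
for key, count in sorted(breakdown.items()):
    print(",".join(key), f"{100 * count / total:.2f}%")
```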
             BLEU    nPER    nTER    nCDER   chrF3   nCharacTER
Base ENET    16.13   47.13   32.45   36.41   48.38   33.23
ENRU+ENET    19.10   50.87   36.10   39.77   52.12   39.39
ENCS+ENET    19.30   51.51   36.84   40.42   52.71   40.81
Table 8: Various automatic scores on the ENET test set. Scores prefixed with “n” are reported as (1 − score) to make higher numbers better.

Since we rely on automatic analysis, we need to prevent some potential overestimations of translation quality due to BLEU. For this, we took a closer look at the baseline ENET model (BLEU of 17.03 in Table 2) and two ENET children derived from the ENCS (BLEU of 20.41) and ENRU (BLEU of 20.09) parents. Table 8 confirms that the improvements are not an artifact of uncased BLEU. The gains are apparent with several (now cased) automatic scores.

             Length   BLEU components       BP
Base ENET    35326    48.1/21.3/11.3/6.4    0.979
ENRU+ENET    35979    51.0/24.2/13.5/8.0    0.998
ENCS+ENET    35921    51.7/24.6/13.7/8.1    0.996

Table 9: Candidate total length, BLEU n-gram precisions and brevity penalty (BP). The reference length in the matching tokenization was 36062.

As documented in Table 9, the improved outputs are considerably longer. In the table, we also show the individual n-gram precisions and the brevity penalty (BP) of BLEU. The longer output clearly helps to reduce the incurred BP, but the improvements are also apparent in the n-gram precisions. In other words, the observed gain cannot be attributed solely to producing longer outputs.

Table 10 explains the gains in unigram precisions by checking which tokens in the improved outputs (the parent followed by the child) were present also in the baseline (child-only, denoted “b” in Table 10) and/or confirmed by the reference (denoted “r”).

         ENRU+ENET         ENCS+ENET
rb       15902 (44.2 %)    15924 (44.3 %)
-         9635 (26.8 %)     9485 (26.4 %)
b         7209 (20.0 %)     7034 (19.6 %)
r         3233 (9.0 %)      3478 (9.7 %)
Total    35979 (100.0 %)   35921 (100.0 %)
Table 10: Comparison of the improved outputs vs. the baseline and the reference.

We see that about 44+20% of the tokens of the improved outputs can be seen as “unchanged” compared to the baseline because they appear already in the baseline output (“b”). (The 44% of “rb” tokens are additionally confirmed by the reference.)

The differing tokens are more interesting: “-” denotes the cases when the improved system produced something different from the baseline and also from the reference. Gains in BLEU are due to “r” tokens, i.e. tokens only in the improved outputs and the reference but not in the baseline. For both parent setups, there are about 9-9.7% of such tokens. We looked at these 3.2k and 3.5k tokens and we have to conclude that these are regular
Estonian words; no Czech or Russian leaks to the output and the gains are not due to simple token types common to all the languages (punctuation, numbers or named entities). We see identical BLEU gains even if we remove all such simple tokens from the candidates and references. A better explanation of the gains thus still has to be sought.
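For completeness, the token breakdown of Table 10 could be computed roughly as follows; the clipped multiset matching via Counter intersections is an assumption, as the exact matching procedure is not spelled out above.

```python
from collections import Counter


def breakdown(improved_tokens, baseline_tokens, reference_tokens):
    """Classify tokens of the improved output by whether they also occur in the
    baseline output ("b") and/or in the reference ("r")."""
    improved = Counter(improved_tokens)
    in_base = improved & Counter(baseline_tokens)   # also produced by the baseline
    in_ref = improved & Counter(reference_tokens)   # confirmed by the reference
    rb = sum((in_base & in_ref).values())           # in baseline and reference
    b = sum(in_base.values()) - rb                  # in baseline only
    r = sum(in_ref.values()) - rb                   # in reference only (drives BLEU gains)
    neither = sum(improved.values()) - rb - b - r   # in neither ("-")
    return {"rb": rb, "b": b, "r": r, "-": neither}


print(breakdown("the cat sat on the mat".split(),
                "the cat is on a mat".split(),
                "the cat sat on the mat".split()))
```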
Related Work

Firat et al. (2016) propose multi-way multi-lingual systems, with the main goal of reducing the total number of parameters needed to cater for multiple source and target languages. To keep all the language pairs “active” in the model, a special training schedule is needed; otherwise, catastrophic forgetting would remove the ability to translate among the languages trained earlier.

Johnson et al. (2017) is another multi-lingual approach: all translation pairs are simply used at once and the desired target language is indicated with a special token at the end of the source side. The model implicitly learns translation between many languages and it can even translate among language pairs never seen together.

Lack of parallel data can be tackled by unsupervised translation (Artetxe et al., 2018; Lample et al., 2018). The general idea is to mix monolingual training of autoencoders for the source and target languages with translation trained on data translated by the previous iteration of the system.

When no parallel data are available, the training set of a closely related high-resource pair can be used with a transliteration approach as described in Karakanta et al. (2018).

Aside from the common back-translation (Sennrich et al., 2016a; Kocmi et al., 2018), simple copying of target monolingual data back to the source (Currey et al., 2017) has also been shown to improve translation quality in low-data conditions.

Similar to transfer learning is also curriculum learning (Bengio et al., 2009; Kocmi and Bojar, 2017), where the training data are ordered from foreign out-of-domain to in-domain training examples.
Conclusion

We presented a simple method for transfer learning in neural machine translation: training a parent model on a high-resource language pair, followed by continued training on a low-resource language pair. The method works for a shared source or target side as well as for language pairs that do not share any of the translation sides. We observe gains also from totally unrelated language pairs, although not always significant.

One interesting trick we propose for low-resource languages is to start training in the opposite direction and swap to the main one afterwards.

The reasons for the gains are yet to be explained in detail, but our observations indicate that the key factor is the size of the parent corpus rather than e.g. vocabulary overlaps.
Acknowledgments
This study was supported in parts by the grants SVV 260 453, GAUK 8502/2016, and 18-24210S of the Czech Science Foundation. This work has been using language resources and tools stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (projects LM2015071 and OP VVV VI CZ.02.1.01/0.0/0.0/16 013/0001781).
References
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of the Sixth International Conference on Learning Representations.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48. ACM.
Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English parallel corpus with processing tools Dockered. In International Conference on Text, Speech, and Dialogue, pages 231-238. Springer.
Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 148-156.
Raj Dabre, Tetsuji Nakagawa, and Hideto Kazawa. 2017. An empirical study of language relatedness for transfer learning in neural machine translation. In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pages 282-286.
Ahmed El Kholy, Nizar Habash, Gregor Leusch, Evgeny Matusov, and Hassan Sawaf. 2013. Language independent connectivity strength features for phrase pivot statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412-418, Sofia, Bulgaria. Association for Computational Linguistics.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866-875, San Diego, California. Association for Computational Linguistics.
Petra Galuščáková and Ondřej Bojar. 2012. Improving SMT by using parallel data of a closely related language. In Proc. of HLT, pages 58-65.
Duc Tam Hoang and Ondřej Bojar. 2016. Pivoting methods and data for Czech-Vietnamese translation via English. Baltic Journal of Modern Computing, 4(2):190-202.
Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351.
Alina Karakanta, Jon Dehdari, and Josef van Genabith. 2018. Neural machine translation for low-resource languages without parallel corpora. Machine Translation, 32(1):167-189.
Tom Kocmi and Ondřej Bojar. 2017. Curriculum Learning and Minibatch Bucketing in Neural Machine Translation. In Recent Advances in Natural Language Processing 2017.
Tom Kocmi, Roman Sudarikov, and Ondřej Bojar. 2018. CUNI Submissions in WMT18. In Proceedings of the 3rd Conference on Machine Translation (WMT), Brussels, Belgium.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, volume 4, pages 388-395.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28-39, Vancouver. Association for Computational Linguistics.
Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Preslav Nakov and Hwee Tou Ng. 2012. Improving statistical machine translation for a resource-poor language using related resource-rich languages. Journal of Artificial Intelligence Research, 44:179-222.
Toan Q. Nguyen and David Chiang. 2017. Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 296-301. Asian Federation of Natural Language Processing.
Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural machine translation with synthetic parallel data. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 84-91, Melbourne, Australia. Association for Computational Linguistics.
Robert Östling, Yves Scherrer, Jörg Tiedemann, Gongbo Tang, and Tommi Nieminen. 2017. The Helsinki neural machine translation system. In Proceedings of the Second Conference on Machine Translation, pages 338-347, Copenhagen, Denmark. Association for Computational Linguistics.
Martin Popel and Ondřej Bojar. 2018. Training Tips for the Transformer Model. The Prague Bulletin of Mathematical Linguistics, 110(1):43-70.
Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 389-399, Copenhagen, Denmark. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193-199, Boston, MA. Association for Machine Translation in the Americas.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6000-6010. Curran Associates, Inc.
Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 856-863, Prague, Czech Republic. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In LREC.
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Association for Computational Linguistics.