Edinburgh Neural Machine Translation Systems for WMT 16
Rico Sennrich, Barry Haddow and Alexandra Birch
School of Informatics, University of Edinburgh
{rico.sennrich,a.birch}@ed.ac.uk, [email protected]

Abstract
We participated in the WMT 2016 shared news translation task by building neural translation systems for four language pairs, each trained in both directions: English↔Czech, English↔German, English↔Romanian and English↔Russian. Our systems are based on an attentional encoder-decoder, using BPE subword segmentation for open-vocabulary translation with a fixed vocabulary. We experimented with using automatic back-translations of the monolingual News corpus as additional training data, pervasive dropout, and target-bidirectional models. All reported methods give substantial improvements, and we see improvements of 4.3–11.2 BLEU over our baseline systems. In the human evaluation, our systems were the (tied) best constrained system for 7 out of 8 translation directions in which we participated.

Introduction

We participated in the WMT 2016 shared news translation task by building neural translation systems for four language pairs: English↔Czech, English↔German, English↔Romanian and English↔Russian. Our systems are based on an attentional encoder-decoder (Bahdanau et al., 2015), using BPE subword segmentation for open-vocabulary translation with a fixed vocabulary (Sennrich et al., 2016b). We experimented with using automatic back-translations of the monolingual News corpus as additional training data (Sennrich et al., 2016a), pervasive dropout (Gal, 2015), and target-bidirectional models.

We have released the implementation that we used for the experiments as an open source toolkit: https://github.com/rsennrich/nematus. We have also released scripts, sample configs, synthetic training data and trained models: https://github.com/rsennrich/wmt16-scripts.
Baseline System

Our systems are attentional encoder-decoder networks (Bahdanau et al., 2015). We base our implementation on the dl4mt-tutorial (https://github.com/nyu-dl/dl4mt-tutorial), which we enhanced with new features such as ensemble decoding and pervasive dropout.

We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10 000 minibatches via BLEU on a validation set (newstest2013, newstest2014, or half of newsdev2016 for EN↔RO). We perform early stopping for single models, and use the 4 last saved models (with models saved every 30 000 minibatches) for the ensemble results. Note that ensemble scores are the result of a single training run. Due to resource limitations, we did not train ensemble components independently, which could result in more diverse models and better ensembles.

Decoding is performed with beam search with a beam size of 12. For some language pairs, we used the AmuNMT C++ decoder (https://github.com/emjotde/amunmt) as a more efficient alternative to the theano implementation of the dl4mt-tutorial.

Byte-pair Encoding

To enable open-vocabulary translation, we segment words via byte-pair encoding (BPE) (Sennrich et al., 2016b); our implementation is available at https://github.com/rsennrich/subword-nmt. BPE, originally devised as a compression algorithm (Gage, 1994), is adapted to word segmentation as follows:

First, each word in the training vocabulary is represented as a sequence of characters, plus an end-of-word symbol. All characters are added to the symbol vocabulary. Then, the most frequent symbol pair is identified, and all its occurrences are merged, producing a new symbol that is added to the vocabulary. The previous step is repeated until a set number of merge operations have been learned.

BPE starts from a character-level segmentation, but as we increase the number of merge operations, it becomes more and more different from a pure character-level model in that frequent character sequences, and even full words, are encoded as a single symbol. This allows for a trade-off between the size of the model vocabulary and the length of training sequences. The ordered list of merge operations, learned on the training set, can be applied to any text to segment words into subword units that are in-vocabulary with respect to the training set (except for unseen characters).

To increase consistency in the segmentation of the source and target text, we combine the source and target side of the training set for learning BPE. For each language pair, we learn 89 500 merge operations.
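The merge-learning loop is compact enough to sketch here. The following Python fragment is a minimal sketch of the procedure described above on a toy vocabulary, following the algorithm of Sennrich et al. (2016b); the subword-nmt repository linked above contains the full implementation we actually use.

import re
import collections

def get_stats(vocab):
    # count the frequency of each adjacent symbol pair
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # merge all occurrences of the given pair into a single new symbol
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# toy vocabulary: words as space-separated symbols plus end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # 89 500 merges are learned on the real training data
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # one learned merge operation per iteration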
Synthetic Training Data

WMT provides task participants with large amounts of monolingual data, both in-domain and out-of-domain. We exploit this monolingual data for training as described in (Sennrich et al., 2016a). Specifically, we sample a subset of the available target-side monolingual corpora, translate it automatically into the source side of the respective language pair, and then use this synthetic parallel data for training. For example, for EN→RO, the back-translation is performed with a RO→EN system, and vice-versa.

Sennrich et al. (2016a) motivate the use of monolingual data with domain adaptation, reducing overfitting, and better modelling of fluency.

We sample monolingual data from the News Crawl corpora, which is in-domain with respect to the test set. Due to recency effects, we expect last year's corpus to be most relevant, and sampled from News Crawl 2015 for EN-RO, EN-RU and EN-CS; for EN-DE, we re-used data from (Sennrich et al., 2016a), which was randomly sampled from News Crawl 2007–2014.

The amount of monolingual data back-translated for each translation direction ranges from 2 million to 10 million sentences. Statistics about the amount of parallel and synthetic training data are shown in Table 1. With dl4mt, we observed a translation speed of about 200 000 sentences per day (on a single Titan X GPU).

type               DE    CS    RO    RU
parallel           4.2   52.0  0.6   2.1
synthetic (∗→EN)   4.2   10.0  2.0   2.0
synthetic (EN→∗)   3.6   8.2   2.3   2.0

Table 1: Amount of parallel and synthetic training data (number of sentences, in millions) for EN-* language pairs. For synthetic data, we separate the data according to whether the original monolingual language is English or not.
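As a concrete illustration of the pipeline, the sketch below builds a synthetic parallel corpus from target-side monolingual text. This is a minimal sketch, not our released scripts: translate_t2s is a hypothetical stand-in for a trained target-to-source NMT system.

import random

def build_synthetic_corpus(mono_target, translate_t2s, n_samples):
    # sample target-side monolingual sentences and back-translate them;
    # translate_t2s stands in for a trained target-to-source system
    # (e.g. a RO->EN model when building EN->RO training data)
    sampled = random.sample(mono_target, n_samples)
    synthetic_source = [translate_t2s(sentence) for sentence in sampled]
    # the machine-translated text becomes the source side; the
    # human-written monolingual text stays on the target side
    return list(zip(synthetic_source, sampled))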
Pervasive Dropout

For English↔Romanian, we observed poor performance because of overfitting. To mitigate this, we apply dropout to all layers in the network, including recurrent ones.

Previous work dropped out different units at each time step. When applied to recurrent connections, this has the downside that it impedes the information flow over long distances, and Pham et al. (2014) propose to only apply dropout to non-recurrent connections.

Instead, we follow the approach suggested by Gal (2015), and use the same dropout mask at each time step. Our implementation differs from the recommendations by Gal (2015) in one respect: we also drop words at random, but we do so on a token level, not on a type level. In other words, if a word occurs multiple times in a sentence, we may drop out any number of its occurrences, and not just none or all.

In our English↔Romanian experiments, we drop out full words (both on the source and target side) with a probability of 0.1. For all other layers, the dropout probability is set to 0.2.
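A minimal numpy sketch of the two ideas, a per-sentence mask reused across time steps and token-level word dropout, is given below. The mask shapes and the placeholder token are illustrative assumptions; in the actual system, dropout is implemented inside the theano computation graph.

import numpy as np

rng = np.random.default_rng(1)

def recurrent_dropout_mask(batch_size, dim, p):
    # one mask per sentence, reused at every time step (Gal, 2015),
    # scaled for inverted dropout; a fresh mask per step would impede
    # information flow over long distances
    return rng.binomial(1, 1.0 - p, size=(batch_size, dim)) / (1.0 - p)

def drop_words(tokens, p=0.1):
    # token-level word dropout: each occurrence is dropped independently,
    # so a repeated word may lose any number of its occurrences
    # (a real system would zero the word's embedding rather than
    # substitute a marker token)
    return [t if rng.random() > p else '<dropped>' for t in tokens]

# inside the recurrent loop, the SAME masks are applied at every step:
#   h_t = rnn_step(x_t * input_mask, h_prev * recurrent_mask)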
Target-bidirectional Translation

We found that during decoding, the model would occasionally assign a high probability to words based on the target context alone, ignoring the source sentence. We speculate that this is an instance of the label bias problem (Lafferty et al., 2001).

To mitigate this problem, we experiment with training separate models that produce the target text from right-to-left (r2l), and re-scoring the n-best lists that are produced by the main (left-to-right) models with these r2l models. Since the right-to-left model will see a complementary target context at each time step, we expect that the averaged probabilities will be more robust. In parallel to our experiments, this idea was published by Liu et al. (2016). We increase the size of the n-best list to 50 for the reranking experiments.

A possible criticism of the l-r/r-l reranking approach is that the gains actually come from adding diversity to the ensemble, since we are now using two independent runs. However, experiments in (Liu et al., 2016) show that an l-r/r-l reranking system is stronger than an ensemble created from two independent l-r runs.
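Conceptually, the reranking step reduces to a few lines. In this sketch, score_l2r and score_r2l are hypothetical stand-ins for the (ensembled) left-to-right and right-to-left models, each returning a log-probability for a tokenized hypothesis.

def rerank_nbest(nbest, score_l2r, score_r2l):
    # rescore an n-best list (here, 50 hypotheses) with a right-to-left
    # model and average the two log-probabilities with uniform weights
    def combined(hypothesis):
        return 0.5 * score_l2r(hypothesis) + \
               0.5 * score_r2l(list(reversed(hypothesis)))
    return max(nbest, key=combined)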
English↔German

Table 2 shows results for English↔German. We observe improvements of 3.4–5.7 BLEU from training with a mix of parallel and synthetic data, compared to the baseline that is only trained on parallel data. Using an ensemble of the last 4 checkpoints gives further improvements (1.3–1.7 BLEU). Our submitted system includes reranking of the 50-best output of the left-to-right model with a right-to-left model, again an ensemble of the last 4 checkpoints, with uniform weights. This yields an improvement of 0.6–1.1 BLEU.

system          EN→DE        DE→EN
                dev   test   dev   test
baseline        22.4  26.8   26.4  28.5
+synthetic      25.8  31.6   29.9  36.2
+ensemble       27.5  33.1   31.5  37.5
+r2l reranking

Table 2: English↔German translation results (BLEU) on dev (newstest2015) and test (newstest2016). Submitted system in bold.

English↔Czech
For English→Czech, we trained our baseline model on the complete WMT16 parallel training set (including CzEng 1.6pre (Bojar et al., 2016)), until we observed convergence on our heldout set (newstest2014). This took approximately 1M minibatches, or 3 weeks. Then we continued training the model on a new parallel corpus, comprising 8.2M sentences back-translated from the Czech monolingual news2015, 5 copies of news-commentary v11, and 9M sentences sampled from CzEng 1.6pre. The model used for back-translation was a neural MT model from earlier experiments, trained on WMT15 data. The training on this synthetic mix continued for a further 400,000 minibatches.

The right-left model was trained using a similar process, but with the target side of the parallel corpus reversed prior to training. The resulting model had a slightly lower BLEU score on the dev data than the standard left-right model. We can see in Table 3 that back-translation improves performance by 2.2–2.8 BLEU, and that the final system (+r2l reranking) improves by 0.7–1.0 BLEU on the ensemble of 4, and 4.3–4.9 on the baseline.

For Czech→English the training process was similar to the above, except that we created the synthetic training data (back-translated from samples of news2015 monolingual English) in batches of 2.5M, and so were able to observe the effect of increasing the amount of synthetic data. After training a baseline model on all the WMT16 parallel set, we continued training with a parallel corpus consisting of 2 copies of the 2.5M sentences of back-translated data, 5 copies of news-commentary v11, and a matching quantity of data sampled from CzEng 1.6pre. After training this to convergence, we restarted training from the baseline model using 5M sentences of back-translated data, 5 copies of news-commentary v11, and a matching quantity of data sampled from CzEng 1.6pre. We repeated this with 7.5M sentences from news2015 monolingual, and then with 10M sentences of news2015. The back-translations were, as for English→Czech, created with an earlier NMT model trained on WMT15 data. Our final Czech→English system was an ensemble of 8 systems: the last 4 save-points of the 10M synthetic data run, and the last 4 save-points of the 7.5M run. We show this as ensemble8 in Table 3, and the +synthetic results are on the last (i.e. 10M) synthetic data run.

system          EN→CS        CS→EN
                dev   test   dev   test
baseline        18.5  20.9   23.8  25.3
+synthetic      20.7  23.7   27.2  30.1
+ensemble       22.1  24.8   28.6  31.0
+ensemble8      –     –
+r2l reranking  22.8  25.8   –     –

Table 3: English↔Czech translation results (BLEU) on dev (newstest2015) and test (newstest2016). Submitted system in bold.

We also show in Table 4 how increasing the amount of back-translated data affects the results. We see that most of the gain from back-translation comes with the first batch, but increasing the amount of back-translated data does gradually improve performance.

system           best single   ensemble4
                 dev   test    dev   test
baseline         23.8  25.3    25.5  26.8
+2.5M synthetic  26.7  29.4    27.7  30.4
+5M synthetic    27.2  29.3    28.2  30.4
+7.5M synthetic  27.2  29.7    28.4  30.8
+10M synthetic   27.2  30.1    28.6  31.0

Table 4: Czech→English translation results (BLEU) on dev (newstest2015) and test (newstest2016), after continued training with increasing amounts of back-translated synthetic data. For each row, training was continued from the baseline model until convergence.
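The data mix for the continued-training stages can be sketched as follows. build_training_mix is a hypothetical helper, the corpora are lists of sentence pairs, and the reading of "matching quantity" in the code is an assumption.

import random

def build_training_mix(back_translated, news_commentary, czeng_pool):
    # oversample the small in-domain corpora: 2 copies of the
    # back-translated data and 5 copies of news-commentary v11
    mix = 2 * back_translated + 5 * news_commentary
    # "matching quantity" is interpreted here as matching the size of
    # the rest of the mix; the exact recipe may differ
    mix += random.sample(czeng_pool, min(len(mix), len(czeng_pool)))
    random.shuffle(mix)  # the corpus is reshuffled between epochs
    return mix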
English↔Romanian

The results of our English↔Romanian experiments are shown in Table 5. This language pair has the smallest amount of parallel training data, and we found dropout to be very effective, yielding improvements of 4–5 BLEU. (We also tested dropout for EN→DE with 8 million sentence pairs of training data, but found no improvement after 10 days of training. We speculate that dropout could still be helpful for datasets of this size with longer training times and/or larger networks.)

We found that the use of diacritics was inconsistent in the Romanian training (and development) data, so for Romanian→English we removed diacritics from the Romanian source side, obtaining improvements of 1.3–1.4 BLEU.

Synthetic training data gives improvements of 4.1–5.1 BLEU. For English→Romanian, we found that the best single system outperformed the ensemble of the last 4 checkpoints on dev, and we thus submitted the best single system as primary system.

system              EN→RO        RO→EN
                    dev   test   dev   test
baseline            20.2  19.2   23.6  22.7
+dropout            24.2  23.9   28.7  27.8
+remove diacritics  -     -      30.0  29.2
+synthetic

Table 5: English↔Romanian translation results (BLEU) on dev (newsdev2016), and test (newstest2016). Submitted system in bold.
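The diacritic removal described above amounts to a small character mapping. The table below is an illustrative sketch (covering both the comma-below and the legacy cedilla code points that occur in Romanian corpora) and may differ in detail from our released preprocessing scripts.

# illustrative mapping for Romanian diacritics, including the legacy
# cedilla variants (ş, ţ) that appear alongside the comma-below forms
DIACRITIC_MAP = str.maketrans({
    'ă': 'a', 'â': 'a', 'î': 'i', 'ș': 's', 'ş': 's', 'ț': 't', 'ţ': 't',
    'Ă': 'A', 'Â': 'A', 'Î': 'I', 'Ș': 'S', 'Ş': 'S', 'Ț': 'T', 'Ţ': 'T',
})

def remove_diacritics(text):
    # normalize the Romanian source side, since diacritic usage is
    # inconsistent in the training and development data
    return text.translate(DIACRITIC_MAP)

print(remove_diacritics('știință'))  # -> 'stiinta'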
English↔Russian

For English↔Russian, we cannot effectively learn BPE on the joint vocabulary because the alphabets differ. We thus follow the approach described in (Sennrich et al., 2016b), first mapping the Russian text into Latin characters via ISO-9 transliteration, then learning the BPE operations on the concatenation of the English and latinized Russian training data, then mapping the BPE operations back into the Cyrillic alphabet. We apply the Latin BPE operations to the English data (training data and input), and both the Cyrillic and Latin BPE operations to the Russian data.

Translation results are shown in Table 6. As for the other language pairs, we observe strong improvements from synthetic training data (4–4.4 BLEU). Ensembles yield another 1.1–1.7 BLEU.

system      EN→RU        RU→EN
            dev   test   dev   test
baseline    21.3  20.3   22.7  22.5
+synthetic  25.8  24.3   27.1  26.9
+ensemble

Table 6: English↔Russian translation results (BLEU) on dev (newstest2015) and test (newstest2016). Submitted system in bold.
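The transliteration round trip can be sketched as follows. The ISO9 table below is deliberately truncated to a few one-to-one entries (the full ISO 9 mapping covers the whole alphabet and is invertible, which is what makes the round trip possible), and learn_bpe stands in for the merge-learning procedure sketched earlier.

# illustrative subset of the ISO 9 Cyrillic-to-Latin table; the real
# standard covers the full alphabet with a one-to-one mapping
ISO9 = {'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
        'к': 'k', 'м': 'm', 'н': 'n', 'о': 'o', 'р': 'r', 'т': 't'}
ISO9_INV = {latin: cyr for cyr, latin in ISO9.items()}

def to_latin(text):
    return ''.join(ISO9.get(c, c) for c in text)

def to_cyrillic(text):
    return ''.join(ISO9_INV.get(c, c) for c in text)

def learn_russian_bpe(english_corpus, russian_corpus, learn_bpe):
    # 1) latinize the Russian text, 2) learn joint BPE on English plus
    # latinized Russian, 3) map each merge operation back into Cyrillic,
    # so that both a Latin and a Cyrillic version of every operation is
    # available for segmenting the Russian data
    latin_ops = learn_bpe(english_corpus +
                          [to_latin(s) for s in russian_corpus])
    cyrillic_ops = [(to_cyrillic(a), to_cyrillic(b)) for a, b in latin_ops]
    return latin_ops, cyrillic_ops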
Shared Task Results

Table 7 shows the ranking of our submitted systems at the WMT16 shared news translation task. Our submissions are ranked (tied) first for 5 out of 8 translation directions in which we participated: EN↔CS, EN↔DE, and EN→RO. They are also the (tied) best constrained system for EN→RU and RO→EN, or 7 out of 8 translation directions in total.

Our models are also used in QT21-HimL-SysComb (Peter et al., 2016), ranked 1–2 for EN→RO, and in AMU-UEDIN (Junczys-Dowmunt et al., 2016), ranked 2–3 for EN→RU, and 1–2 for RU→EN.

direction  BLEU rank  human rank
EN→CS      1 of 9     1 of 20
EN→DE      1 of 11    1 of 15
EN→RO      2 of 10    1–2 of 12
EN→RU      1 of 8     2–5 of 12
CS→EN      1 of 4     1 of 12
DE→EN      1 of 6     1 of 10
RO→EN      2 of 5     2 of 7
RU→EN      3 of 6     5 of 10

Table 7: Automatic (BLEU) and human ranking of our submitted systems (uedin-nmt) at the WMT16 shared news translation task. Automatic rankings are taken from http://matrix.statmt.org, only considering primary systems. Human rankings include anonymous online systems, and for EN↔CS, systems from the tuning task.
Conclusion

We describe Edinburgh's neural machine translation systems for the WMT16 shared news translation task. For all translation directions, we observe large improvements in translation quality from using synthetic parallel training data, obtained by back-translating in-domain monolingual target-side data. Pervasive dropout on all layers was used for English↔Romanian, and gave substantial improvements. For English↔German and English→Czech, we trained a right-to-left model with reversed target side, and we found reranking the system output with these reversed models helpful.
Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC) and 644402 (HimL).
References

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

[Bojar et al.2016] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

[Gage1994] Philip Gage. 1994. A New Algorithm for Data Compression. C Users J., 12(2):23–38, February.

[Gal2015] Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

[Junczys-Dowmunt et al.2016] Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

[Lafferty et al.2001] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Liu et al.2016] Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Agreement on Target-bidirectional Neural Machine Translation. In NAACL HLT 16, San Diego, CA.

[Pascanu et al.2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 1310–1318, Atlanta, GA, USA.

[Peter et al.2016] Jan-Thorsten Peter, Tamer Alkhouli, Hermann Ney, Matthias Huck, Fabienne Braune, Alexander Fraser, Aleš Tamchyna, Ondřej Bojar, Barry Haddow, Rico Sennrich, Frédéric Blain, Lucia Specia, Jan Niehues, Alex Waibel, Alexandre Allauzen, Lauriane Aufrant, Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon, and Marcis Pinnis. 2016. The QT21/HimL Combined Machine Translation System. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany.

[Pham et al.2014] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2014. Dropout Improves Recurrent Neural Networks for Handwriting Recognition, pages 285–290.

[Sennrich et al.2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

[Sennrich et al.2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. ArXiv e-prints.