On Using Very Large Target Vocabulary for Neural Machine Translation
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio
Université de Montréal
(Yoshua Bengio is also a CIFAR Senior Fellow)
Abstract
Neural machine translation, a recently proposed approach to machine translation based purely on neural networks, has shown promising results compared to the existing approaches such as phrase-based statistical machine translation. Despite its recent success, neural machine translation has its limitation in handling a larger vocabulary, as training complexity as well as decoding complexity increase proportionally to the number of target words. In this paper, we propose a method based on importance sampling that allows us to use a very large target vocabulary without increasing training complexity. We show that decoding can be efficiently done even with the model having a very large target vocabulary by selecting only a small subset of the whole target vocabulary. The models trained by the proposed approach are empirically found to match, and in some cases outperform, the baseline models with a small vocabulary as well as the LSTM-based neural machine translation models. Furthermore, when we use an ensemble of a few models with very large target vocabularies, we achieve performance comparable to the state of the art (measured by BLEU) on both the English→German and English→French translation tasks of WMT'14.
1 Introduction

Neural machine translation (NMT) is a recently introduced approach to solving machine translation (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2014; Sutskever et al., 2014). In neural machine translation, one builds a single neural network that reads a source sentence and generates its translation. The whole neural network is jointly trained to maximize the conditional probability of a correct translation given a source sentence, using the bilingual corpus. The NMT models have been shown to perform as well as the most widely used conventional translation systems (Sutskever et al., 2014; Bahdanau et al., 2014).

Neural machine translation has a number of advantages over the existing statistical machine translation system, specifically, the phrase-based system (Koehn et al., 2003). First, NMT requires a minimal set of domain knowledge. For instance, none of the models proposed in (Sutskever et al., 2014), (Bahdanau et al., 2014) or (Kalchbrenner and Blunsom, 2013) assumes any linguistic property of the source and target sentences, except that they are sequences of words. Second, the whole system is jointly tuned to maximize the translation performance, unlike the existing phrase-based system, which consists of many feature functions that are tuned separately. Lastly, the memory footprint of the NMT model is often much smaller than that of the existing system, which relies on maintaining large tables of phrase pairs.

Despite these advantages and promising results, there is a major limitation in NMT compared to the existing phrase-based approach: the number of target words must be limited. This is mainly because the complexity of training and using an NMT model increases as the number of target words increases.

A usual practice is to construct a target vocabulary of the k most frequent words (a so-called shortlist), where k is often in the range of 30,000 (Bahdanau et al., 2014) to 80,000 (Sutskever et al., 2014). Any word not included in this vocabulary is mapped to a special token representing an unknown word, [UNK]. This approach works well when there are only a few unknown words in the target sentence, but it has been observed that the translation performance degrades rapidly as the number of unknown words increases (Cho et al., 2014a; Bahdanau et al., 2014).

In this paper, we propose an approximate training algorithm based on (biased) importance sampling that allows us to train an NMT model with a much larger target vocabulary. The proposed algorithm effectively keeps the computational complexity during training at the level of using only a small subset of the full vocabulary. Once the model with a very large target vocabulary is trained, one can choose to use either all the target words or only a subset of them.

We compare the proposed algorithm against the baseline shortlist-based approach on the tasks of English→French and English→German translation, using the NMT model introduced in (Bahdanau et al., 2014). The empirical results demonstrate that we can potentially achieve better translation performance using larger vocabularies, and that our approach does not sacrifice too much speed for either training or decoding. Furthermore, we show that the model trained with this algorithm achieves the best translation performance yet reported by single NMT models on the WMT'14 English→French translation task.
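To make the idea concrete before the formal description, the following minimal sketch (plain numpy; the function name and shapes are illustrative and do not come from our actual Theano implementation) shows the core of training with a sampled softmax: the cross-entropy at each step is computed over a small subset V′ of the target vocabulary, so the per-step cost grows with |V′| rather than with the full vocabulary size |V|. With a proposal distribution that is uniform over V′, the importance weights cancel and the update is exactly a softmax restricted to V′.

```python
import numpy as np

def sampled_softmax_loss(h, W, b, target, sample_ids):
    """Cross-entropy over a sampled subset V' of the target vocabulary.

    h          -- decoder feature vector, shape (d,)
    W, b       -- full output weights (|V|, d) and biases (|V|,)
    target     -- full-vocabulary index of the correct word
                  (assumed to be contained in sample_ids)
    sample_ids -- indices defining the subset V'
    """
    energies = W[sample_ids] @ h + b[sample_ids]   # O(|V'| d), not O(|V| d)
    energies -= energies.max()                     # numerical stability
    probs = np.exp(energies) / np.exp(energies).sum()
    pos = int(np.where(sample_ids == target)[0][0])
    return -np.log(probs[pos])

# Toy usage: a 500k-word vocabulary, but only ~30k energies are computed.
rng = np.random.default_rng(0)
d, V, tau = 8, 500_000, 30_000
W, b, h = rng.standard_normal((V, d)), np.zeros(V), rng.standard_normal(d)
target = 42
sample_ids = np.unique(np.append(rng.integers(0, V, tau), target))
print(sampled_softmax_loss(h, W, b, target, sample_ids))
```

Decoding can reuse the same restricted softmax, with V′ chosen per source sentence (a candidate list) instead of per training partition.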
2 Neural Machine Translation

In this section, we briefly describe an approach to neural machine translation proposed recently in (Bahdanau et al., 2014). Based on this description, we explain the issue of limited vocabularies in neural machine translation.
Neural machine translation is a recently proposed approach to machine translation, which uses a single neural network trained jointly to maximize the translation performance (Forcada and Ñeco, 1997; Kalchbrenner and Blunsom, 2013; Cho et al., 2014b; Sutskever et al., 2014; Bahdanau et al., 2014).

Neural machine translation is often implemented as an encoder–decoder network. The encoder reads the source sentence x = (x_1, ..., x_T) and encodes it into a sequence of hidden states h = (h_1, ..., h_T):

    h_t = f(x_t, h_{t-1}).    (1)

Then the decoder, another recurrent neural network, generates a corresponding translation y = (y_1, ..., y_{T'}) based on the encoded sequence of hidden states h:

    p(y_t | y_{<t}, x) = (1/Z) exp{ q(y_{t-1}, z_t, c_t) },    (2)

where z_t is the decoder's hidden state, c_t is a context vector computed from h, and the normalization constant Z sums the exponentiated energies q over every word in the target vocabulary. Computing Z at every step is what ties the training and decoding cost to the vocabulary size.

4 Experiments

4.1 Settings

We evaluate the proposed approach on the English→French and English→German translation tasks of WMT'14, whose training data include the Europarl v7, Common Crawl and News Commentary corpora. To ensure fair comparison, the English→French corpus, which comprises approximately 12 million sentences, is identical to the one used in (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2014; Sutskever et al., 2014); the preprocessed data is publicly available for download. As for English→German, the corpus was preprocessed, in a manner similar to (Peitz et al., 2014; Li et al., 2014), in order to remove many poorly translated sentences.

We evaluate the models on the WMT'14 test set (news-test 2014), while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). To compare with previous submissions, we use the filtered test sets. Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.

Table 1: Data coverage (in %) on target-side corpora for different vocabulary sizes. "All" refers to all the tokens in the training set.

             English-French        English-German
  Vocab      Train     Test        Train     Test
  15k        93.5      90.8        88.5      83.8
  30k        96.0      94.6        91.8      87.9
  50k        97.3      96.3        93.7      90.4
  500k       99.5      99.3        98.4      96.1
  All        100.0     99.6        100.0     97.3

Unless mentioned otherwise, all reported BLEU scores (Papineni et al., 2002) are computed with the multi-bleu.perl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) on the cased tokenized translations.

As a baseline for English→French translation, we use the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words; the authors of (Bahdanau et al., 2014) gave us access to their trained models, and we chose the best one on the validation set and resumed training. Another RNNsearch model is trained for English→German translation with 50,000 source and target words.

For each language pair, we train another set of RNNsearch models with much larger vocabularies of 500,000 source and target words, using the proposed approach. We call these models RNNsearch-LV. We vary the size of the shortlist used during training (τ in Sec. 3.1): we tried 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. We later report the results for the best performance on the development set, with models generally evaluated every twelve hours.

For both language pairs, we also trained new models, with τ = 15,000 and τ = 50,000, by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch.

To stabilize parameters other than the word embeddings, at the end of the training stage we freeze the word embeddings and tune only the other parameters for approximately two more days after the peak performance on the development set is observed. This helped increase BLEU scores on the development set.
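For illustration, one plausible way to realize the per-partition shortlists controlled by τ, consistent with their role described above (a sketch under the assumption that partitions are formed by a sequential scan of the target side; helper names are ours, not the actual implementation):

```python
def partition_corpus(target_sentences, tau):
    """Sequentially split a tokenized target-side corpus into partitions,
    each covering at most `tau` unique target words (its shortlist V').
    A sentence with more than `tau` unique words forms its own partition."""
    partitions = []
    shortlist, current = set(), []
    for sent in target_sentences:
        words = set(sent)
        if current and len(shortlist | words) > tau:
            partitions.append((current, shortlist))   # close the partition
            shortlist, current = set(), []
        shortlist |= words
        current.append(sent)
    if current:
        partitions.append((current, shortlist))
    return partitions

# Example with a tiny corpus and tau = 4:
corpus = [["a", "b"], ["b", "c"], ["d", "e"], ["a", "f"]]
for sents, shortlist in partition_corpus(corpus, tau=4):
    print(len(sents), sorted(shortlist))
```

Reshuffling the dataset each epoch, as above, changes which sentences fall into a partition, and hence which words are contrasted against each other.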
We use beam search to generate a translation given a source sentence. During beam search, we keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences, as in (Cho et al., 2014a); these experimental details differ from (Bahdanau et al., 2014). The candidate list is chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}.

As explained in Sec. 3.2, we test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations. The bilingual dictionary is built using fast_align (Dyer et al., 2013). We use the dictionary only if a word starts with a lowercase letter, and otherwise we copy the source word directly. This led to better performance on the development sets.

4.2 Results

In Table 2, we present the results obtained by the trained models with very large target vocabularies, and alongside them the previous results reported in (Sutskever et al., 2014), (Luong et al., 2014), (Buck et al., 2014) and (Durrani et al., 2014). Without translation-specific strategies, we can clearly see that the RNNsearch-LV outperforms the baseline RNNsearch.

Table 2: The translation performances in BLEU obtained by different models on (a) English→French and (b) English→German translation tasks. RNNsearch is the model proposed in (Bahdanau et al., 2014), RNNsearch-LV is the RNNsearch trained with the approach proposed in this paper, and Google is the LSTM-based model proposed in (Sutskever et al., 2014). Unless mentioned otherwise, we report single-model RNNsearch-LV scores using τ = 30,000 (English→French) and τ = 50,000 (English→German). For the experiments we have run ourselves, we show the scores on the development set as well in the brackets. (⋆) (Sutskever et al., 2014), (◦) (Luong et al., 2014), (•) (Durrani et al., 2014), (∗) Standard Moses Setting (Cho et al., 2014b), (♦) (Buck et al., 2014).

(a) English→French
                       RNNsearch       RNNsearch-LV    Google       Phrase-based SMT
  Basic NMT            29.97 (26.58)   32.68 (28.76)   30.6 (⋆)     33.3 (∗), 37.03 (•)
  +Candidate List      –               33.36 (29.32)   –
  +UNK Replace         33.08 (29.08)   34.11 (29.98)   33.1 (◦)
  +Reshuffle (τ=50k)   –               34.60 (30.53)   –
  +Ensemble            –               37.19 (31.98)   37.5 (◦)

(b) English→German
                       RNNsearch       RNNsearch-LV    Phrase-based SMT
  Basic NMT            16.46 (17.13)   16.95 (17.85)   20.67 (♦)
  +Candidate List      –               17.46 (18.00)
  +UNK Replace         18.97 (19.16)   18.89 (19.03)
  +Reshuffle           –               19.40 (19.37)
  +Ensemble            –               21.59 (21.06)

In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translation-specific techniques (Sec. 3.2–3.3). With these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system (Cho et al., 2014b). Furthermore, by combining 8 models, we were able to achieve a translation performance comparable to the state of the art, measured in BLEU.

For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so the two systems performed similarly. We could reach higher large-vocabulary single-model performance by reshuffling the dataset, but this step could potentially also help the baseline. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models.
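For concreteness, the construction of the per-sentence candidate list used in the "+Candidate List" rows of Table 2 can be sketched as follows (a simplified sketch; the dictionary format and helper names are illustrative): the target vocabulary for a source sentence is the union of the K most frequent target words and up to K′ dictionary translations of each source word.

```python
def build_candidate_list(source_words, dictionary, frequent_words, k_prime):
    """Per-sentence target vocabulary: the K most frequent target words
    (frequent_words, already truncated to length K) together with at most
    k_prime likely translations of each source word.

    `dictionary` maps a source word to its candidate translations,
    sorted by translation probability (e.g. from fast_align)."""
    candidates = set(frequent_words)
    for w in source_words:
        candidates.update(dictionary.get(w, [])[:k_prime])
    return candidates

# Example with toy entries:
dictionary = {"chat": ["cat", "kitty"], "noir": ["black", "dark"]}
frequent_words = ["the", "a", "of"]        # stands in for the top-K list
print(sorted(build_candidate_list(["chat", "noir"], dictionary,
                                  frequent_words, k_prime=1)))
# -> ['a', 'black', 'cat', 'of', 'the']
```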
With τ = 15,000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.

For each language pair, we began training four models, from each of which two points corresponding to the best and second-best performance on the development set were collected. We continued training from each point, while keeping the word embeddings fixed, until the best development performance was reached, and took the model at this point as a single model in an ensemble. This procedure resulted in a total of eight models, but because much of training had been shared, the composition of the ensemble may be sub-optimal. This is supported by the fact that higher cross-model BLEU scores (Freitag et al., 2014) are observed for models that were partially trained together.

In Table 3, we present the timing information of decoding for different models. Clearly, decoding from RNNsearch-LV with the full target vocabulary is slowest. If we use a candidate list for decoding each translation, the speed of decoding substantially improves and becomes close to the baseline RNNsearch.

Table 3: The average per-word decoding time. Decoding here does not include parameter loading and unknown word replacement. The baseline uses 30,000 words. The candidate list is built with K = 30,000 and K′ = 10. (⋆) i7-4820K (single thread), (◦) GTX TITAN Black.

                                   CPU (⋆)    GPU (◦)
  RNNsearch                        0.09 s     0.02 s
  RNNsearch-LV                     0.80 s     0.25 s
  RNNsearch-LV +Candidate list     0.12 s     0.05 s

A potential issue with using a candidate list is that, for each source sentence, we must re-build a target vocabulary and subsequently replace a part of the parameters, which may easily become time-consuming. We can address this issue, for instance, by building a common candidate list for multiple source sentences. By doing so, we were able to match the decoding speed of the baseline RNNsearch model.

For English→French (τ = 30,000), we evaluate the influence of the target vocabulary when translating the test sentences by using the union of a fixed set of common words and (at most) K′ likely candidates for each source word according to the dictionary. Results are presented in Figure 1.

[Figure 1: Single-model test BLEU scores (English→French) with respect to the number of dictionary entries K′ allowed for each source word, with and without UNK replacement.]

With K′ = 0 (not shown), the performance of the system is comparable to the baseline when not replacing the unknown words (30.12), but there is not as much improvement when doing so (31.14). As the large-vocabulary model does not predict [UNK] as much during training, it is less likely to generate it when decoding, limiting the effectiveness of the post-processing step in this case. With K′ = 1, which limits the diversity of allowed uncommon words, BLEU is not as good as with moderately larger K′, which indicates that our models can, to some degree, correctly choose between rare alternatives. If we rather use K = 50,000, as we did for testing based on validation performance, the improvement over K′ = 1 is approximately 0.2 BLEU.

When validating the choice of K, we found it to be correlated with which τ was chosen during training. For example, on the English→French validation set, with τ = 15,000 (and K′ = 10), the BLEU score is 29.44 with K = 15,000, but drops for K = 30,000 and 50,000. For τ = 30,000, scores increase moderately from K = 15,000 to K = 50,000. Similar effects were observed for English→German and on the test sets. As our implementation of importance sampling does not apply the usual correction to the gradient, it seems beneficial for the test vocabularies to resemble those used during training.
5 Conclusion

In this paper, we proposed a way to extend the size of the target vocabulary for neural machine translation. The proposed approach allows us to train a model with a much larger target vocabulary without any substantial increase in computational complexity. It is based on the earlier work in (Bengio and Sénécal, 2008), which used importance sampling to reduce the complexity of computing the normalization constant of the output word probability in neural language models.

On the English→French and English→German translation tasks, we observed that the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words. As the performance of the RNNsearch-LV models increased when only a selected subset of the target vocabulary was used during decoding, this makes the proposed learning algorithm more practical.

When measured by BLEU, our models showed translation performance comparable to the state-of-the-art translation systems on both the English→French and English→German tasks. On the English→French task, a model trained with the proposed approach outperformed the best single neural machine translation (NMT) model from (Luong et al., 2014) by approximately 1 BLEU point. The performance of the ensemble of multiple models, despite its relatively less diverse composition, is approximately 0.3 BLEU points away from the best system (Luong et al., 2014). On the English→German task, the best performance of 21.59 BLEU by our model is higher than that of the previous state of the art (20.67) reported in (Buck et al., 2014).

Acknowledgments

The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Technical report, arXiv preprint arXiv:1409.0473.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Yoshua Bengio and Jean-Sébastien Sénécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4):713–722.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the Common Crawl. In Proceedings of the Language Resources and Evaluation Conference, Reykjavík, Iceland, May.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, October.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), October.

Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. 2014. Edinburgh's phrase-based machine translation systems for WMT-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 97–104. Association for Computational Linguistics, Baltimore, MD, USA.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.

Mikel L. Forcada and Ramón P. Ñeco. 1997. Recursive hetero-associative memories for translation. In José Mira, Roberto Moreno-Díaz, and Joan Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, et al. 2014. EU-BRIDGE MT: Combined machine translation. ACL 2014, page 105.

M. Gutmann and A. Hyvarinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10).

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Liangyou Li, Xiaofeng Wu, Santiago Cortes Vaillo, Jun Xie, Andy Way, and Qun Liu. 2014. The DCU-ICTCAS MT system at WMT 2014 on German-English translation task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 136–141, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. 2014. The RWTH Aachen German-English machine translation system for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 157–162, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014).