Learning to Parse and Translate Improves Neural Machine Translation
Akiko Eriguchi†, Yoshimasa Tsuruoka†, and Kyunghyun Cho‡
†The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
{eriguchi, tsuruoka}@logos.t.u-tokyo.ac.jp
‡New York University, New York, NY 10012, USA
[email protected]
Abstract
There has been relatively little attention to incorporating linguistic prior to neural machine translation. Much of the previous work was further constrained to considering linguistic prior on the source side. In this paper, we propose a hybrid model, called NMT+RNNG, that learns to parse and translate by combining the recurrent neural network grammar into the attention-based neural machine translation. Our approach encourages the neural machine translation model to incorporate linguistic prior during training, and lets it translate on its own afterward. Extensive experiments with four language pairs show the effectiveness of the proposed NMT+RNNG.
Neural Machine Translation (NMT) has enjoyed impressive success without relying on much, if any, prior linguistic knowledge. Some of the most recent studies have for instance demonstrated that NMT systems work comparably to other systems even when the source and target sentences are given simply as flat sequences of characters (Lee et al., 2016; Chung et al., 2016) or statistically, not linguistically, motivated subword units (Sennrich et al., 2016; Wu et al., 2016). Shi et al. (2016) recently made an observation that the encoder of NMT captures syntactic properties of a source sentence automatically, indirectly suggesting that explicit linguistic prior may not be necessary.

On the other hand, there have only been a couple of recent studies showing the potential benefit of explicitly encoding linguistic prior into NMT. Sennrich and Haddow (2016) for instance proposed to augment each source word with its corresponding part-of-speech tag, lemmatized form and dependency label. Eriguchi et al. (2016) instead replaced the sequential encoder with a tree-based encoder which computes the representation of the source sentence following its parse tree. Stahlberg et al. (2016) let the lattice from a hierarchical phrase-based system guide the decoding process of neural machine translation, which results in two separate models rather than a single end-to-end one. Despite the promising improvements, these explicit approaches are limited in that the trained translation model strictly requires the availability of external tools during inference time.

More recently, researchers have proposed methods to incorporate target-side syntax into NMT models. Alvarez-Melis and Jaakkola (2017) have proposed a doubly-recurrent neural network that can generate a tree-structured sentence, but its effectiveness in a full-scale NMT task is yet to be shown. Aharoni and Goldberg (2017) introduced a method that serializes a parse tree and trains on the serialized parsed sentences.

We propose to implicitly incorporate linguistic prior based on the idea of multi-task learning (Caruana, 1998; Collobert et al., 2011). More specifically, we design a hybrid decoder for NMT, called NMT+RNNG, that combines a usual conditional language model and the recently proposed recurrent neural network grammars (RNNGs; Dyer et al., 2016). This is done by plugging the conventional language model decoder into the place of the buffer in the RNNG, while sharing a subset of parameters, such as word vectors, between the language model and the RNNG. We train this hybrid model to maximize both the log-probability of a target sentence and the log-probability of a parse action sequence. We use an external parser (Andor et al., 2016) to generate target parse actions, but unlike the previous explicit approaches, we do not need it during test time. Our code is available at https://github.com/tempra28/nmtrnng.

We evaluate the proposed NMT+RNNG on four language pairs ({Jp, Cs, De, Ru}-En). We observe significant improvements in terms of BLEU scores on three out of four language pairs and RIBES scores on all the language pairs.

Neural machine translation is a recently proposed framework for building a machine translation system based purely on neural networks. It is often built as an attention-based encoder-decoder network (Cho et al., 2015) with two recurrent networks (an encoder and a decoder) and an attention model.
The encoder, which is often implemented as a bidirectional recurrent network with long short-term memory units (LSTM; Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU; Cho et al., 2014), first reads a source sentence represented as a sequence of words x = (x_1, x_2, ..., x_N). The encoder returns a sequence of hidden states h = (h_1, h_2, ..., h_N). Each hidden state h_i is a concatenation of those from the forward and backward recurrent networks:

h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], where \overrightarrow{h}_i = \overrightarrow{f}_{enc}(\overrightarrow{h}_{i-1}, V_x(x_i)) and \overleftarrow{h}_i = \overleftarrow{f}_{enc}(\overleftarrow{h}_{i+1}, V_x(x_i)).

V_x(x_i) refers to the word vector of the i-th source word.

The decoder is implemented as a conditional recurrent language model which models the target sentence, or translation, as

log p(y | x) = \sum_j log p(y_j | y_{<j}, x).   (1)

Construction

First, we replace the hidden state of the buffer h^{buffer} (in Eq. (5)) with the hidden state of the decoder of the attention-based neural machine translation from Eq. (3). As is clear from those two equations, both the buffer sLSTM and the translation decoder take as input the previous hidden state (h^{buffer}_{top} and s_{j-1}, respectively) and the previously decoded word (or the previously shifted word in the case of the RNNG's buffer), and return a summary state. The only difference is that the translation decoder additionally considers the state \tilde{s}_{j-1}. Once the buffer of the RNNG is replaced with the NMT decoder in our proposed model, the NMT decoder is also under the control of the actions provided by the RNNG. Second, we let the next-word prediction of the translation decoder act as the generator of the RNNG. In other words, the generator of the RNNG will output a word, when asked by the shift action, according to the conditional distribution defined by the translation decoder in Eq. (1). Once the buffer sLSTM is replaced with the neural translation decoder, the action sLSTM naturally takes as input the translation decoder's hidden state when computing the action conditional distribution in Eq. (4). We call this hybrid model NMT+RNNG. Note that the j-th hidden state in Eq. (3) is computed only when the shift action is predicted by the RNNG; this is why our proposed model can handle sequences of words and actions that have different lengths.

Learning and Inference

After this integration, our hybrid NMT+RNNG models the conditional distribution over all possible pairs of a translation and its parse given a source sentence, i.e., p(y, a | x). Assuming the availability of parse annotation on the target side of a parallel corpus, we train the whole model jointly to maximize E_{(x, y, a) ∼ data}[log p(y, a | x)]. In doing so, we notice that there are two separate paths through which the neural translation decoder receives an error signal. First, the decoder is updated in order to maximize the conditional probability of the correct next word, which already exists in the original neural machine translation. Second, the decoder is updated also to maximize the conditional probability of the correct parsing action, which is a novel learning signal introduced by the proposed hybridization. Furthermore, the second learning signal affects the encoder as well, encouraging the whole neural translation model to be aware of the syntactic structure of the target language. Later in the experiments, we show that this additional learning signal is useful for translation, even though we discard the RNNG (the stack and action sLSTMs) at inference time.
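To make the two learning signals concrete, the following is a minimal sketch of the idea, written in PyTorch with hypothetical names and simplified components; it is not the authors' released implementation. The decoder hidden state feeds both the next-word distribution and an action distribution, and the training loss is the sum of the two negative log-likelihoods. The RNNG's stack and action sLSTMs are collapsed into a single linear action classifier here, and aligning word steps with shift actions is assumed to happen outside this snippet.

```python
import torch
import torch.nn as nn

class HybridDecoderStep(nn.Module):
    """One decoder step: the hidden state is shared by word prediction (NMT)
    and action prediction (a stand-in for the RNNG's stack/action sLSTMs)."""
    def __init__(self, vocab_size, num_actions, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # target word vectors (shared)
        self.cell = nn.LSTMCell(emb_dim + hid_dim, hid_dim)  # decoder doubles as the RNNG buffer
        self.word_out = nn.Linear(hid_dim, vocab_size)       # generator: emits a word on "shift"
        self.action_out = nn.Linear(hid_dim, num_actions)    # simplified action classifier

    def forward(self, prev_word, context, state):
        # context: attention vector over the encoder states (attention itself omitted here)
        inp = torch.cat([self.embed(prev_word), context], dim=-1)
        h, c = self.cell(inp, state)
        return self.word_out(h), self.action_out(h), (h, c)

def joint_loss(word_logits, gold_words, action_logits, gold_actions):
    """Sum of the translation and parsing negative log-likelihoods.
    At test time only the word logits are used; the action branch is discarded."""
    nll = nn.CrossEntropyLoss()
    return nll(word_logits, gold_words) + nll(action_logits, gold_actions)
```

Because the action branch shares the decoder state (and, through attention, the encoder), maximizing the action likelihood pushes target-side syntactic information into the translation model, which is the effect described above.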
A major challenge in training the proposed hybrid model is that there is no parallel corpus augmented with gold-standard target-side parses, and vice versa. In other words, we must either parse the target-side sentences of an existing parallel corpus or translate sentences with existing gold-standard parses. As the target task of the proposed model is translation, we start with a parallel corpus and annotate its target-side sentences. It is however costly to manually annotate any corpus of reasonable size (Table 6 in Alonso et al., 2016).

We instead resort to noisy, but automated, annotation using an existing parser. This approach of automated annotation can be considered along the line of recently proposed techniques of knowledge distillation (Hinton et al., 2015) and distant supervision (Mintz et al., 2009). In knowledge distillation, a teacher network is trained purely on a training set with ground-truth annotations, and the annotations predicted by this teacher are used to train a student network; this is similar to our approach, where the external parser could be thought of as a teacher and the proposed hybrid network's RNNG as a student. On the other hand, what we propose here is a special case of distant supervision in that the external parser provides noisy annotations to an otherwise unlabeled training set.

Specifically, we run SyntaxNet, released by Andor et al. (2016), on each target sentence. We convert a parse tree into a sequence of one of three transition actions (SHIFT, REDUCE-L, REDUCE-R). We label each REDUCE action with the corresponding dependency label and treat it as a more fine-grained action (see the sketch at the end of this section). When the target sentences are parsed during data preprocessing, we use all the vocabulary in a corpus and do not cut off any words. We use the plain SyntaxNet and do not train it further.

We compare the proposed NMT+RNNG against the baseline model on four different language pairs: Jp-En, Cs-En, De-En and Ru-En. The basic statistics of the training data are presented in Table 1. We mapped all the low-frequency words to the unique symbol "UNK" and inserted a special symbol "EOS" at the end of both source and target sentences.

        Train    Dev.   Test   Voc. (src, tgt, act)
Cs-En   134,453  2,656  2,999  (33,867, 27,347, 82)
De-En   166,313  2,169  2,999  (33,820, 30,684, 80)
Ru-En   131,492  2,818  2,998  (32,442, 27,979, 82)
Jp-En   100,000  1,790  1,812  (23,509, 28,591, 80)
Table 1: Statistics of parallel corpora.

Jp  We use the ASPEC corpus ("train1.txt") from the WAT'16 Jp-En translation task. We tokenize each Japanese sentence with KyTea (Neubig et al., 2011) and preprocess according to the recommendations from WAT'16 (WAT, 2016). We use the first 100K sentence pairs of length shorter than 50 for training. The vocabulary is constructed with all the unique tokens that appear at least twice in the training corpus. We use "dev.txt" and "test.txt" provided by WAT'16 as the development and test sets, respectively.

Cs, De and Ru  We use News Commentary v8. We removed noisy metacharacters and used the tokenizer from Moses (Koehn et al., 2007) to build a vocabulary of each language using unique tokens that appear at least 6, 6 and 5 times, respectively, for Cs, Ru and De. The target-side (English) vocabulary was constructed with all the unique tokens appearing more than three times in each corpus. We also excluded the sentence pairs which include empty lines on either the source or the target side. We only use sentence pairs of length 50 or less for training. We use "newstest2015" and "newstest2016" as the development and test sets, respectively.
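As referenced above, a dependency parse can be converted into a transition-action sequence with a standard static oracle. The sketch below is illustrative only: the arc-standard convention used here (REDUCE-L makes the top of the stack the head of the element beneath it, REDUCE-R the opposite) and the function name are assumptions, not necessarily the exact definitions applied to SyntaxNet's output.

```python
def parse_to_actions(heads, labels):
    """Convert a projective dependency parse into SHIFT / REDUCE-L / REDUCE-R actions.
    heads[i] is the index of token i's head (-1 for the root);
    labels[i] is the dependency label of that arc."""
    n = len(heads)
    remaining = [sum(1 for h in heads if h == i) for i in range(n)]  # children not yet attached
    stack, buffer, actions = [], list(range(n)), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2 and heads[stack[-2]] == stack[-1] and remaining[stack[-2]] == 0:
            dep = stack.pop(-2)                       # top of stack is the head
            actions.append(f"REDUCE-L({labels[dep]})")  # labeled, fine-grained action
            remaining[stack[-1]] -= 1
        elif len(stack) >= 2 and heads[stack[-1]] == stack[-2] and remaining[stack[-1]] == 0:
            dep = stack.pop(-1)                       # element beneath the top is the head
            actions.append(f"REDUCE-R({labels[dep]})")
            remaining[stack[-1]] -= 1
        elif buffer:
            stack.append(buffer.pop(0))
            actions.append("SHIFT")                   # in NMT+RNNG, SHIFT emits the next word
        else:
            break  # non-projective tree: cannot be fully reduced by these actions
    return actions
```

For example, "I saw her" with "saw" as the root yields SHIFT, SHIFT, REDUCE-L(nsubj), SHIFT, REDUCE-R(dobj); labeling the REDUCE actions in this way produces the roughly 80 fine-grained actions reported in the "act" column of Table 1.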
In all our experiments, each recurrent network has a single layer of LSTM units of 256 dimensions, and the word vectors and the action vectors are of 256 and 128 dimensions, respectively. To reduce computational overhead, we use BlackOut (Ji et al., 2015) with 2,000 negative samples. When employing BlackOut, we share the negative samples of each target word in a sentence at training time (Hashimoto and Tsuruoka, 2017), which is similar to the previous work (Zoph et al., 2016). For the proposed NMT+RNNG, we share the target word vectors between the decoder (buffer) and the stack sLSTM.

Each weight is initialized from a uniform distribution. The bias vectors and the weights of the softmax and BlackOut are initialized to zero. The forget gate biases of LSTMs and stack-LSTMs are initialized to 1, as recommended in Józefowicz et al. (2015). We use stochastic gradient descent with minibatches of 128 examples. Whenever the perplexity on the development set increases during training, we halve the learning rate and reload the previous model (a short sketch of this schedule is given below). We clip the norm of the gradient (Pascanu et al., 2012) with the threshold set to 3.0 (2.0 for the baseline models on Ru-En and Cs-En to avoid NaN and Inf).

The RNNG's stack computes the vector of a dependency parse tree which consists of the target words generated by the buffer. Since the complete parse tree has a "ROOT" node, the special end-of-sentence token ("EOS") is treated as the ROOT. We use beam search at inference time, with the beam width selected based on development set performance.

In our implementation, it took about 15 minutes per epoch for the baseline and about 20 minutes for the proposed model to train on the full Jp-En parallel corpus, running all experiments on multi-core CPUs (10 threads on an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz).
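The sketch below illustrates the optimization schedule described above, with assumed names and defaults (the initial learning rate, the stopping criterion, and the helper functions are illustrative, not taken from the paper):

```python
import copy
import torch

def train(model, train_batches, dev_perplexity, lr=1.0, clip=3.0, max_epochs=20):
    """SGD with gradient-norm clipping; halve the learning rate and reload the
    previous model whenever development perplexity increases.
    `model(batch)` is assumed to return the joint loss (word NLL + action NLL),
    and `dev_perplexity(model)` the perplexity on the development set."""
    best_ppl, best_state = float("inf"), copy.deepcopy(model.state_dict())
    for epoch in range(max_epochs):
        for batch in train_batches:
            model.zero_grad()
            loss = model(batch)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # norm clipping
            for p in model.parameters():                              # plain SGD update
                if p.grad is not None:
                    p.data.add_(p.grad, alpha=-lr)
        ppl = dev_perplexity(model)
        if ppl > best_ppl:
            lr *= 0.5                              # dev perplexity went up: halve the rate
            model.load_state_dict(best_state)      # and restart from the previous model
        else:
            best_ppl, best_state = ppl, copy.deepcopy(model.state_dict())
    return model
```

The default clipping threshold of 3.0 mirrors the value reported above; for the Ru-En and Cs-En baselines it would be set to 2.0.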
In Table 2, we report the translation quality of the tested models on all four language pairs. We report both BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010). Except for De-En measured in BLEU, we observe statistically significant improvements by the proposed NMT+RNNG over the baseline model. It is worthwhile to note that these significant improvements have been achieved without any additional parameters or computational overhead at inference time.

Table 2: BLEU and RIBES scores by the baseline NMT and the proposed NMT+RNNG on the test sets of De-En, Ru-En, Cs-En and Jp-En. Statistical significance is computed with the bootstrap resampling method from Koehn (2004); † marks the statistically significant cases.

Ablation

Since each component in the RNNG may be omitted, we ablate each component in the proposed NMT+RNNG to verify their necessity. As shown in Table 3, the best performance could only be achieved when all three components were present. Removing the stack had the most adverse effect, which was found to be the case for parsing as well by Kuncoro et al. (2017). Since the buffer is the decoder, it is not possible to completely remove it; instead, we simply remove the dependency of the action distribution on it.

Jp-En (Dev)    BLEU
NMT+RNNG       18.60
 w/o Buffer    18.02
 w/o Action    17.94
 w/o Stack     17.58
NMT            17.75
Table 3: Effect of each component in RNNG.

Generated Sentences with Parsed Actions

The decoder part of our proposed model consists of two components: the NMT decoder to generate a translated sentence and the RNNG decoder to predict its parsing actions. The proposed model can therefore output a dependency structure along with a translated sentence. Figure 1 shows an example of Jp-En translation from the development dataset and its dependency parse tree obtained by the proposed model. The special symbol ("EOS") is treated as the root node ("ROOT") of the parsed tree. The translated sentence was generated using beam search, with the same setting as the NMT+RNNG shown in Table 3. The parsing actions were obtained by greedy search. The resulting dependency structure is mostly correct but contains a few errors; for example, the dependency relation between "The" and "transition" should not be "pobj".

Figure 1: An example of translation and its dependency relations obtained by our proposed model.

We propose a hybrid model, to which we refer as NMT+RNNG, that combines the decoder of an attention-based neural translation model with the RNNG. This model learns to parse and translate simultaneously, and training it encourages both the encoder and the decoder to better incorporate linguistic priors. Our experiments confirmed its effectiveness on four language pairs ({Jp, Cs, De, Ru}-En). The RNNG can in principle be trained without ground-truth parses, which would eliminate the need for external parsers completely. We leave the investigation of this possibility for future research.

Acknowledgments

We thank Yuchen Qiao and Kenjiro Taura for their help in speeding up the implementations of training, and Kazuma Hashimoto for his valuable comments and discussions. This work was supported by JST CREST Grant Number JPMJCR1513 and JSPS KAKENHI Grant Numbers 15J12597 and 6H01715. KC thanks support by eBay, Facebook, Google and NVIDIA.

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. To appear.

Héctor Martínez Alonso, Djamé Seddah, and Benoît Sagot. 2016. From noisy questions to Minecraft texts: Annotation challenges in extreme syntax scenario. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). pages 13–23.

David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly-recurrent neural networks. In Proceedings of International Conference on Learning Representations 2017.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 2442–2452.

Rich Caruana. 1998. Multitask learning. In Learning to Learn, Springer, pages 95–133.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pages 1724–1734.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 1693–1703.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. pages 334–343.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 199–209.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 823–833.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. arXiv preprint arXiv:1702.02265.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. pages 944–952.

Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, and Pradeep Dubey. 2015. BlackOut: Speeding up recurrent neural network language models with very large vocabularies. In Proceedings of International Conference on Learning Representations 2015.

Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning. pages 2342–2350.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. pages 388–395.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. pages 177–180.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pages 1249–1258.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1412–1421.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010). pages 1045–1048.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pages 1003–1011.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pages 529–533.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. pages 311–318.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. arXiv preprint arXiv:1211.5063.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation. pages 83–91.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 1715–1725.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 1526–1534.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 299–305.

WAT. 2016. http://lotus.kuee.kyoto-u.ac.jp/WAT/baseline/dataPreparationJE.html.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Barret Zoph, Ashish Vaswani, Jonathan May, and Kevin Knight. 2016. Simple, fast noise-contrastive estimation for large RNN vocabularies. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.