Variational Recurrent Neural Machine Translation
Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, Biao Zhang
Xiamen University, Xiamen, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; Soochow University, Suzhou, China
∗ Corresponding author: Deyi Xiong.

Abstract
Partially inspired by successful applications of variational recurrent neural networks, we propose a novel variational recurrent neural machine translation (VRNMT) model in this paper. Different from the variational NMT, VRNMT introduces a series of latent random variables to model the translation procedure of a sentence in a generative way, instead of a single latent variable. Specifically, the latent random variables are included into the hidden states of the NMT decoder with elements from the variational autoencoder. In this way, these variables are recurrently generated, which enables them to further capture strong and complex dependencies among the output translations at different timesteps. In order to deal with the challenges in performing efficient posterior inference and large-scale training during the incorporation of latent variables, we build a neural posterior approximator, and equip it with a reparameterization technique to estimate the variational lower bound. Experiments on Chinese-English and English-German translation tasks demonstrate that the proposed model achieves significant improvements over both the conventional and variational NMT models.
1. Introduction
Recently, neural machine translation (NMT) has gradually established state-of-the-art results over statistical machine translation (SMT) on various language pairs. Most NMT models consist of two recurrent neural networks (RNNs): a bidirectional RNN based encoder that transforms the source sentence $\mathbf{x} = \{x_1, x_2, \ldots, x_{T_x}\}$ into a hidden state sequence, and a decoder that generates the corresponding target sentence $\mathbf{y} = \{y_1, y_2, \ldots, y_{T_y}\}$ by exploiting source-side contexts via an attention network (Bahdanau, Cho, and Bengio 2015). This attentional neural encoder-decoder framework has now become the dominant architecture for NMT.

Within this framework, semantic representations of source and target sentences are learned in an implicit way. As a result, the learned semantic representations are far from being sufficient for capturing all semantic details and dependencies (Sutskever, Vinyals, and Le 2014; Tu et al. 2016). To complement the insufficiency of semantic representations of NMT, Zhang et al. (2016) present variational NMT (VNMT), which incorporates a latent random variable into NMT, serving as a global semantic signal for generating good translations. However, the internal transition structure of the RNN is entirely deterministic, and hence this implementation may not be an effective way to model the high variability observed in structured data, such as language modeling and machine translation (Chung et al. 2015). Therefore, the potential of VNMT is limited, and how to better improve NMT with latent variables is still open for further exploration.

In this paper, we propose a variational recurrent NMT (VRNMT) model to deal with the above-mentioned problem, motivated by the recent success of the variational recurrent neural network (VRNN) (Chung et al. 2015). It is illustrated in Fig. 1. VRNMT explicitly models the underlying semantics of bilingual sentence pairs, which are then exploited to refine translation. However, instead of only employing a single latent variable to capture the global semantics of each parallel sentence, we assume that there is a continuous latent random variable sequence $\mathbf{z} = \{z_1, z_2, \ldots, z_{T_y}\}$ in the underlying semantic space, where the iteratively generated variable $z_j$ participates in the generation of each target word $y_j$ and hidden state $s_{j+1}$. Formally, the conditional probability $p(\mathbf{y}\,|\,\mathbf{x})$ is decomposed as follows:

$$p(\mathbf{y}\,|\,\mathbf{x}) = \prod_{j=1}^{T_y} p(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j})$$

2. Background

In this section, we briefly describe the attention-based NMT model and VRNN, which provide background knowledge for the proposed model.

Attention-based NMT

Currently, the dominant NMT model mainly consists of a neural encoder and a neural decoder with an attention network (Bahdanau, Cho, and Bengio 2015).

Generally, the encoder is a bidirectional RNN learning hidden representations of a source sentence in the forward and backward directions. The learned hidden states in the two directions are then concatenated to form source annotations $\{h_i = [\overrightarrow{h}_i^{\top}; \overleftarrow{h}_i^{\top}]^{\top}\}$, where $h_i$ encodes the contextual semantics of the $i$-th word with respect to all other surrounding source words.

Likewise, the decoder is a forward RNN that adopts a nonlinear function $g(\cdot)$ to sequentially generate the translation $\mathbf{y}$ as $p(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j}) = g(y_{j-1}, s_j, c_j)$, where $s_j$ denotes the $j$-th decoder hidden state and $c_j$ the attention-derived source context vector.
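To make this background concrete, the following is a minimal NumPy sketch of a single attention-plus-decoder step. It is not taken from the paper: the additive scoring function, the toy dimensions, and all names (`attention_context`, `decoder_step`, and the weight matrices) are our own illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, Wa, Ua, va):
    """Additive (Bahdanau-style) attention: score each source
    annotation h_i against the previous decoder state, then
    return the attention-weighted context vector c_j."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])
    alpha = softmax(scores)                  # attention weights over source words
    return (alpha[:, None] * H).sum(axis=0)  # context vector c_j

def decoder_step(y_prev, s_prev, c_j, W, U, C):
    """A simplified decoder recurrence standing in for g(.):
    s_j = tanh(W y_{j-1} + U s_{j-1} + C c_j)."""
    return np.tanh(W @ y_prev + U @ s_prev + C @ c_j)

# toy sizes: 4 source words, hidden size 8, embedding size 6
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))        # source annotations h_1..h_4
s = rng.normal(size=8)             # previous decoder state s_{j-1}
y_emb = rng.normal(size=6)         # embedding of the previous target word
Wa, Ua, va = rng.normal(size=(8, 8)), rng.normal(size=(8, 8)), rng.normal(size=8)
W, U, C = rng.normal(size=(8, 6)), rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

c = attention_context(s, H, Wa, Ua, va)
s_next = decoder_step(y_emb, s, c, W, U, C)
print(s_next.shape)  # (8,)
```

In the real model, $g(\cdot)$ is a full gated recurrent unit followed by a softmax over the target vocabulary; the sketch keeps only the attention arithmetic.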
Variational Recurrent Neural Network

Prior, Generation and Recurrence. Based on the hidden state $h_{t-1}$ of the RNN, VRNN first produces a latent semantic variable $z_t$, which is then used to guide the generation of the hidden state $h_t$ and word $x_t$ at the $t$-th timestep. In doing so, the temporal structure of sequential data is exploited for VRNN modeling.

Different from the standard VAE, where the prior on the latent random variable follows a standard Gaussian distribution, VRNN assumes that $z_t$ obeys the following Gaussian with parameters $\mu_{0,t}$ and $\sigma_{0,t}$:

$$z_t \sim \mathcal{N}\big(\mu_{0,t},\, \mathrm{diag}(\sigma_{0,t}^2)\big) \quad (5)$$

where $\mu_{0,t}$ and $\sigma_{0,t}$ can be produced by any highly flexible neural networks. Moreover, the generation distribution of $x_t$ is conditioned on both $z_t$ and $h_{t-1}$ such that

$$x_t\,|\,z_t \sim \mathcal{N}\big(\mu_{x,t},\, \mathrm{diag}(\sigma_{x,t}^2)\big) \quad (6)$$

where $\mu_{x,t}$ and $\sigma_{x,t}$ are the parameters of the generation distribution. Note that they can also be computed by any highly flexible neural network.

Then, we introduce $z_t$ to update the hidden state $h_t$ in a recurrent way:

$$h_t = f_\theta(h_{t-1}, x_t, z_t) \quad (7)$$

Finally, the parameterization of the generative model can be factorized as follows:

$$p_\theta(\mathbf{x}) = \int_{\mathbf{z}} p_\theta(\mathbf{x}, \mathbf{z})\, d\mathbf{z} \quad (8)$$

$$p_\theta(\mathbf{x}, \mathbf{z}) = \prod_{t=1}^{T} p(x_t\,|\,\mathbf{x}_{<t}, \mathbf{z}_{\leq t})\, p(z_t\,|\,\mathbf{x}_{<t}, \mathbf{z}_{<t})$$
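As an illustration of Eqs. (5)-(8), here is a minimal NumPy sketch of the VRNN generative story unrolled for a few timesteps. The single-layer tanh networks, toy dimensions, and parameter names are our own simplifying assumptions, not the architecture of Chung et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(W, b, x):
    """One-layer tanh network standing in for a flexible neural net."""
    return np.tanh(W @ x + b)

d_x, d_h, d_z = 6, 8, 4
# assumed parameter shapes for one VRNN layer (toy initialization)
Wp, bp = rng.normal(size=(2 * d_z, d_h)), np.zeros(2 * d_z)        # prior net
Wg, bg = rng.normal(size=(2 * d_x, d_h + d_z)), np.zeros(2 * d_x)  # generation net
Wh, bh = rng.normal(size=(d_h, d_h + d_x + d_z)), np.zeros(d_h)    # recurrence f_theta

def vrnn_step(h_prev):
    """One generative VRNN step: sample z_t from a prior conditioned
    on h_{t-1} (Eq. 5), generate x_t from (z_t, h_{t-1}) (Eq. 6),
    then update the hidden state h_t = f_theta(h_{t-1}, x_t, z_t) (Eq. 7)."""
    stats = mlp(Wp, bp, h_prev)
    mu_z, sigma_z = stats[:d_z], np.exp(stats[d_z:])
    z_t = mu_z + sigma_z * rng.standard_normal(d_z)
    stats = mlp(Wg, bg, np.concatenate([h_prev, z_t]))
    mu_x, sigma_x = stats[:d_x], np.exp(stats[d_x:])
    x_t = mu_x + sigma_x * rng.standard_normal(d_x)
    h_t = mlp(Wh, bh, np.concatenate([h_prev, x_t, z_t]))
    return x_t, z_t, h_t

h = np.zeros(d_h)
for t in range(3):   # unroll three timesteps of the generative story
    x, z, h = vrnn_step(h)
    print(t, x.round(2))
```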
3. Our Model

In this section, we extend VNMT into VRNMT by adapting VRNN to NMT. In VRNMT, the semantic dependencies between adjacent target words can be captured to refine translation. Formally, the variational lower bound of VRNMT is defined as follows:

$$\mathcal{L}_{\mathrm{VRNMT}}(\mathbf{y}\,|\,\mathbf{x}; \theta, \phi) = \sum_{j=1}^{T_y} \mathcal{L}_{\mathrm{VRNMT}}(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j}; \theta, \phi)$$

As described previously, the key of variational models lies in how to model the distributions related to latent random variables. With respect to VRNMT, we focus on how to model the posterior $q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j})$ and the prior $p_\theta(z_j\,|\,\mathbf{x}, \mathbf{y}_{<j})$.

Neural Posterior Approximator. Under the assumption that the posterior $q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j})$ follows a multivariate Gaussian distribution with a diagonal covariance structure, we apply neural networks to simulate the posterior model. Concretely, we compute $q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j})$ as

$$q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j}) = \mathcal{N}\big(z_j;\, \mu_j(\mathbf{x}, \mathbf{y}_{\leq j}),\, \sigma_j^2(\mathbf{x}, \mathbf{y}_{\leq j})\,\mathbf{I}\big) \quad (13)$$

As illustrated in Fig. 1, the mean $\mu_j$ and standard deviation $\sigma_j$ are computed by neural networks conditioned on $\mathbf{x}$ and $\mathbf{y}_{\leq j}$. Obviously, the key to estimating $z_j$ is how to calculate $\mu_j$ and $\sigma_j$. To this end, we first apply an element-wise activation function $g(\cdot)$ to perform a nonlinear transformation projecting $y_{j-1}$, $s_j$, $c_j$ and $y_j$ onto our latent semantic space:

$$h_j^z = g\big(W_\phi^z [y_{j-1}; s_j; c_j; y_j] + b_\phi^z\big) \quad (14)$$

Here, $W_\phi^z$ and $b_\phi^z$ are the parameter matrix and bias term, respectively. Finally, we introduce linear regressions with parameters $W_\phi^\mu$, $W_\phi^\sigma$, $b_\phi^\mu$ and $b_\phi^\sigma$ to obtain the $d_z$-dimensional vectors $\mu_j$ and $\log \sigma_j$ as follows:

$$\mu_j = W_\phi^\mu h_j^z + b_\phi^\mu \quad (15)$$
$$\log \sigma_j = W_\phi^\sigma h_j^z + b_\phi^\sigma \quad (16)$$

To obtain a representation for the latent variable $z_j$, we follow the implementation of VAE to reparameterize it as $z_j = \mu_j + \sigma_j \odot \epsilon,\ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Intuitively, this reparameterization procedure bridges the gap between the generative model $p_\theta(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j}, z_j)$ and the posterior approximator $q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j})$, allowing gradients to flow through the sampling step.

Neural Prior Model. Except for the absence of $y_j$, the neural model for the prior $p_\theta(z_j\,|\,\mathbf{x}, \mathbf{y}_{<j})$ is similar to the posterior approximator described above.

Objective. The final objective for one bilingual sentence $(\mathbf{x}, \mathbf{y})$ involves the following two parts:

$$\mathcal{L}_{\mathrm{VRNMT}}(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j}; \theta, \phi) = -\mathrm{KL}\big(q_\phi(z_j\,|\,\mathbf{x}, \mathbf{y}_{\leq j})\,\|\,p_\theta(z_j\,|\,\mathbf{x}, \mathbf{y}_{<j})\big) + \mathbb{E}_{q_\phi(z_j|\mathbf{x}, \mathbf{y}_{\leq j})}\big[\log p_\theta(y_j\,|\,\mathbf{x}, \mathbf{y}_{<j}, z_j)\big]$$
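Before turning to the experiments, a minimal PyTorch sketch of the posterior approximator of Eqs. (14)-(16) and the reparameterized sample $z_j$ may be helpful. The class and its names are our own; we assume $g(\cdot)$ is tanh, and for brevity the closed-form KL term at the end is computed against a standard normal prior, whereas VRNMT uses the learned neural prior.

```python
import torch
import torch.nn as nn

class PosteriorApproximator(nn.Module):
    """Sketch of Eqs. (14)-(16): project [y_{j-1}; s_j; c_j; y_j] into the
    latent space, then regress the Gaussian parameters mu_j and log sigma_j."""
    def __init__(self, d_emb, d_hid, d_ctx, d_z):
        super().__init__()
        self.proj = nn.Linear(2 * d_emb + d_hid + d_ctx, d_z)  # W_phi^z, b_phi^z
        self.mu = nn.Linear(d_z, d_z)                          # W_phi^mu, b_phi^mu
        self.log_sigma = nn.Linear(d_z, d_z)                   # W_phi^sigma, b_phi^sigma

    def forward(self, y_prev, s_j, c_j, y_j):
        h_z = torch.tanh(self.proj(torch.cat([y_prev, s_j, c_j, y_j], dim=-1)))
        mu, log_sigma = self.mu(h_z), self.log_sigma(h_z)
        eps = torch.randn_like(mu)         # epsilon ~ N(0, I)
        z_j = mu + log_sigma.exp() * eps   # reparameterization: z = mu + sigma * eps
        return z_j, mu, log_sigma

# toy usage with the dimensions reported in the experiments section
post = PosteriorApproximator(d_emb=620, d_hid=1000, d_ctx=2000, d_z=2000)
y_prev, y_j = torch.randn(1, 620), torch.randn(1, 620)
s_j, c_j = torch.randn(1, 1000), torch.randn(1, 2000)
z, mu, log_sigma = post(y_prev, s_j, c_j, y_j)
# closed-form KL(q || N(0, I)), as in Kingma and Welling (2014); the full
# model instead uses the KL against the learned prior p_theta(z_j | x, y_<j)
kl = 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 1 - 2 * log_sigma).sum()
print(kl.item())
```

Because the sample is expressed as a deterministic function of $(\mu_j, \sigma_j)$ plus noise, the whole network remains differentiable and can be trained end-to-end.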
4. Experiments

We conducted experiments on Chinese-English and English-German translation to examine the effectiveness of our model.

Our Chinese-English training data consists of 1.25M LDC sentence pairs, with 27.9M Chinese words and 34.5M English words, respectively. We used the NIST MT02 dataset as the validation set, and the NIST MT03/04/05/06 datasets as the test sets. In English-German translation, our training data consists of 4.46M sentence pairs with 116.1M English words and 108.9M German words. We used news-test 2013 as the validation set and news-test 2015 as the test set. Following Sennrich et al. (2016), we adopted byte pair encoding to segment words into subwords for English-German translation. Finally, we used BLEU (Papineni et al. 2002) as our evaluation metric, and performed paired bootstrap sampling (Koehn 2004) for statistical significance testing using the Moses script.

We set the maximum length of training sentences to be 50 words, and preserved the most frequent 30K (Chinese-English) and 50K (English-German) words as both the source and target vocabulary, covering approximately 97.4%/100.0% and 99.3%/98.2% of the source/target sides of the two parallel corpora, respectively. All other words were replaced with a specific token "UNK". We applied RMSprop (Graves 2013) with iterNum=5, momentum=0, ρ=0.95, and ε=1×10^{−…} to train the various NMT models. The settings of our model were the same as in (Bahdanau, Cho, and Bengio 2015), except for some hyper-parameters specific to our model. Specifically, we set the word embedding dimension to 620, the hidden layer size to 1000, the learning rate to …, the batch size to 80, the gradient norm to 1.0, and the dropout rate to 0.3. Particularly, we initialized the parameters of VRNMT with the trained conventional NMT model. As implemented in VAE, we set the sampling number L=1, and d_e = d_z = 2d_f = 2000 according to preliminary experiments. During decoding, we used the beam-search algorithm, and set the beam size of all models to 10.

We compared our model against the following systems:

(1) Moses: an open source phrase-based SMT system with default settings and a 4-gram language model trained on the target portion of the training data.

(2) DL4MT: our re-implementation of the attention-based NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes from the dl4mt tutorial (https://github.com/nyu-dl/dl4mt-tutorial).

(3) VNMT: a variational NMT system (Zhang et al. 2016) that incorporates a continuous latent variable to model the underlying semantics of sentence pairs.

(4) VRNMT(-TD): a variant of our model without introducing temporal dependencies between the latent random variables. It differs from our model in that the input of the posterior model contains only $y_j$ but not $y_{j-1}$, $s_j$, $c_j$; more specifically, we removed $y_{j-1}$, $s_j$, and $c_j$ from Eq. (14). Thus, the latent variables of VRNMT(-TD) directly obey the standard Gaussian distribution rather than depending on the output at the previous timestep. As we incorporate temporal dependencies into the prior, we will directly study the impact of the latent random variables on modeling the variability characterized by dependencies among output words in comparison to VRNMT(-TD).
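As a side note on the evaluation protocol above, the paired bootstrap test of Koehn (2004) can be sketched as follows; the function names and the toy unigram-precision metric standing in for BLEU are our own assumptions.

```python
import random

def paired_bootstrap(metric, sys_a, sys_b, refs, n_resamples=1000, seed=1):
    """Paired bootstrap resampling (Koehn, 2004): resample the test set
    with replacement and count how often system A beats system B.
    `metric(hyps, refs)` is any corpus-level score, e.g. corpus BLEU."""
    rng = random.Random(seed)
    n, wins_a = len(refs), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        a = metric([sys_a[i] for i in idx], [refs[i] for i in idx])
        b = metric([sys_b[i] for i in idx], [refs[i] for i in idx])
        wins_a += a > b
    return wins_a / n_resamples  # significance if > 0.95 or > 0.99

# toy stand-in metric: corpus-level unigram precision
def unigram_precision(hyps, refs):
    match = total = 0
    for h, r in zip(hyps, refs):
        hw, rw = h.split(), set(r.split())
        match += sum(w in rw for w in hw)
        total += len(hw)
    return match / max(total, 1)

hyps_a = ["the cat sat on the mat", "he reads a book"]
hyps_b = ["cat the sat mat", "he book"]
refs = ["the cat sat on the mat", "he is reading a book"]
print(paired_bootstrap(unigram_precision, hyps_a, hyps_b, refs))
```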
Results on Chinese-English Translation

In addition to the above systems for comparison, we also display the BLEU scores of several recent NMT models (Tu et al. 2016; Meng et al. 2016; Wang et al. 2016; Zhang et al. 2017) that have been trained on the same training corpus as ours. COVERAGE (Tu et al. 2016) presented a coverage model to alleviate the over-translation and under-translation problems; InterAtten (Meng et al. 2016) exploited a readable and writable attention mechanism to record interaction history in decoding; MemDec (Wang et al. 2016) introduced external memory to improve translation quality; and DMAtten (Zhang et al. 2017) explicitly incorporated word reordering knowledge into the attention model of NMT. Note that all these studies focus on capturing semantic information for NMT.

System       MT03    MT04    MT05    MT06    Ave.
COVERAGE     34.49   38.34   34.91   34.25   35.50
InterAtten   35.09   37.73   35.53   34.32   35.67
MemDec       36.16   39.81   35.91   35.98   36.97
DMAtten      –       –       –       –       –
VRNMT        –∗,++   –∗∗,++  –∗∗,++  –∗,++   –

Table 1: Case-insensitive BLEU scores of Chinese-English translation. ∗/∗∗ and +/++: significant over VNMT and VRNMT(-TD) at 0.05/0.01, respectively.

Table 1 shows case-insensitive BLEU scores on the Chinese-English datasets. Overall, VRNMT significantly improves translation quality on all test sets, achieving gains of 5.66, 1.42, 0.78 and 1.0 BLEU points over Moses, DL4MT, VNMT and VRNMT(-TD), respectively. VRNMT also compares favorably with the existing NMT models, as shown in Table 1. These results echo the results reported in (Zhang et al. 2016), indicating that the integration of latent variables is effective for improving NMT. Particularly, VRNMT performs significantly better than VRNMT(-TD), indicating that explicitly modeling the temporal dependencies between latent random variables indeed further benefits NMT.

Results on Source Sentences with Different Lengths

Further, we carried out experiments to investigate our model on different groups of the test sets, which are divided according to the lengths of source sentences. Figure 3 shows that our system outperforms the others over sentences with different length spans.

[Figure 3: BLEU scores over different lengths of translated sentences, grouped by source-sentence length ((0,10] through (50,100]), for Moses, DL4MT, VNMT, VRNMT(-TD) and VRNMT.]

Analysis on Over-Translation

As mentioned in (Tu et al. 2016), over-translation is one of the big challenges for NMT. Here we followed Zhang et al. (2017) to evaluate the over-translations generated by different NMT models. Concretely, we directly used the N-Gram Repetition Rate (N-GRR) metric (Zhang, Xiong, and Su 2017) to calculate the portion of repeated n-grams in a sentence as follows:

$$\text{N-GRR} = \frac{1}{C \cdot R} \sum_{c=1}^{C} \sum_{r=1}^{R} \frac{|\text{N-grams}_{c,r}| - |u(\text{N-grams}_{c,r})|}{|\text{N-grams}_{c,r}|} \quad (27)$$

where $|\text{N-grams}_{c,r}|$ is the number of total n-grams in the $r$-th translation of the $c$-th sentence in the test corpus, and $|u(\text{N-grams}_{c,r})|$ denotes the number of n-grams after duplicate n-grams are removed. By comparing the N-GRR scores of translations against those of references, we can roughly gauge how serious the over-translation problem is.

System       1-Gram  2-Gram  3-Gram  4-Gram
Reference    12.94   1.80    0.93    1.29
DL4MT        19.62   5.34    2.96    2.31
VNMT         19.45   5.24    2.93    2.29
VRNMT(-TD)   19.54   5.25    2.93    2.35
VRNMT        –       –       –       –

Table 3: Evaluation of over-translation. The lower the score, the better the system deals with the over-translation problem.

Table 3 gives the final results. We find that our model is able to deal with the over-translation issue better than the other models.
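A direct implementation of Eq. (27) is straightforward. The sketch below is ours, and it reports scores as percentages on the assumption that Table 3 does the same.

```python
def n_grr(translations, n=2):
    """N-gram repetition rate, Eq. (27): the fraction of n-grams in each
    translation that are duplicates, averaged over all (sentence, translation)
    pairs. `translations[c][r]` is the r-th translation of the c-th sentence."""
    total = 0.0
    count = 0
    for sent_translations in translations:
        for hyp in sent_translations:
            toks = hyp.split()
            ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
            if not ngrams:
                continue
            # |N-grams| - |u(N-grams)| counts the duplicated n-grams
            total += (len(ngrams) - len(set(ngrams))) / len(ngrams)
            count += 1
    return 100.0 * total / max(count, 1)  # scaled to a percentage (assumption)

# toy usage: one test sentence, one translation with an obvious repetition
print(n_grr([["the five permanent members of the the security council"]], n=1))
```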
Analysis on Attention Results

The attention model heavily depends on target-side hidden state vectors, which are in turn dependent on the previous latent random variables in our model, as illustrated in Eqs. (22)-(25). Therefore, if the latent variables are helpful for the calculation of target-side hidden state vectors, the attention model can also be improved accordingly. To verify this, we conducted experiments on the evaluation dataset provided by Liu and Sun (2015), which contains 900 manually aligned Chinese-English sentence pairs. Specifically, we first forced the decoder to output reference translations so as to obtain word alignments between input sentences and their reference translations according to the attention weights. Then, we used the alignment error rate (AER) (Och and Ney 2003) and the soft version of AER (SAER) (Tu et al. 2016) to evaluate alignment performance.

System       AER    SAER
DL4MT        50.07  63.42
VNMT         49.23  62.28
VRNMT(-TD)   49.95  63.17
VRNMT        –      –

Table 4: Evaluation of word alignment quality. The lower the score is, the better the word alignments are.

From Table 4, we can conclude that the incorporation of latent variables also improves the attention model as expected.

Case Study

To understand why our model outperforms the others, we compared and analyzed their 1-best translations. Table 2 provides a translation example with its various translations.

Source: píngrǎng cǎiqǔ shàngshù xíngdòng zhīhòu sì tiān, liánhéguó ānquán lǐshìhuì de wǔ gè chángrèn lǐshìguó dōu wèi cǐ yī wēijī cǎiqǔ yùfángxìng wàijiāo xíngdòng.
Reference: four days after pyongyang adopted the aforesaid action, the five permanent members of united nations security council have all taken preemptive diplomatic actions for the crisis.
Moses: pyongyang by four days after the operation of the un security council, the five permanent members to adopt preventive diplomacy this crisis.
DL4MT: the four permanent member states of the united nations security council and the five permanent members of the un security council have adopted a preventive diplomatic action following the four-day operation.
VNMT: four days after north korea took the above actions, the five permanent members of the un security council have adopted preventive diplomatic activities.
VRNMT(-TD): four permanent members of the security council of the united nations security council have taken preventive diplomatic actions during the four-day period following the above actions.
VRNMT: four days after pyongyang took the action, the five permanent members of the un security council have adopted preventive diplomatic actions for the crisis.

Table 2: Translation examples of different systems. In the original typesetting, words are highlighted to mark spans that are not fluently translated, incorrectly translated, over-translated, or under-translated.

We found that the translation produced by Moses is less fluent than those of the NMT systems. In addition to the issues of incorrect translation and over-translation, the first three NMT systems (DL4MT, VNMT, VRNMT(-TD)) do not adequately convey the meaning of the source sentence, as some source phrases have not been translated at all, such as "wèi cǐ yī wēijī (for this crisis)". By contrast, due to the advantage of modeling long-distance dependencies among target words, VRNMT is able to produce a more complete, fluent, and accurate translation.

Results on English-German Translation

We also carried out experiments on English-German translation. Results are shown in Table 5. We provide the results of previous work (Chung, Cho, and Bengio 2016; Yang et al. 2017; Gehring et al. 2017) on this dataset too.

System       BLEU
BPEChar      23.9
RecAtten     25.0
ConvEncoder  24.2
Moses        20.54
DL4MT        24.88
VNMT         25.49
VRNMT(-TD)   25.34
VRNMT        25.93∗,++

Table 5: Case-sensitive BLEU scores of English-German translation. We directly display the results of the first three models as provided in (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) presented a character-level decoder for NMT, RecAtten (Yang et al. 2017) introduced a recurrent attention model to better capture source-side context for NMT, and ConvEncoder (Gehring et al. 2017) explored a convolutional encoder to encode the source sentence.

Specifically, VRNMT still outperforms Moses, DL4MT, VNMT and VRNMT(-TD), achieving gains of 5.39, 1.05, 0.44 and 0.59 BLEU points, respectively. Additionally, VRNMT reaches a performance level that is competitive with, or higher than, several recent NMT systems. Note that our approach is orthogonal to these previous models and can therefore be adapted to them. We leave this adaptation to future work.

5. Related Work

The previous studies related to our work mainly involve NMT and variational neural models.

NMT. Most NMT models focus on how to translate a source sentence into a target sentence with an encoder-decoder neural network (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014). To overcome the deficiency of encoding all source-side information into a fixed-length vector, Bahdanau et al. (2015) proposed attention-based NMT, which has now become the dominant architecture. However, this model usually suffers from attention failures, which often lead to undesirable translations. Therefore, many researchers have resorted to better attention mechanisms (Luong, Pham, and Manning 2015; Cheng et al. 2016; Tu et al. 2016; Feng et al. 2016; Meng et al. 2016; Calixto, Liu, and Campbell 2017), more effective neural networks (Wang et al. 2016; Gehring et al. 2017; Wang et al. 2017), or the exploitation of external knowledge (Chen et al. 2017; Li et al. 2017; Zhang et al. 2017). All these models are designed within the discriminative encoder-decoder framework, leaving the explicit exploration of underlying semantics an open problem. To combine the strengths of discriminative and generative modeling, Zhang et al. (2016) presented VNMT, which incorporates a continuous latent variable to model the underlying semantics of sentence pairs.

Variational Neural Networks. Kingma and Welling (2014), as well as Rezende et al. (2014), focused on variational neural networks, which are effective in the inference and learning of directed probabilistic models on large-scale datasets. Typically, these models introduce a neural inference model to approximate the intractable posterior, and optimize model parameters jointly with a reparameterized variational lower bound. Further, Kingma et al. (2014b) adapted these models to semi-supervised learning. Chung et al. (2015) incorporated latent variables into the hidden states of a recurrent neural network, while Gregor et al. (2015) combined a novel spatial attention mechanism that mimics the foveation of the human eye with a sequential variational auto-encoding framework that allows the iterative construction of complex images. Miao et al. (2016) proposed a generic variational inference framework for generative and conditional models of text.

Both Zhang et al. (2016) and Chung et al. (2015) are the most related to our work. In our model, we extend VNMT (Zhang et al. 2016) to a recurrent framework, which proves to be more effective for machine translation. Besides, different from Chung et al. (2015), who work on speech generation and handwriting generation, we introduce a sequence of recurrent latent variables for the semantic modeling of NMT, which, to the best of our knowledge, has never been investigated before.

6. Conclusions and Future Work

This paper has presented a variational recurrent NMT model that introduces a sequence of continuous latent variables to capture the underlying semantics of sentence pairs.
Similar to VNMT, we approximate the posterior distribution with neural networks and reparameterize the variational lower bound. In doing so, our model becomes an end-to-end neural network that can be optimized through stochastic gradient algorithms. Compared with the dominant NMT and VNMT, our model not only captures the global semantic contexts but also models strong and complex dependencies among the generated words at different timesteps. Experiments on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model.

Our future work includes the following aspects. We will study how to better exploit latent variables to further improve NMT. Additionally, we are also interested in applying our model to other similar tasks that use the encoder-decoder framework, such as neural text summarization and neural dialogue generation.

Acknowledgments

The authors were supported by the National Natural Science Foundation of China (Nos. 61672440, 61622209 and 61573294) and the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49). We also thank the reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR 2015.

Calixto, I.; Liu, Q.; and Campbell, N. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proc. of ACL 2017, 1913–1924.

Chen, H.; Huang, S.; Chiang, D.; and Chen, J. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In Proc. of ACL 2017, 1936–1945.

Cheng, Y.; Shen, S.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proc. of IJCAI 2016, 2761–2767.

Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP 2014, 1724–1734.

Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In Proc. of NIPS 2015.

Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL 2016, 1693–1703.

Feng, S.; Liu, S.; Yang, N.; Li, M.; Zhou, M.; and Zhu, K. Q. 2016. Improving attention modeling with implicit distortion and fertility for machine translation. In Proc. of COLING 2016, 3082–3092.

Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL 2017, 123–135.

Graves, A. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850.

Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. In Proc. of ICML 2015, 1462–1471.

Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP 2013, 1700–1709.

Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In Proc. of ICLR 2014.

Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014b. Semi-supervised learning with deep generative models. In Proc. of NIPS 2014, 3581–3589.

Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In Proc. of EMNLP 2004, 388–395.

Li, J.; Xiong, D.; Tu, Z.; Zhu, M.; Zhang, M.; and Zhou, G. 2017. Modeling source syntax for neural machine translation. In Proc.
of ACL 2017, 688–697.

Liu, Y., and Sun, M. 2015. Contrastive unsupervised word alignment with non-local features. In Proc. of AAAI 2015, 857–868.

Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP 2015, 1412–1421.

Meng, F.; Lu, Z.; Li, H.; and Liu, Q. 2016. Interactive attention for neural machine translation. In Proc. of COLING 2016, 2174–2185.

Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In Proc. of ICML 2016, 1727–1736.

Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL 2002, 311–318.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML 2014, 1278–1286.

Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL 2016, 1715–1725.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS 2014, 3104–3112.

Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL 2016, 76–85.

Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memory-enhanced decoder for neural machine translation. In Proc. of EMNLP 2016, 278–286.

Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL 2017, 136–145.

Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL 2017, 383–387.

Zhang, B.; Xiong, D.; Su, J.; Duan, H.; and Zhang, M. 2016. Variational neural machine translation. In Proc. of EMNLP 2016, 521–530.

Zhang, J.; Wang, M.; Liu, Q.; and Zhou, J. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL 2017, 1524–1534.

Zhang, B.; Xiong, D.; and Su, J. 2017. A GRU-gated attention model for neural machine translation. arXiv:1704.08430.