Future-Prediction-Based Model for Neural Machine Translation
Bingzhen Wei∗  Junyang Lin∗
MOE Key Lab of Computational Linguistics, School of EECS, Peking University
School of Foreign Languages, Peking University
{weibz, linjunyang}@pku.edu.cn
∗ Equal Contribution

Abstract
We propose a novel model for Neural Machine Translation (NMT). Different from the conventional method, our model can predict the future text length and words at each decoding time step, so that generation is guided by information from this future prediction. With such information, the model does not stop generation without having translated enough content. Experimental results demonstrate that our model can significantly outperform the baseline models. Besides, our analysis shows that our model is effective in predicting the length and words of the untranslated content.
Recent research in machine translation focuses on Neural Machine Translation, whose most common baseline is the sequence-to-sequence (Seq2Seq) model (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) with attention mechanism (Bahdanau et al., 2014; Luong et al., 2015; Tu et al., 2016; Mi et al., 2016a; Meng et al., 2016; Xiong et al., 2017; Vaswani et al., 2017; Mi et al., 2016b; Lin et al., 2018c,b,a). In the Seq2Seq model, the encoder encodes the source text into a representation, and the decoder decodes it into a translation that approximates the target. However, a salient drawback of this mechanism is that decoding must follow the sequential order and therefore cannot take the untranslated content into consideration. Without information about the untranslated content, the translation may end up with faults on the semantic level (e.g., the translation ends by mistake with content left untranslated). Information about the "future generation" can provide guidance for the present generation, which helps keep the translation faithful to the source text.
To tackle the problem, we propose a novel model that aims to provide the decoder with information about the untranslated content. Based on the conventional attention-based sequence-to-sequence (Seq2Seq) model, we implement a novel decoder that is able to generate more than the present word. At each time step, the model produces a conjecture of the bag of the following words (e.g., suppose the model is to generate the sentence "the new plan can boost the economy"; when it has generated "the new plan", it can predict the bag of the following words, namely {can, boost, the, economy}). Moreover, the decoder can also predict the length of the untranslated content, so as to make sure that the translation does not end without having translated all the source information. Our proposed model is effective in generating translation with the help of the predicted bag of words and text length of the untranslated content.

Our contributions are summarized as below: (1) we propose a novel model for NMT that targets the prediction of the untranslated content, which helps the system generate translations that are faithful to the source text; (2) experimental results demonstrate that our model can significantly outperform the baseline models; (3) our analysis shows that our model is effective in predicting the words and text length of the untranslated content.

In the following, we introduce the details of our model, including the basic attention-based Seq2Seq model and our proposed Future-Prediction-Based model.
In our model, the encoder, a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), reads the embeddings of the input text sequence $x = \{x_1, \dots, x_n\}$ and encodes a sequence of source annotations $h = \{h_1, \dots, h_n\}$. The decoder, which is also an LSTM, decodes the final state $h_n$ into a new sequence that approximates the target, with the application of the conventional attention mechanism (Bahdanau et al., 2014). The model is trained by maximum likelihood estimation (MLE) to minimize the difference between the generation and the target.
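As a rough illustration of this backbone, the following is a minimal PyTorch sketch of a bidirectional-LSTM encoder and an attentional LSTM decoder. It is not our actual implementation: all class names and dimensions are illustrative, and for brevity it uses dot-product attention rather than the additive attention of Bahdanau et al. (2014).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Bidirectional LSTM encoder producing source annotations h_1..h_n."""

    def __init__(self, vocab_size, emb_size=512, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # hidden_size // 2 per direction, so concatenated annotations have hidden_size units
        self.lstm = nn.LSTM(emb_size, hidden_size // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, src):                       # src: (batch, src_len)
        annotations, _ = self.lstm(self.embedding(src))
        return annotations                        # (batch, src_len, hidden_size)


class AttentionDecoder(nn.Module):
    """LSTM decoder attending over the source annotations at every step."""

    def __init__(self, vocab_size, emb_size=512, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.cell = nn.LSTMCell(emb_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, annotations):          # tgt: (batch, tgt_len)
        batch, hidden = tgt.size(0), annotations.size(-1)
        s = annotations.new_zeros(batch, hidden)   # decoder hidden state
        c = annotations.new_zeros(batch, hidden)   # decoder cell state C_t
        ctx = annotations.new_zeros(batch, hidden)
        logits = []
        for t in range(tgt.size(1)):
            emb = self.embedding(tgt[:, t])
            s, c = self.cell(torch.cat([emb, ctx], dim=-1), (s, c))
            # dot-product attention weights over the source annotations
            scores = torch.bmm(annotations, s.unsqueeze(-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)
            ctx = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)
            logits.append(self.out(s + ctx))       # word scores at step t
        return torch.stack(logits, dim=1)          # (batch, tgt_len, vocab_size)
```

Training would then minimize the token-level cross-entropy between these logits and the reference translation, i.e., the MLE objective mentioned above; at inference time, the predicted word is fed back as the next decoder input.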
In the following, we introduce the details of our proposed future-prediction-based decoder, including the bag-of-words (BOW) predictor and the length predictor.

On top of the output of the LSTM decoder, we implement a Bag-of-Words (BOW) predictor in order to predict the word set of the text sequence that remains to be generated. Some studies (Ma et al., 2018) show that using bag-of-words targets can improve the performance of the model. With the objective of predicting the words in the future generation, the decoder obtains more target-side information. With this information about the future, the model is less likely to repeat its previous generation or to produce a translation far different from the target. Moreover, if the BOW predictor successfully predicts the word set, it can encourage the model not to generate words outside of that set and thus avoid mistakes. The details are as follows:

$$h_{t,k} = f_k(C_t, o_{t-1,k}) \qquad (1)$$
$$g_{t,k} = \mathrm{sigmoid}(h_{t,k}) \qquad (2)$$
$$z_{t,k} = g_{t,k} \cdot \tanh(C_t) + (1 - g_{t,k}) \cdot o_{t-1,k} \qquad (3)$$
$$o_{t,k} = \mathrm{Attention}(z_{t,k}, \mathrm{context}) \qquad (4)$$
$$p_{t,k} = \mathrm{softmax}(W o_{t,k}) \qquad (5)$$
$$p_t = \frac{1}{k} \sum_{i=1}^{k} p_{t,i} \qquad (6)$$

where $C_t$ refers to the cell state of the LSTM decoder and $f_k(\cdot)$ refers to the $k$-th linear function. Since a single output can hardly predict all of the untranslated words, the model generates $k$ outputs for improved prediction. The averaged $p_t$ is the predicted probability distribution over the untranslated words, which is used to compute the loss below.

As for the representation of the target word set, we use a one-hot-style representation that assigns $1/m$ to the indices of the target words and 0 to all other entries, where $m$ refers to the number of words in the target set. The model can therefore be trained by minimizing the negative log-likelihood, where the loss $\mathcal{L}_{BOW}$ is:

$$\mathcal{L}_{BOW} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{1}{m} \log P\big(y^{(i)}_{>t} \mid \tilde{y}^{(i)}_{\le t}, x^{(i)}\big)$$

where $y^{(i)}_{>t}$ denotes the target words of the $i$-th example that remain untranslated at time step $t$.
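To make Equations (1)–(6) and $\mathcal{L}_{BOW}$ concrete, here is a minimal PyTorch sketch of how such a BOW predictor could be implemented. It is a sketch under assumptions, not our exact implementation: the form of the attention function, the initialization of $o_{0,k}$, and the layer sizes are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BOWPredictor(nn.Module):
    """Sketch of the Bag-of-Words predictor (Eqs. 1-6).

    At each decoding step, the decoder cell state C_t and the k outputs of
    the previous step o_{t-1,k} are combined into k new outputs o_{t,k},
    whose word distributions are averaged into p_t, the prediction of the
    words that remain to be translated.
    """

    def __init__(self, vocab_size, hidden_size=512, k=4):
        super().__init__()
        self.k = k
        # f_k(.): the k-th linear function over [C_t; o_{t-1,k}]          (Eq. 1)
        self.f = nn.ModuleList(
            [nn.Linear(2 * hidden_size, hidden_size) for _ in range(k)])
        self.W = nn.Linear(hidden_size, vocab_size)                      # Eq. 5

    def attention(self, query, context):
        # Stand-in for Attention(z_{t,k}, context): dot-product attention
        # over the source annotations (the exact form is an assumption).
        scores = torch.bmm(context, query.unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)
        return torch.bmm(alpha.unsqueeze(1), context).squeeze(1)

    def forward(self, cell_state, prev_outputs, context):
        # cell_state:   (batch, hidden)          decoder cell state C_t
        # prev_outputs: list of k tensors        o_{t-1,k}, each (batch, hidden)
        # context:      (batch, src_len, hidden) source annotations
        outputs, probs = [], []
        for j in range(self.k):
            h = self.f[j](torch.cat([cell_state, prev_outputs[j]], dim=-1))  # Eq. 1
            g = torch.sigmoid(h)                                             # Eq. 2
            z = g * torch.tanh(cell_state) + (1 - g) * prev_outputs[j]       # Eq. 3
            o = self.attention(z, context)                                   # Eq. 4
            outputs.append(o)
            probs.append(F.softmax(self.W(o), dim=-1))                       # Eq. 5
        p_t = torch.stack(probs, dim=0).mean(dim=0)                          # Eq. 6
        return p_t, outputs                # outputs serve as o_{t,k} for step t+1


def bow_loss(p_t, bow_target):
    """Per-step BOW loss: cross-entropy against the 1/m-weighted word-set target.

    bow_target: (batch, vocab_size) with 1/m at the indices of the m untranslated
    words and 0 elsewhere. Summing over time steps and averaging over the N
    training examples yields L_BOW.
    """
    return -(bow_target * torch.log(p_t + 1e-12)).sum(dim=-1).mean()
```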
We evaluate our model on two translation tasks.

English-German Translation: We implement our model on the WMT 2014 dataset with 4.5M sentence pairs as training data. The news-test 2013 set is our development set and the news-test 2014 set is our test set. Following Wu et al. (2016), we segment the data with byte-pair encoding (Sennrich et al., 2016) and extract the most frequent 50K words for the dictionary.

English-Vietnamese Translation: Following Luong and Manning (2015), we use the same preprocessed data for this task, with 133K sentence pairs (Cettolo et al., 2015) for training. The TED tst2012 set (1553 sentences) and the TED tst2013 set (1268 sentences) are our development and test sets, respectively. We preserve casing, and we set the English dictionary size to 17K words and the Vietnamese dictionary size to 7K words. The case-sensitive BLEU score (Papineni et al., 2002) is the evaluation metric.

We implement the models in PyTorch on an NVIDIA 1080Ti GPU. Both the word embedding size and the number of hidden units are 512, and the batch size is 64. We use the Adam optimizer (Kingma and Ba, 2014) with the default setting, $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$, to train the model. Gradient clipping is applied with the norm bounded by 10. Dropout (Srivastava et al., 2014) is used with the dropout rate set to 0.2 for both datasets, in accordance with the model's performance on the development set. Based on the performance on the development set, we use beam search with a beam width of 10 to generate text.

For the English-German translation, we compare with the following baseline models. ByteNet is a Seq2Seq model based on dilated convolution, which runs faster than conventional RNN-based models (Kalchbrenner et al., 2016). GNMT is an improved end-to-end translation system that tackles many practical problems in NMT (Wu et al., 2016). ConvS2S is a Seq2Seq model based entirely on CNNs and attention mechanisms, which achieves outstanding performance in NMT. For the English-Vietnamese translation, the models to be compared are presented below. RNNSearch is the attention-based Seq2Seq model mentioned above; we present the results of Luong and Manning (2015). For both datasets, we also reimplement the attention-based Seq2Seq baseline, which is named Seq2Seq.

In the following, we present our experimental results as well as our analysis of the proposed modules to figure out how they enhance the performance of the basic Seq2Seq model for NMT.

Table 1 shows the results of our model as well as the baseline models on the English-German translation dataset.

Model                           BLEU
ByteNet                         23.10
GNMT                            24.60
ConvS2S                         25.16
Seq2Seq (our reimplementation)  25.14
FPB                             25.79

Table 1: Results of the models on the English-German translation.

Table 2 shows the results of the models on the English-Vietnamese translation dataset. On the BLEU evaluation, our proposed model has a significant advantage over RNNSearch, which demonstrates that our model is effective in improving the performance of the baseline. In the following, we conduct an ablation test to evaluate the effect of each module and examine the word prediction accuracy of the BOW predictor.

Model                           BLEU
RNNSearch                       26.10
Seq2Seq (our reimplementation)  25.90
FPB                             27.70

Table 2: Results of the models on the English-Vietnamese translation.

To evaluate the effects of each proposed module, we conduct an ablation test for our model to examine the individual effects of the BOW predictor and the length predictor. We present the results of the ablation test in Table 3. Compared with the basic attention-based Seq2Seq model, the length predictor brings a slight improvement over the baseline, while the model with only the BOW predictor outperforms the baseline by a large margin. It is obvious that the BOW predictor contributes to the model's performance, and we analyze its bag-of-words prediction accuracy in the next section. The combination of the two modules, which is our proposed model, achieves the best performance.

Model                           BLEU
Seq2Seq (our reimplementation)  25.90
+length predictor               26.26
+BOW predictor                  27.38
FPB                             27.70

Table 3: Ablation test on the English-Vietnamese translation. Seq2Seq refers to our reimplementation of the attention-based Seq2Seq model.

In this section, we present our analysis of the prediction accuracy of the BOW predictor. As the BOW predictor predicts words at each decoding time step, we evaluate its accuracy in various situations by measuring its bag-of-words prediction accuracy with different numbers of untranslated words. For example, if there are still 20 words left for translation, we evaluate whether the BOW predictor can predict the correct words, without concern for their sequential order.
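As one illustrative reading of this protocol (the exact counting is not spelled out above, so the top-$m$ selection and the collapsing of duplicates below are our own assumptions), the order-agnostic accuracy at a single decoding step could be computed roughly as follows:

```python
import torch


def bow_accuracy(p_t, remaining_word_ids):
    """Order-agnostic accuracy of the BOW prediction at one decoding step.

    p_t:                (vocab_size,) averaged BOW distribution at this step
    remaining_word_ids: ids of the words still left to translate
    The top-|remaining| words of p_t are compared against the reference set,
    ignoring word order (duplicates are collapsed into a set).
    """
    reference = set(remaining_word_ids)
    predicted = set(torch.topk(p_t, k=len(remaining_word_ids)).indices.tolist())
    return len(predicted & reference) / max(len(reference), 1)
```

Bucketing these per-step accuracies by the number of untranslated words would then produce a curve of the kind shown in Figure 1.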
[Figure 1: Accuracy of the BOW prediction at each time step with different numbers of untranslated words. The y-axis shows the accuracy of the BOW predictor (%).]

Results shown in Figure 1 reflect our model's performance on predicting the bag of words to translate at different time steps with diverse numbers of untranslated words. It can be found that as the number of untranslated words increases, the prediction accuracy decreases. The phenomenon is reasonable, as it is more difficult to predict information about the further future with only the source-side context and the previous generation. However, even when the number of untranslated words is relatively large (20 words), the model still maintains a stable performance with an accuracy of around 50%. This demonstrates that our model possesses a strong capability of predicting word-level information about future generation.

In this paper, we propose a novel model for NMT with a BOW predictor that predicts the words that are not yet translated and a length predictor that predicts the length of the untranslated content. The model can thus receive information about the future from its own conjecture to improve the quality of the current translation. Experimental results demonstrate that our model outperforms the baseline models on the English-Vietnamese translation dataset. Moreover, our analysis shows that our proposed modules can enhance the performance of the baseline individually, especially the BOW predictor, and we find that the BOW predictor is able to predict words with high accuracy, and that the accuracy increases as the number of untranslated words declines.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. Proc. of IWSLT, Da Nang, Vietnam.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, pages 1724–1734.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP 2013, pages 1700–1709.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Junyang Lin, Shuming Ma, Qi Su, and Xu Sun. 2018a. Decoding-history-based adaptive control of attention for neural machine translation. CoRR, abs/1802.01812.

Junyang Lin, Xu Sun, Xuancheng Ren, Muyu Li, and Qi Su. 2018b. Learning when to concentrate or divert attention: Self-adaptive attention temperature for neural machine translation. CoRR, abs/1808.07374.

Junyang Lin, Xu Sun, Xuancheng Ren, Shuming Ma, Jinsong Su, and Qi Su. 2018c. Deconvolution-based global decoding for neural machine translation. CoRR, abs/1806.03692.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015, pages 1412–1421.

Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. CoRR, abs/1805.04871.

Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Interactive attention for neural machine translation. In COLING 2016, pages 2174–2185.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016a. Coverage embedding models for neural machine translation. In EMNLP 2016, pages 955–960.
Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016b. Supervised attentions for neural machine translation. In EMNLP 2016, pages 2283–2288.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL 2002, pages 311–318.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL 2016.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014, pages 3104–3112.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Hao Xiong, Zhongjun He, Xiaoguang Hu, and Hua Wu. 2017. Multi-channel encoder for neural machine translation.