English-Japanese Neural Machine Translation with Encoder-Decoder-Reconstructor
Yukio Matsumura, Takayuki Sato, Mamoru Komachi
Tokyo Metropolitan University, Tokyo, Japan
[email protected], [email protected], [email protected]

Abstract
Neural machine translation (NMT) has recently become popular in the field of machine translation. However, NMT suffers from the problem of repeating or missing words in the translation. To address this problem, Tu et al. (2017) proposed an encoder-decoder-reconstructor framework for NMT using back-translation. In this method, they selected the best forward translation model in the same manner as Bahdanau et al. (2015), and then trained a bi-directional translation model as fine-tuning. Their experiments show that it offers significant improvement in BLEU scores on a Chinese-English translation task. We confirm that our re-implementation shows the same tendency and alleviates the problem of repeating and missing words in the translation on an English-Japanese task as well. In addition, we evaluate the effectiveness of pre-training by comparing it with a jointly-trained model of forward translation and back-translation.
Introduction

Recently, neural machine translation (NMT) has gained popularity in the field of machine translation. The conventional encoder-decoder NMT proposed by Cho et al. (2014) uses two recurrent neural networks (RNNs): an encoder, which encodes a source sequence into a fixed-length vector, and a decoder, which decodes the vector into a target sequence. The attention-based NMT proposed by Bahdanau et al. (2015) predicts output words using weights over the hidden states of the encoder computed by the attention mechanism, improving the adequacy of translation.

Even with the success of attention-based models, a number of open questions remain in NMT. Tu et al. (2016) argued that two of the common problems are over-translation (some words are repeatedly translated unnecessarily) and under-translation (some words are mistakenly untranslated). This is due to the fact that NMT cannot completely convert the information from the source sentence into the target sentence. Mi et al. (2016) and Feng et al. (2016) pointed out that NMT lacks the notion of the coverage vector used in phrase-based statistical machine translation (PBSMT), so unless otherwise specified, there is no way to prevent missing translations.

Another problem in NMT is the objective function. NMT is optimized by cross-entropy; therefore, it does not directly maximize translation accuracy. Shen et al. (2016) pointed out that optimization by cross-entropy is not appropriate and proposed a method of optimization based on a translation accuracy score, such as expected BLEU, which led to improvement of translation accuracy. However, BLEU is an evaluation metric based on n-gram precision; therefore, repetition of some words may be present in the translation even though the BLEU score is improved.

To address the problem of repeating and missing words in the translation, Tu et al. (2017) introduced an encoder-decoder-reconstructor framework that optimizes NMT by back-translation from the output sentences into the original source sentences. In their method, after training the forward translation in a manner similar to the conventional attention-based NMT, they train a back-translation model from the hidden states of the decoder into the source sequence by a new decoder, to enforce agreement between source and target sentences.

In order to confirm the language independence of the framework, we experiment on two parallel corpora of English-Japanese and Japanese-English translation tasks using the encoder-decoder-reconstructor.

Figure 1: Attention-based NMT.
Figure 2: Encoder-Decoder-Reconstructor.

Our experiments show that their method offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on the English-Japanese translation task, though the difference is not significant on the Japanese-English translation task.

In addition, we jointly train a model of forward translation and back-translation without pre-training, and then evaluate this model. As a result, the encoder-decoder-reconstructor cannot be trained well without pre-training, which proves that we have to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.

The main contributions of this paper are as follows:

• Experimental results show that the encoder-decoder-reconstructor framework achieves significant improvements in BLEU scores (1.0-1.4) on the English-Japanese translation task.

• Experimental results show that the encoder-decoder-reconstructor framework has to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.
Related Work

Several studies have addressed the NMT-specific problem of missing or repeating words. Niehues et al. (2016) optimized NMT by adding the outputs of PBSMT to the input of NMT. Mi et al. (2016) and Feng et al. (2016) introduced a distributed version of the coverage vector taken from PBSMT to consider which words have already been translated. All these methods, including ours, employ information from the source sentence to improve the quality of translation, but our method uses back-translation to ensure that there is no inconsistency. Unlike other methods, once learned, our method is identical to the conventional NMT model, so it does not need any additional parameters such as a coverage vector or a PBSMT system at test time.

The attention mechanism proposed by Meng et al. (2016) considers not only the hidden states of the encoder but also the hidden states of the decoder, so that over-translation can be relaxed. In addition, the attention mechanism proposed by Feng et al. (2016) computes a context vector by considering the previous context vector to prevent over-translation. These works indirectly reduce repeating and missing words, while we directly penalize translation mismatch by considering back-translation.

The encoder-decoder-reconstructor framework for NMT proposed by Tu et al. (2017) optimizes NMT with a reconstructor using back-translation. They consider the likelihood of both forward translation and back-translation, and this framework offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on a Chinese-English translation task.
Attention-based NMT

Here, we describe the attention-based NMT proposed by Bahdanau et al. (2015), as shown in Figure 1.

The input sequence $x = [x_1, x_2, \cdots, x_{|x|}]$ is converted into a fixed-length vector by the encoder using an RNN. At each time step $t$, the hidden state $h_t$ of the encoder is represented as

$$h_t = [\overrightarrow{h}_t^\top : \overleftarrow{h}_t^\top]^\top \quad (1)$$

using a bidirectional RNN. The forward state $\overrightarrow{h}_t$ and the backward state $\overleftarrow{h}_t$ are computed by

$$\overrightarrow{h}_t = r(x_t, \overrightarrow{h}_{t-1}) \quad (2)$$

$$\overleftarrow{h}_t = r'(x_t, \overleftarrow{h}_{t+1}) \quad (3)$$

where $r$ and $r'$ are nonlinear functions. The hidden states $(h_1, h_2, \cdots, h_{|x|})$ are converted into a fixed-length vector $v$ as

$$v = q([h_1, h_2, \cdots, h_{|x|}]) \quad (4)$$

where $q$ is a nonlinear function.

The fixed-length vector $v$ generated by the encoder is converted into the target sequence $y = [y_1, y_2, \cdots, y_{|y|}]$ by the decoder using an RNN. At each time step $i$, the conditional probability of the output word $\hat{y}_i$ is computed (following Bahdanau et al. (2015); the original text is truncated here) by

$$p(\hat{y}_i \mid \hat{y}_{<i}, x) = g(\hat{y}_{i-1}, s_i, c_i) \quad (5)$$

where $g$ is a nonlinear function, $s_i$ is the hidden state of the decoder, and $c_i$ is the context vector, computed as a weighted sum of the encoder hidden states:

$$c_i = \sum_{t=1}^{|x|} \alpha_{it} h_t \quad (6)$$

$$\alpha_{it} = \frac{\exp(e_{it})}{\sum_{k=1}^{|x|} \exp(e_{ik})} \quad (7)$$

where $e_{it}$ is an alignment score indicating how well the input around position $t$ and the output at position $i$ match.
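To make the flow of Equations (1)-(7) concrete, the following minimal NumPy sketch computes bidirectional encoder states and an attention-based context vector. The plain tanh cell (standing in for the unspecified nonlinearities r and r'), the shared weights across the two encoder directions, and all dimensions and variable names are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, b):
    # One step of a plain tanh RNN cell; stands in for the unspecified
    # nonlinear functions r and r' in Eqs. (2)-(3).
    return np.tanh(W @ x + U @ h_prev + b)

def encode(xs, W, U, b, hidden=4):
    # Bidirectional encoding, Eq. (1): h_t = [forward_t ; backward_t].
    # For brevity both directions share parameters here; in practice the
    # forward and backward RNNs have separate weights.
    fwd, bwd = [np.zeros(hidden)], [np.zeros(hidden)]
    for x in xs:                      # left-to-right pass, Eq. (2)
        fwd.append(rnn_step(x, fwd[-1], W, U, b))
    for x in reversed(xs):            # right-to-left pass, Eq. (3)
        bwd.append(rnn_step(x, bwd[-1], W, U, b))
    bwd = bwd[:0:-1]                  # drop the initial state, restore time order
    return [np.concatenate(pair) for pair in zip(fwd[1:], bwd)]

def context_vector(s_prev, hs, v_a, W_a, U_a):
    # Attention, Eqs. (6)-(7): score each encoder state against the
    # previous decoder state, normalize with softmax, take weighted sum.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in hs])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return sum(a * h for a, h in zip(alphas, hs))

# Toy usage: 5 source "words" as 3-dimensional embeddings.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
W, U, b = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), np.zeros(4)
hs = encode(xs, W, U, b)
v_a, W_a, U_a = rng.standard_normal(6), rng.standard_normal((6, 4)), rng.standard_normal((6, 8))
c = context_vector(np.zeros(4), hs, v_a, W_a, U_a)  # context for the first target word
```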
Encoder-Decoder-Reconstructor

Tu et al. (2017) used a beam search to predict target sentences that approximately maximize the likelihoods of both forward translation and back-translation at test time. In this paper, however, we do not use a beam search, for simplicity and effectiveness.
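For reference, the joint training objective that this framework optimizes can be written as follows; this is our rendering of the objective of Tu et al. (2017), with N training pairs, theta the encoder-decoder parameters, gamma the reconstructor parameters, and s^(n) the decoder hidden states for the n-th pair:

```latex
% Forward translation likelihood plus lambda-weighted reconstruction
% likelihood, summed over the training data (after Tu et al., 2017).
J(\theta, \gamma) = \sum_{n=1}^{N} \Big[
    \log P\big(y^{(n)} \mid x^{(n)}; \theta\big)
  + \lambda \log P\big(x^{(n)} \mid s^{(n)}; \theta, \gamma\big)
\Big]
```

The hyper-parameter lambda balances translation and reconstruction; the experiments below set lambda = 1, following Tu et al. (2017).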
Experiments

We evaluated the encoder-decoder-reconstructor framework for NMT on English-Japanese and Japanese-English translation tasks.

Example 1: Improvement in under-translation.
Input: the conditions under which the effect of turbulent viscosity is correctly evaluated were examined on the basis of the relation between turbulent viscosity and numerical viscosity in size .
Baseline-NMT: 乱流粘性の影響を正確に評価する条件を検討した。
+Reconstructor: 乱流粘性の影響を正確に評価する条件を,乱流粘性と数値的粘性の関係を基に調べた。
+Reconstructor (Jointly-Training): 乱流粘性の影響を考慮した条件を,乱流粘性と粘性の粘性との関係をもとに検討した。
Reference: 乱流粘性と数値粘性の大小関係により,乱流粘性の効果が正しく評価される条件を検討した。

Example 2: Improvement in over-translation.
Input: activity was high in cells of the young , especially newborn infant , and was very slight in cells of 30 - year - old or more .
Baseline-NMT: 活動性は若齢,特に新生児新生児では30歳以上の細胞で高く,30歳以上の細胞ではわずかであった。
+Reconstructor: その活性は若齢,特に新生児は細胞が高く,30歳以上の細胞ではわずかであった。
+Reconstructor (Jointly-Training): 若齢の新生児では活性は高かったが,30歳以上の場合には極めて軽度であった。
Reference: 活性は若い個体,特に新生児の細胞で高く,30歳以上のものではごくわずかであった。

Table 4: Examples of outputs of English-Japanese translation.
We used two parallel corpora: the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016) and the NTCIR PatentMT Parallel Corpus (Goto et al., 2013). Regarding the training data of ASPEC, we used only the first 1 million sentences sorted by sentence-alignment similarity. Japanese sentences were segmented by the morphological analyzer MeCab (version 0.996, IPADIC), and English sentences were tokenized by tokenizer.perl of Moses. Table 1 shows the number of sentences in each corpus. Note that sentences with more than 40 words were excluded from the training data.
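As an illustration of this preprocessing, the sketch below segments Japanese with MeCab and drops pairs longer than 40 tokens. The file paths are placeholders, and the English side is assumed to have been tokenized beforehand with Moses tokenizer.perl; this is a minimal sketch, not the authors' pipeline.

```python
import MeCab

# -Owakati makes MeCab output space-separated surface forms (IPADIC assumed).
tagger = MeCab.Tagger("-Owakati")

def preprocess(en_path, ja_path, max_len=40):
    """Yield (English, Japanese) token lists, skipping pairs where either
    side exceeds max_len tokens, as done for the training data."""
    with open(en_path, encoding="utf-8") as en_f, \
         open(ja_path, encoding="utf-8") as ja_f:
        for en_line, ja_line in zip(en_f, ja_f):
            en_tokens = en_line.split()  # already tokenized by tokenizer.perl
            ja_tokens = tagger.parse(ja_line.strip()).split()
            if len(en_tokens) <= max_len and len(ja_tokens) <= max_len:
                yield en_tokens, ja_tokens
```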
We compared three models: the attention-based NMT (Bahdanau et al., 2015) as the baseline-NMT, the encoder-decoder-reconstructor (Tu et al., 2017), and an encoder-decoder-reconstructor that jointly trains forward translation and back-translation without pre-training. The RNNs used in the experiments had 512 hidden units and 512 embedding units, with a vocabulary size of 30,000 and a batch size of 64. We used Adagrad (initial learning rate 0.01) to optimize the model parameters. We trained our models on a GeForce GTX TITAN X GPU. Note that we set the hyper-parameter λ = 1 in the encoder-decoder-reconstructor, the same as Tu et al. (2017).

Tables 2 and 3 show the translation accuracy in BLEU scores, the p-value of the significance test by bootstrap resampling (Koehn, 2004), and the training time in hours until convergence. The encoder-decoder-reconstructor (Tu et al., 2017) requires slightly more time to train than the baseline-NMT, but we emphasize that decoding time remains the same for the encoder-decoder-reconstructor and the baseline-NMT. The results show that the encoder-decoder-reconstructor (Tu et al., 2017) significantly improves translation accuracy, by 1.01 points on ASPEC and 1.37 points on NTCIR, in English-Japanese translation (p < 0.05). However, it does not significantly improve translation accuracy in Japanese-English translation. In addition, the results show that the encoder-decoder-reconstructor without pre-training worsens rather than improves translation accuracy.
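The significance test follows Koehn (2004): repeatedly resample the test set with replacement and compare the two systems' corpus-level scores. Below is a minimal sketch; `corpus_bleu` is an assumed scoring callable (for example, from sacrebleu or NLTK), not part of the paper.

```python
import random

def bootstrap_pvalue(sys_a, sys_b, refs, corpus_bleu, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004): estimate how often
    system A fails to outperform system B on resampled test sets."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        a = corpus_bleu([sys_a[i] for i in sample], [refs[i] for i in sample])
        b = corpus_bleu([sys_b[i] for i in sample], [refs[i] for i in sample])
        if a > b:
            wins += 1
    return 1.0 - wins / n_samples  # small value -> A significantly better than B
```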
Table 4 shows examples of outputs of English-Japanese translation. In Example 1, "乱流粘性と数値粘性の大小関係により," (on the basis of the relation between turbulent viscosity and numerical viscosity in size) is missing in the output of the baseline-NMT, but "乱流粘性と数値的粘性の関係を基に" (on the basis of the relation between turbulent viscosity and numerical viscosity) is present in the output of the encoder-decoder-reconstructor. In Example 2, "新生児" (newborn infant) and "30歳以上の" (of 30 - year - old or more) are repeated in the output of the baseline-NMT, but they appear only once in the output of the encoder-decoder-reconstructor.

Figure 3: The attention layer in Example 1: Improvement in under-translation (Baseline-NMT vs. Encoder-Decoder-Reconstructor).

Figure 4: The attention layer in Example 2: Improvement in over-translation (Baseline-NMT vs. Encoder-Decoder-Reconstructor).

In addition, Figures 3 and 4 show the attention layers of the baseline-NMT and the encoder-decoder-reconstructor in each example. In Figure 3, although the attention layer of the baseline-NMT attends to the input word "turbulent", the decoder outputs not "乱流" (turbulent) but "検討" (examined) as the 13th word. Thus, the under-translation may result from the hidden layer or the embedding layer rather than the attention layer. In Figure 4, the attention layer of the baseline-NMT repeatedly attends to the input words "newborn infant" and "30 - year - old or more". Consequently, the decoder repeatedly outputs "新生児" (newborn infant) and "30歳以上の" (of 30 - year - old or more). On the other hand, the attention layer of the encoder-decoder-reconstructor attends to the input words almost correctly.

Table 5 shows a comparison of the number of word occurrences for each corpus and model. The columns show (i) the number of word tokens that appear more frequently than their counterparts in the reference and (ii) the number of word tokens that appear more than once but are not included in the reference. Note that these numbers do not include unknown words, so (iii) shows the number of unknown word tokens.

                                            English-Japanese       Japanese-English
Corpus  Model                               (i)    (ii)   (iii)    (i)    (ii)   (iii)
ASPEC   Baseline-NMT                        1,141  378    1,045    951    494    1,085
        +Reconstructor                      988    336    1,042    836    418    1,014
        +Reconstructor (Jointly-Training)   1,292  446    1,147    1,106  525    1,821
NTCIR   Baseline-NMT                        2,122  1,015  1,106    2,521  1,073  1,630
        +Reconstructor                      1,958  922    963      2,187  987    1,422
        +Reconstructor (Jointly-Training)   1,978  916    1,078    2,475  1,107  1,610

Table 5: Numbers of redundant and unknown word tokens.

In all cases, the number of occurrences of redundant words is reduced by the encoder-decoder-reconstructor. Thus, we confirmed that the encoder-decoder-reconstructor achieves a reduction of repeating and missing words while maintaining the quality of translation.

Conclusion

In this paper, we evaluated the encoder-decoder-reconstructor on English-Japanese and Japanese-English translation tasks. In addition, we evaluated the effectiveness of pre-training by comparing it with a jointly-trained model of forward translation and back-translation. Experimental results show that the encoder-decoder-reconstructor offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on the English-Japanese translation task. They also show that the encoder-decoder-reconstructor cannot be trained well without pre-training, which proves that we have to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the 3rd International Conference on Learning Representations (ICLR), pages 1-15.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734.

Shi Feng, Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2016. Improving Attention Modeling with Implicit Distortion and Fertility for Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 3082-3092.

Isao Goto, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K. Tsou. 2013. Overview of the Patent Machine Translation Task at the NTCIR-10 Workshop. Proceedings of the 10th NII Testbeds and Community for Information access Research Conference (NTCIR), pages 260-286.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388-395.

Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Interactive Attention for Neural Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 2174-2185.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage Embedding Models for Neural Machine Translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 955-960.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pages 2204-2208.

Jan Niehues, Eunah Cho, Thanh-Le Ha, and Alex Waibel. 2016. Pre-Translation for Neural Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 1828-1836.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1683-1692.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural Machine Translation with Reconstruction. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 3097-3103.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling Coverage for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 76-85.