English-Japanese Neural Machine Translation with Encoder-Decoder-Reconstructor
Yukio Matsumura, Takayuki Sato, Mamoru Komachi
Tokyo Metropolitan University, Tokyo, Japan
[email protected], [email protected], [email protected]

Abstract
Neural machine translation (NMT) has recently become popular in the field of machine translation. However, NMT suffers from the problem of repeating or missing words in the translation. To address this problem, Tu et al. (2017) proposed an encoder-decoder-reconstructor framework for NMT using back-translation. In this method, they selected the best forward translation model in the same manner as Bahdanau et al. (2015), and then trained a bi-directional translation model as fine-tuning. Their experiments show that it offers significant improvement in BLEU scores on a Chinese-English translation task. We confirm that our re-implementation shows the same tendency and alleviates the problem of repeating and missing words in the translation on an English-Japanese task as well. In addition, we evaluate the effectiveness of pre-training by comparing it with a jointly-trained model of forward translation and back-translation.
Introduction

Recently, neural machine translation (NMT) has gained popularity in the field of machine translation. The conventional encoder-decoder NMT proposed by Cho et al. (2014) uses two recurrent neural networks (RNNs): an encoder, which encodes a source sequence into a fixed-length vector, and a decoder, which decodes the vector into a target sequence. The attention-based NMT proposed by Bahdanau et al. (2015) predicts output words using weights over the hidden states of the encoder computed by the attention mechanism, improving the adequacy of translation.

Even with the success of attention-based models, a number of open questions remain in NMT. Tu et al. (2016) argued that two of the common problems are over-translation (some words are repeatedly translated unnecessarily) and under-translation (some words are mistakenly untranslated). This is due to the fact that NMT cannot completely convert the information from the source sentence into the target sentence. Mi et al. (2016) and Feng et al. (2016) pointed out that NMT lacks the notion of the coverage vector used in phrase-based statistical machine translation (PBSMT), so unless otherwise specified, there is no way to prevent missing translations.

Another problem in NMT is the objective function. NMT is optimized by cross-entropy; therefore, it does not directly maximize translation accuracy. Shen et al. (2016) pointed out that optimization by cross-entropy is not appropriate and proposed a method of optimization based on a translation accuracy score, such as expected BLEU, which led to improvement of translation accuracy. However, BLEU is an evaluation metric based on n-gram precision; therefore, repetition of some words may be present in the translation even though the BLEU score is improved.

To address the problem of repeating and missing words in the translation, Tu et al. (2017) introduced an encoder-decoder-reconstructor framework that optimizes NMT by back-translation from the output sentences into the original source sentences. In their method, after training the forward translation in a manner similar to the conventional attention-based NMT, they train a back-translation model from the hidden states of the decoder into the source sequence by a new decoder, to enforce agreement between source and target sentences.

In order to confirm the language independence of the framework, we experiment on two parallel corpora of English-Japanese and Japanese-English translation tasks using the encoder-decoder-reconstructor.

Figure 1: Attention-based NMT.
Figure 2: Encoder-Decoder-Reconstructor.

Our experiments show that their method offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on the English-Japanese translation task, though the difference is not significant on the Japanese-English translation task.

In addition, we jointly train a model of forward translation and back-translation without pre-training, and then evaluate this model. As a result, the encoder-decoder-reconstructor cannot be trained well without pre-training, which proves that we have to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.

The main contributions of this paper are as follows:

• Experimental results show that the encoder-decoder-reconstructor framework achieves significant improvements in BLEU scores (1.0-1.4) on the English-Japanese translation task.

• Experimental results show that the encoder-decoder-reconstructor framework has to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.
Related Work

Several studies have addressed the NMT-specific problem of missing or repeating words. Niehues et al. (2016) optimized NMT by adding the outputs of PBSMT to the input of NMT. Mi et al. (2016) and Feng et al. (2016) introduced a distributed version of the coverage vector taken from PBSMT to consider which words have already been translated. All these methods, including ours, employ information from the source sentence to improve the quality of translation, but our method uses back-translation to ensure that there is no inconsistency. Unlike other methods, once learned, our method is identical to the conventional NMT model, so it does not need any additional parameters such as a coverage vector or a PBSMT system at test time.

The attention mechanism proposed by Meng et al. (2016) considers not only the hidden states of the encoder but also the hidden states of the decoder, so that over-translation can be relaxed. In addition, the attention mechanism proposed by Feng et al. (2016) computes a context vector by considering the previous context vector to prevent over-translation. These works indirectly reduce repeating and missing words, while we directly penalize translation mismatch by considering back-translation.

The encoder-decoder-reconstructor framework for NMT proposed by Tu et al. (2017) optimizes NMT with a reconstructor using back-translation. They consider the likelihood of both forward translation and back-translation, and this framework offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on a Chinese-English translation task.
Attention-based NMT

Here, we describe the attention-based NMT proposed by Bahdanau et al. (2015), as shown in Figure 1.

The input sequence $x = [x_1, x_2, \cdots, x_{|x|}]$ is converted into a fixed-length vector by the encoder using an RNN. At each time step $t$, the hidden state $h_t$ of the encoder is represented as

$$h_t = [\overrightarrow{h}_t^\top : \overleftarrow{h}_t^\top]^\top \quad (1)$$

using a bidirectional RNN. The forward state $\overrightarrow{h}_t$ and the backward state $\overleftarrow{h}_t$ are computed by

$$\overrightarrow{h}_t = r(x_t, \overrightarrow{h}_{t-1}) \quad (2)$$

$$\overleftarrow{h}_t = r'(x_t, \overleftarrow{h}_{t+1}) \quad (3)$$

where $r$ and $r'$ are nonlinear functions. The hidden states $(h_1, h_2, \cdots, h_{|x|})$ are converted into a fixed-length vector $v$ as

$$v = q([h_1, h_2, \cdots, h_{|x|}]) \quad (4)$$

where $q$ is a nonlinear function.

The fixed-length vector $v$ generated by the encoder is converted into the target sequence $y = [y_1, y_2, \cdots, y_{|y|}]$ by the decoder using an RNN. At each time step $i$, the conditional probability of the output word $\hat{y}_i$ is computed (following Bahdanau et al. (2015); the original text is truncated here) by

$$p(\hat{y}_i \mid \hat{y}_{<i}, x) = g(\hat{y}_{i-1}, s_i, c_i) \quad (5)$$

where $g$ is a nonlinear function, $s_i$ is the hidden state of the decoder, and $c_i$ is the context vector, computed as a weighted sum of the encoder hidden states:

$$c_i = \sum_{t=1}^{|x|} \alpha_{it} h_t \quad (6)$$

$$\alpha_{it} = \frac{\exp(e_{it})}{\sum_{k=1}^{|x|} \exp(e_{ik})} \quad (7)$$

where $e_{it}$ is an alignment score indicating how well the input around position $t$ and the output at position $i$ match.
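To make the flow of Equations (1)-(7) concrete, the following minimal NumPy sketch computes bidirectional encoder states and an attention-based context vector. The plain tanh cell (standing in for the unspecified nonlinearities r and r'), the shared weights across the two encoder directions, and all dimensions and variable names are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, b):
    # One step of a plain tanh RNN cell; stands in for the unspecified
    # nonlinear functions r and r' in Eqs. (2)-(3).
    return np.tanh(W @ x + U @ h_prev + b)

def encode(xs, W, U, b, hidden=4):
    # Bidirectional encoding, Eq. (1): h_t = [forward_t ; backward_t].
    # For brevity both directions share parameters here; in practice the
    # forward and backward RNNs have separate weights.
    fwd, bwd = [np.zeros(hidden)], [np.zeros(hidden)]
    for x in xs:                      # left-to-right pass, Eq. (2)
        fwd.append(rnn_step(x, fwd[-1], W, U, b))
    for x in reversed(xs):            # right-to-left pass, Eq. (3)
        bwd.append(rnn_step(x, bwd[-1], W, U, b))
    bwd = bwd[:0:-1]                  # drop the initial state, restore time order
    return [np.concatenate(pair) for pair in zip(fwd[1:], bwd)]

def context_vector(s_prev, hs, v_a, W_a, U_a):
    # Attention, Eqs. (6)-(7): score each encoder state against the
    # previous decoder state, normalize with softmax, take weighted sum.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in hs])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return sum(a * h for a, h in zip(alphas, hs))

# Toy usage: 5 source "words" as 3-dimensional embeddings.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
W, U, b = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), np.zeros(4)
hs = encode(xs, W, U, b)
v_a, W_a, U_a = rng.standard_normal(6), rng.standard_normal((6, 4)), rng.standard_normal((6, 8))
c = context_vector(np.zeros(4), hs, v_a, W_a, U_a)  # context for the first target word
```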
Encoder-Decoder-Reconstructor

Tu et al. (2017) used a beam search to predict target sentences that approximately maximize the likelihoods of both forward translation and back-translation at test time. In this paper, however, we do not use a beam search, for simplicity and effectiveness.
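For reference, the joint training objective that this framework optimizes can be written as follows; this is our rendering of the objective of Tu et al. (2017), with N training pairs, theta the encoder-decoder parameters, gamma the reconstructor parameters, and s^(n) the decoder hidden states for the n-th pair:

```latex
% Forward translation likelihood plus lambda-weighted reconstruction
% likelihood, summed over the training data (after Tu et al., 2017).
J(\theta, \gamma) = \sum_{n=1}^{N} \Big[
    \log P\big(y^{(n)} \mid x^{(n)}; \theta\big)
  + \lambda \log P\big(x^{(n)} \mid s^{(n)}; \theta, \gamma\big)
\Big]
```

The hyper-parameter lambda balances translation and reconstruction; the experiments below set lambda = 1, following Tu et al. (2017).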
Experiments

We evaluated the encoder-decoder-reconstructor framework for NMT on English-Japanese and Japanese-English translation tasks.

Example 1: Improvement in under-translation.
Input: the conditions under which the effect of turbulent viscosity is correctly evaluated were examined on the basis of the relation between turbulent viscosity and numerical viscosity in size .
Baseline-NMT: 乱流粘性の影響を正確に評価する条件を検討した。
+Reconstructor: 乱流粘性の影響を正確に評価する条件を,乱流粘性と数値的粘性の関係を基に調べた。
+Reconstructor (Jointly-Training): 乱流粘性の影響を考慮した条件を,乱流粘性と粘性の粘性との関係をもとに検討した。
Reference: 乱流粘性と数値粘性の大小関係により,乱流粘性の効果が正しく評価される条件を検討した。

Example 2: Improvement in over-translation.
Input: activity was high in cells of the young , especially newborn infant , and was very slight in cells of 30 - year - old or more .
Baseline-NMT: 活動性は若齢,特に新生児新生児では30歳以上の細胞で高く,30歳以上の細胞ではわずかであった。
+Reconstructor: その活性は若齢,特に新生児は細胞が高く,30歳以上の細胞ではわずかであった。
+Reconstructor (Jointly-Training): 若齢の新生児では活性は高かったが,30歳以上の場合には極めて軽度であった。
Reference: 活性は若い個体,特に新生児の細胞で高く,30歳以上のものではごくわずかであった。

Table 4: Examples of outputs of English-Japanese translation.
We used two parallel corpora: the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016) and the NTCIR PatentMT Parallel Corpus (Goto et al., 2013). Regarding the training data of ASPEC, we used only the first 1 million sentences sorted by sentence-alignment similarity. Japanese sentences were segmented by the morphological analyzer MeCab (version 0.996, IPADIC), and English sentences were tokenized by tokenizer.perl of Moses. Table 1 shows the number of sentences in each corpus. Note that sentences with more than 40 words were excluded from the training data.
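As an illustration of this preprocessing, the sketch below segments Japanese with MeCab and drops pairs longer than 40 tokens. The file paths are placeholders, and the English side is assumed to have been tokenized beforehand with Moses tokenizer.perl; this is a minimal sketch, not the authors' pipeline.

```python
import MeCab

# -Owakati makes MeCab output space-separated surface forms (IPADIC assumed).
tagger = MeCab.Tagger("-Owakati")

def preprocess(en_path, ja_path, max_len=40):
    """Yield (English, Japanese) token lists, skipping pairs where either
    side exceeds max_len tokens, as done for the training data."""
    with open(en_path, encoding="utf-8") as en_f, \
         open(ja_path, encoding="utf-8") as ja_f:
        for en_line, ja_line in zip(en_f, ja_f):
            en_tokens = en_line.split()  # already tokenized by tokenizer.perl
            ja_tokens = tagger.parse(ja_line.strip()).split()
            if len(en_tokens) <= max_len and len(ja_tokens) <= max_len:
                yield en_tokens, ja_tokens
```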
We compared three models: the attention-based NMT (Bahdanau et al., 2015) as the baseline-NMT, the encoder-decoder-reconstructor (Tu et al., 2017), and an encoder-decoder-reconstructor that jointly trains forward translation and back-translation without pre-training. The RNNs used in the experiments had 512 hidden units and 512 embedding units, with a vocabulary size of 30,000 and a batch size of 64. We used Adagrad (initial learning rate 0.01) to optimize the model parameters. We trained our models on a GeForce GTX TITAN X GPU. Note that we set the hyper-parameter λ = 1 in the encoder-decoder-reconstructor, the same as Tu et al. (2017).

Tables 2 and 3 show the translation accuracy in BLEU scores, the p-value of the significance test by bootstrap resampling (Koehn, 2004), and the training time in hours until convergence. The encoder-decoder-reconstructor (Tu et al., 2017) requires slightly more time to train than the baseline-NMT, but we emphasize that decoding time remains the same for the encoder-decoder-reconstructor and the baseline-NMT. The results show that the encoder-decoder-reconstructor (Tu et al., 2017) significantly improves translation accuracy, by 1.01 points on ASPEC and 1.37 points on NTCIR, in English-Japanese translation (p < 0.05). However, it does not significantly improve translation accuracy in Japanese-English translation. In addition, the results show that the encoder-decoder-reconstructor without pre-training worsens rather than improves translation accuracy.
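The significance test follows Koehn (2004): repeatedly resample the test set with replacement and compare the two systems' corpus-level scores. Below is a minimal sketch; `corpus_bleu` is an assumed scoring callable (for example, from sacrebleu or NLTK), not part of the paper.

```python
import random

def bootstrap_pvalue(sys_a, sys_b, refs, corpus_bleu, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004): estimate how often
    system A fails to outperform system B on resampled test sets."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        a = corpus_bleu([sys_a[i] for i in sample], [refs[i] for i in sample])
        b = corpus_bleu([sys_b[i] for i in sample], [refs[i] for i in sample])
        if a > b:
            wins += 1
    return 1.0 - wins / n_samples  # small value -> A significantly better than B
```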
Table 4 shows examples of outputs of English-Japanese translation. In Example 1, "乱流粘性と数値粘性の大小関係により," (on the basis of the relation between turbulent viscosity and numerical viscosity in size) is missing in the output of the baseline-NMT, but "乱流粘性と数値的粘性の関係を基に" (on the basis of the relation between turbulent viscosity and numerical viscosity) is present in the output of the encoder-decoder-reconstructor. In Example 2, "新生児" (newborn infant) and "30歳以上の" (of 30 - year - old or more) are repeated in the output of the baseline-NMT, but they appear only once in the output of the encoder-decoder-reconstructor.

Figure 3: The attention layer in Example 1: Improvement in under-translation (Baseline-NMT vs. Encoder-Decoder-Reconstructor).

Figure 4: The attention layer in Example 2: Improvement in over-translation (Baseline-NMT vs. Encoder-Decoder-Reconstructor).

In addition, Figures 3 and 4 show the attention layers of the baseline-NMT and the encoder-decoder-reconstructor in each example. In Figure 3, although the attention layer of the baseline-NMT attends to the input word "turbulent", the decoder outputs not "乱流" (turbulent) but "検討" (examined) as the 13th word. Thus, the under-translation may result from the hidden layer or the embedding layer rather than the attention layer. In Figure 4, the attention layer of the baseline-NMT repeatedly attends to the input words "newborn infant" and "30 - year - old or more". Consequently, the decoder repeatedly outputs "新生児" (newborn infant) and "30歳以上の" (of 30 - year - old or more). On the other hand, the attention layer of the encoder-decoder-reconstructor attends to the input words almost correctly.

Table 5 shows a comparison of the number of word occurrences for each corpus and model. The columns show (i) the number of word tokens that appear more frequently than their counterparts in the reference and (ii) the number of word tokens that appear more than once but are not included in the reference. Note that these numbers do not include unknown words, so (iii) shows the number of unknown word tokens.

                                            English-Japanese       Japanese-English
Corpus  Model                               (i)    (ii)   (iii)    (i)    (ii)   (iii)
ASPEC   Baseline-NMT                        1,141  378    1,045    951    494    1,085
        +Reconstructor                      988    336    1,042    836    418    1,014
        +Reconstructor (Jointly-Training)   1,292  446    1,147    1,106  525    1,821
NTCIR   Baseline-NMT                        2,122  1,015  1,106    2,521  1,073  1,630
        +Reconstructor                      1,958  922    963      2,187  987    1,422
        +Reconstructor (Jointly-Training)   1,978  916    1,078    2,475  1,107  1,610

Table 5: Numbers of redundant and unknown word tokens.

In all cases, the number of occurrences of redundant words is reduced by the encoder-decoder-reconstructor. Thus, we confirmed that the encoder-decoder-reconstructor achieves a reduction of repeating and missing words while maintaining the quality of translation.

Conclusion

In this paper, we evaluated the encoder-decoder-reconstructor on English-Japanese and Japanese-English translation tasks. In addition, we evaluated the effectiveness of pre-training by comparing it with a jointly-trained model of forward translation and back-translation. Experimental results show that the encoder-decoder-reconstructor offers significant improvement in BLEU scores and alleviates the problem of repeating and missing words in the translation on the English-Japanese translation task. They also show that the encoder-decoder-reconstructor cannot be trained well without pre-training, which proves that we have to train the forward translation model in a manner similar to the conventional attention-based NMT as pre-training.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the 3rd International Conference on Learning Representations (ICLR), pages 1-15.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734.

Shi Feng, Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2016. Improving Attention Modeling with Implicit Distortion and Fertility for Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 3082-3092.

Isao Goto, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K. Tsou. 2013. Overview of the Patent Machine Translation Task at the NTCIR-10 Workshop. Proceedings of the 10th NII Testbeds and Community for Information access Research Conference (NTCIR), pages 260-286.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388-395.

Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Interactive Attention for Neural Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 2174-2185.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage Embedding Models for Neural Machine Translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 955-960.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pages 2204-2208.

Jan Niehues, Eunah Cho, Thanh-Le Ha, and Alex Waibel. 2016. Pre-Translation for Neural Machine Translation. Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 1828-1836.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1683-1692.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural Machine Translation with Reconstruction. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 3097-3103.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling Coverage for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 76-85.