Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation
Jinyi Zhang
Graduate School of Engineering, Gifu University, Gifu, Japan. [email protected]
Tadahiro Matsumoto
Department of Electrical, Electronic and Computer Engineering, Gifu University, Gifu, Japan. [email protected]
Abstract
Neural Machine Translation (NMT) has been proven to achieve impressive results. NMT translation results depend strongly on the size and quality of parallel corpora. Nevertheless, for many language pairs, no rich-resource parallel corpora exist. As described in this paper, we propose a corpus augmentation method that segments long sentences in a corpus using back-translation and generates pseudo-parallel sentence pairs. Experiment results of Japanese-Chinese and Chinese-Japanese translation with the Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC) show that the method improves translation performance.
1 Introduction

Neural Machine Translation (NMT) has produced remarkable results with large-scale parallel corpora. However, for low-resource languages and domain-defined translation tasks, the parallel corpora are small, and translation performance is reduced considerably (Koehn and Knowles, 2017). Therefore, the study of NMT under conditions of low-resource language corpora has high practical value.

As described in this paper, we propose a corpus augmentation method that segments long sentences of the corpus into partial sentences, applies back-translation, and generates pseudo-parallel sentence pairs. The enlarged corpus can improve translation performance: in experiments on the Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC) as the low-resource corpus, the translation results surpass the baseline in both the Japanese-Chinese and Chinese-Japanese directions.

The main contributions of this paper are the following. We demonstrate that the translation performance of NMT systems can be improved by mixing generated pseudo-parallel sentence pairs into the training data, with no monolingual data and without changing the neural network architecture. This capability makes our approach applicable to different NMT architectures.
2 Related Work

Expanding the parallel corpora is an effective means of improving translation quality for NMT in low-resource languages. A parallel corpus can be constructed quickly using back-translation with monolingual target data (Sennrich et al., 2015). One study reported by Sennrich et al. (2017) also showed that even simply duplicating the monolingual target data and using it as the source data was sufficient to realize some benefits. Moreover, a pseudo-parallel corpus can be constructed using the copy method, i.e., the target language sentences are copied as the corresponding source language sentences (Currey et al., 2017), which illustrates that even poor translations can be beneficial. Data augmentation for low-frequency words has also been proven an effective method (Fadaee et al., 2017).

Regarding back-translation, Gwinnup et al. (2017) implemented their NMT system by applying back-translation iteratively. Lample et al. (2018) explored the use of generated back-translated data, aided by denoising with a language model trained on the target side. Translation performance can also be improved by iterative back-translation in both high-resource and low-resource scenarios (Poncelas et al., 2018). A more refined idea of back-translation is the dual learning approach of He et al. (2016), which integrates training on parallel data and training on monolingual data via round-tripping.

3 NMT and ASPEC-JC Corpus
For this research, we follow the NMT architecture of Luong et al. (2015), which is implemented as a global attentional encoder-decoder neural network with Long Short-Term Memory (LSTM). We simply use it at the character level, because character-level translation performs better than word-level translation between Japanese and Chinese. However, it is noteworthy that our proposed method is not specific to this architecture.

We conducted experiments with the ASPEC-JC corpus, which was constructed by manually translating Japanese scientific papers into Chinese (Nakazawa et al., 2016). ASPEC-JC comprises four parts: training data (672,315 sentence pairs), development data (2,090 sentence pairs), development-test data (2,148 sentence pairs) and test data (2,107 sentence pairs), on the assumption that they would be used for machine translation research.

We chose ASPEC-JC as a low-resource corpus: compared with other language pairs such as English-French, which usually comprise millions of parallel sentences, the ASPEC-JC corpus has only about 672k sentences. We randomly extracted 300k sentence pairs from the training data for the experiments.
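To illustrate the character-level setting described above (a minimal sketch, not the paper's code; the function name is ours), each non-space character of a Chinese or Japanese sentence becomes one token, so no word segmenter is needed on the translation side:

```python
def to_char_tokens(sentence):
    """Character-level tokenization for Chinese/Japanese text:
    every non-whitespace character becomes its own token."""
    return [ch for ch in sentence if not ch.isspace()]

# Word-segmented input collapses to one token per character:
print(to_char_tokens("机器 翻译"))  # ['机', '器', '翻', '译']
```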
4 Proposed Method

Sennrich et al. (2015) proposed a method to extend parallel corpora by back-translating target language sentences in monolingual corpora to obtain pseudo-source sentences; the pseudo-source sentences together with the original target sentences are then added to the parallel corpus.

Our method expands the existing parallel corpus with itself, not with any monolingual data, unlike back-translation methods that use monolingual data (Sennrich et al., 2015; Currey et al., 2017; Fadaee et al., 2017). Moreover, our method can be combined with other corpus augmentation methods. Our augmentation process includes the following phases: 1) splitting ‘long’ parallel sentence pairs of the corpus into parallel partial sentence pairs, 2) back-translating the target partial sentences, and 3) constructing parallel sentence pairs by combining the source and the back-translated target partial sentences. To be precise, a ‘long’ sentence above means a sentence that contains more than one punctuation mark.
4.1 Generation of Pseudo-parallel Sentence Pairs

The following procedure generates parallel partial sentence pairs from long parallel sentence pairs.
1. Obtain the word alignment information from tokenized Japanese-Chinese parallel sentences.
2. Split the long parallel sentences into segments at the punctuation symbols, such as “,”, “;”, “:”. Figure 1 presents an example of the word alignment information and the segments of a sentence pair.
3. Obtain source-target segment alignments: for each source segment s-seg_i and target segment t-seg_j, count the words in s-seg_i that correspond to the words in t-seg_j according to the word alignment information. The numerical values on the arrows in Figure 1 represent the rate of the correspondence relation between the segments. We infer that s-seg_i corresponds to t-seg_j if the rate is greater than or equal to a threshold value θ. In this research, we set θ = 0. .
4. Obtain target-source segment alignments according to the procedure in step 3.
5. Concatenate multiple segments to form a one-to-one relation if there is a one-to-many or many-to-many relation between the segments.

In Figure 2, each sentence is divided into three segments; thereby, two parallel partial sentences are generated.

Using the generated parallel partial sentences, pseudo-parallel sentences are constructed according to the following procedure.
1. Back-translate the target partial sentences into the source language with a translation model built from the parallel data.
2. Create a pseudo-source sentence that is partly different from the original source sentence by replacing a part of the original sentence with a partial sentence obtained through back-translation. For example, if a sentence is divided into three partial sentences, then three pseudo-source sentences will be created.
[Figure 1 (image): a Japanese sentence and its Chinese translation, each divided into segments at punctuation marks, with arrows between corresponding segments labeled by word-alignment correspondence rates. English translation of the example sentence: “Vanilla stimulation did not differ between the two groups, with toluene stimulation only in patients, broadly in the central nervous system under the tentorium…”]

Figure 1: Example of word alignment information and sentence segments.
[Figure 2 (image): segment correspondences for the example sentence pair. J → C rates: 0: [(0, 0.5), (1, 0.5)]; 1: [(2, 1.0)]; 2: [(2, 1.0)]. C → J rates: 0: [(0, 1.0)]; 1: [(0, 1.0)]; 2: [(2, 0.73), (1, 0.27)]. Resulting mapping: [[[0], [0, 1]], [[1, 2], [2]]]. English translation of the example sentence: “Vanilla stimulation did not differ between the two groups, with toluene stimulation only in patients, broadly abnormal in the central nervous system under the tentorium.”]

Figure 2: Examples of generated parallel partial sentences.
3. Copy the target sentences corresponding to the created pseudo-source sentences to produce pseudo-parallel sentences.
4. Add the generated pseudo-parallel sentences to the original parallel corpus.
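The two procedures above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: word alignments are fast_align-style (source index, target index) pairs, tokens of recovered partial sentences are joined with spaces, and the threshold defaults to a hypothetical θ = 0.5. All function names are ours.

```python
from collections import defaultdict

# Punctuation marks treated as segment boundaries (ASCII and full-width forms).
PUNCT = {",", ";", ":", "，", "、", "；", "："}

def split_segments(tokens):
    """Step 2: split a token list into segments at punctuation symbols."""
    segs, cur = [], []
    for tok in tokens:
        cur.append(tok)
        if tok in PUNCT:
            segs.append(cur)
            cur = []
    if cur:
        segs.append(cur)
    return segs

def _pos_to_seg(segs):
    """Map each token position to the id of the segment containing it."""
    mapping, pos = {}, 0
    for i, seg in enumerate(segs):
        for _ in seg:
            mapping[pos] = i
            pos += 1
    return mapping

def _links(align, a_pos2seg, b_pos2seg, n_a, theta):
    """Step 3: segment pairs (i, j) where at least a fraction theta of the
    aligned words of segment i fall inside segment j."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for a, b in align:
        counts[a_pos2seg[a]][b_pos2seg[b]] += 1
        totals[a_pos2seg[a]] += 1
    return {(i, j)
            for i in range(n_a)
            for j, c in counts[i].items()
            if c / totals[i] >= theta}

def partial_pairs(src_tokens, tgt_tokens, align, theta=0.5):
    """Steps 3-5: threshold the links in both directions, then merge
    one-to-many relations by taking connected components of segments."""
    s_segs, t_segs = split_segments(src_tokens), split_segments(tgt_tokens)
    s_map, t_map = _pos_to_seg(s_segs), _pos_to_seg(t_segs)
    fwd = _links(align, s_map, t_map, len(s_segs), theta)
    rev = {(s, t) for (t, s) in
           _links([(b, a) for a, b in align], t_map, s_map, len(t_segs), theta)}
    adj = defaultdict(set)
    for i, j in fwd | rev:
        adj[("s", i)].add(("t", j))
        adj[("t", j)].add(("s", i))
    seen, pairs = set(), []
    for node in sorted(adj):
        if node in seen:
            continue
        stack, comp = [node], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.append(cur)
            stack.extend(adj[cur] - seen)
        s_ids = sorted(i for kind, i in comp if kind == "s")
        t_ids = sorted(j for kind, j in comp if kind == "t")
        if s_ids and t_ids:  # one parallel partial sentence pair per component
            pairs.append((" ".join(t for i in s_ids for t in s_segs[i]),
                          " ".join(t for j in t_ids for t in t_segs[j])))
    return pairs

def pseudo_sources(src_partials, back_translated):
    """Pseudo-parallel step 2: create one pseudo-source sentence per partial
    by replacing that partial with its back-translation."""
    return ["".join(src_partials[:k] + [bt] + src_partials[k + 1:])
            for k, bt in enumerate(back_translated)]
```

For a pair whose segments align one-to-one, `partial_pairs` returns each segment pair directly; when a segment aligns to two segments on the other side, the component merge in step 5 concatenates them into a single pair, matching the mapping shown in Figure 2.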
5 Experiments

We follow the NMT architecture of Luong et al. (2015) and implement it using OpenNMT (Klein et al., 2017). The model has one layer with 512 cells; the embedding size is 512. The parameters are uniformly initialized in (−. , .), using plain SGD, starting with a learning rate of 1 until epoch 6 and subsequently multiplied by 0.5 for each epoch. The max-batch size is 100. The normalized gradient is rescaled whenever its norm exceeds 1. Because the amount of training data (300k as the baseline) is relatively small, the dropout probability is set to 0.5 to avoid overfitting. Decoding is performed by beam search with a beam size of 5. We segment the Chinese and Japanese sentences into words using Jieba (http://github.com/fxsjy/jieba) and MeCab (http://taku910.github.io/mecab). We employed fast_align (http://github.com/clab/fast_align) to obtain word alignment information, which was symmetrized using the included atools command.

The average of the BLEU scores from the epoch at which the validation perplexity (perplexity on the development data) stopped improving through epoch 16 was taken as the evaluation BLEU value.
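The learning-rate schedule described above (constant at 1 through epoch 6, then multiplied by 0.5 each subsequent epoch) can be written out explicitly; this is a small sketch, and the function name is ours:

```python
def sgd_learning_rate(epoch, base=1.0, decay=0.5, start_decay_epoch=6):
    """Plain-SGD schedule: constant `base` through `start_decay_epoch`,
    then multiplied by `decay` once per subsequent epoch."""
    if epoch <= start_decay_epoch:
        return base
    return base * decay ** (epoch - start_decay_epoch)

# Epochs 1-6 train at 1.0; epoch 7 at 0.5, epoch 8 at 0.25, and so on.
```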
The translation results are presented in Table 1 (for 300k sentence pairs). “Baseline” is a character-level translation with the 300k original training data. The back-translation models for corpus augmentation are constructed using the 300k original training data of “Baseline”. “Copied” is the method that adds duplicate copies of both the source and target sides of the training data the same number of times as the proposed method does; this experiment aims to highlight differences between the generated pseudo-parallel sentence pairs and unchanged sentence pairs. “Partial” is the method that augments the corpus with parallel partial sentences generated by the procedure in Section 4.1, without back-translating and mixing the partial sentences; this experiment aims to confirm that the mixing step (Section 4.1, step 2) is necessary. This method expands the parallel corpus from 300k sentence pairs to 984k sentence pairs in both directions. “Back-translation” is the back-translation method that back-translates the same data as the proposed method does (218k from the original training data); this experiment aims to compare the proposed method with the back-translation method (Sennrich et al., 2015) on the same back-translated data.

Table 1: Experiment results of 300k training data. Translation directions are designated as Japanese → Chinese (JC, J → C) and Chinese → Japanese (CJ, C → J) in the table.

Method            sentences             translated            J → C BLEU (%)   C → J BLEU (%)
Baseline          300k(JC), 300k(CJ)    0                     38.7             37.9
Copied            952k(JC), 952k(CJ)    0                     39.2 (+0.5)      39.8 (+1.9)
Partial           984k(JC), 984k(CJ)    0                     39.2 (+0.5)      39.2 (+1.3)
Back-translation  518k(JC), 518k(CJ)    218k(JC), 218k(CJ)    39.4 (+0.7)      39.4 (+1.5)
Proposed          952k(JC), 952k(CJ)    218k(JC), 218k(CJ)

The “Proposed” method improved over the baseline in both J → C and C → J. These results demonstrate that the proposed method is effective for extending a small-scale parallel corpus to improve NMT performance.

The experiments described above prove the effectiveness of the proposed method. Notably, our approach is based only on the original parallel data and does not require any additional monolingual data, unlike the back-translation method of Sennrich et al. (2015). Most corpus augmentation methods pair monolingual training data with automatic back-translations and then treat them as additional parallel training data. Therefore, we added comparison experiments: we used 300k sentences as the original data and the remaining 372k sentences as the monolingual data.

Translation results of the comparison experiment are presented in Table 2. “+Proposed” back-translates 508k and 513k sentences from the “300k+mono” data (672k training data), so that the numbers of sentence pairs increase from 672k to 2,255k and 2,200k in the two directions. The proposed method produced higher BLEU scores than the original monolingual method. These comparison experiments demonstrate that our proposed method can further augment data extended by other corpus augmentation methods to yield better translation performance. In the future, we plan to combine the proposed method with other augmentation approaches, as our results suggest this may be more beneficial than back-translation alone. Salient benefits of the proposed method are that it requires no monolingual data and that, without changing the neural network architecture, it can generate more pseudo-parallel sentences. Moreover, it can be combined with other augmentation methods.

Table 2: Experiment results of 300k training data with 372k monolingual data. Translation directions are designated as Japanese → Chinese (JC, J → C) and Chinese → Japanese (CJ, C → J) in the table.

Method            sentences             translated            J → C BLEU (%)   C → J BLEU (%)
6 Conclusion

In this paper, we proposed a simple but effective approach to augment the NMT corpus for low-resource language pairs by segmenting long sentences in the corpus, using back-translation, and generating pseudo-parallel sentence pairs. We demonstrated that this approach engenders generation of more pseudo-parallel sentences; consequently, we obtained higher translation quality for NMT. Future studies should include more comparative experiments using other language pairs with different amounts of data.

References
Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148–156. Association for Computational Linguistics.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data Augmentation for Low-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada.

Jeremy Gwinnup, Timothy Anderson, Grant Erdmann, Katherine Young, Michaeel Kazi, Elizabeth Salesky, Brian Thompson, and Jonathan Taylor. 2017. The AFRL-MITLL WMT17 Systems: Old, New, Borrowed, BLEU. In Proceedings of the Second Conference on Machine Translation, pages 303–309. Association for Computational Linguistics.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, pages 820–828. Curran Associates, Inc.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. CoRR, abs/1701.02810.

Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. CoRR, abs/1706.03872.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-Based & Neural Unsupervised Machine Translation. CoRR, abs/1804.07755.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. ACL.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating Backtranslation in Neural Machine Translation. CoRR, abs/1804.06189.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's Neural MT Systems for WMT17. CoRR, abs/1708.00726.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving Neural Machine Translation Models with Monolingual Data.