Joint Training for Neural Machine Translation Models with Monolingual Data
Zhirui Zhang†, Shujie Liu‡, Mu Li‡, Ming Zhou‡, Enhong Chen†∗
†University of Science and Technology of China, Hefei, China
‡Microsoft Research
†[email protected] †[email protected]
‡{shujliu,muli,mingzhou}@microsoft.com
∗Corresponding author

Abstract
Monolingual data have been demonstrated to be helpful in improving translation quality of both statistical machine translation (SMT) systems and neural machine translation (NMT) systems, especially in resource-poor or domain adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on training data. In each iteration step, both NMT models are first used to translate monolingual data from one language to the other, forming pseudo-training data for the other NMT model. Then two new NMT models are learned from parallel data together with the pseudo-training data. Both NMT models are expected to improve, and better pseudo-training data can be generated in the next step. Experimental results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve the translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems that are enhanced with monolingual data for model training, including back-translation.
Introduction
Neural machine translation (NMT) performs end-to-end translation based on an encoder-decoder framework (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) and has obtained state-of-the-art performance on many language pairs (Luong, Pham, and Manning 2015; Sennrich, Haddow, and Birch 2016b; Tu et al. 2016; Wu et al. 2016). In the encoder-decoder framework, an encoder first transforms the source sequence into vector representations, based on which a decoder generates the target sequence. This framework brings appealing properties over traditional phrase-based statistical machine translation (SMT) systems (Koehn, Och, and Marcu 2003; Chiang 2007), such as little need for human feature engineering or prior domain knowledge. On the other hand, to train the large number of parameters in the encoder and decoder networks, most NMT systems rely heavily on high-quality parallel data and perform poorly in resource-poor or domain-specific tasks. Unlike bilingual data, monolingual data are usually much easier to collect and more diverse, and have been attractive resources for improving machine translation models since the 1990s, when data-driven machine translation systems were first built.

Monolingual data play a key role in training SMT systems. Additional target monolingual data are usually required to train a powerful language model, which is an important feature of an SMT system's log-linear model. Using source-side monolingual data in SMT has also been explored. Ueffing et al. (2007) introduced a transductive semi-supervised learning method, in which source monolingual sentences are translated and filtered to build pseudo-bilingual data, which are added to the original bilingual data to re-train the SMT model.

For NMT systems, Gulcehre et al. (2015) first tried both shallow and deep fusion methods to integrate an external RNN language model into the encoder-decoder framework. The shallow fusion method simply linearly combines the translation probability and the language model probability, while the deep fusion method connects the RNN language model with the decoder to form a new tightly coupled network. Instead of introducing an explicit language model, Cheng et al. (2016) proposed an autoencoder-based method which encodes and reconstructs monolingual sentences, in which source-to-target and target-to-source NMT models serve as the encoder and decoder respectively.

Sennrich, Haddow, and Birch (2016a) proposed back-translation for data augmentation as another way to leverage target monolingual data. In this method, both the NMT model and the training algorithm are kept unchanged; instead, a new approach is employed to construct the training data: target monolingual sentences are translated with a pre-constructed machine translation system into the source language, and the results are used as additional parallel data to re-train the source-to-target NMT model. Although back-translation has been proven to be robust and effective, one major problem for further improvement is the quality of the training data automatically generated from monolingual sentences. Due to the imperfection of the machine translation system, some of the
incorrect translations are very likely to hurt the performance of the source-to-target model.

In this paper, we present a novel method for making extended use of monolingual data from both the source side and the target side by jointly optimizing a source-to-target NMT model A and a target-to-source NMT model B through an iterative process. In each iteration, these two models serve as helper machine translation systems for each other, as in back-translation: B is used to generate pseudo-training data for model A with target-side monolingual data, and A is used to generate pseudo-training data for model B with source-side monolingual data. The key advantage of our new approach compared with existing work is that the training process can be repeated to obtain further improvements, because after each iteration both model A and model B are expected to be improved with the additional pseudo-training data. Therefore, in the next iteration, better pseudo-training data can be generated with these two improved models, resulting in even better models A and B, and so on.

To jointly optimize the two models in both directions, we design a new semi-supervised training objective, with which the generated training sentence pairs are weighted so that the negative impact of noisy translations can be minimized. Original bilingual sentence pairs are all weighted as 1, while the synthetic sentence pairs are weighted by the normalized model output probability. Similar to the post-processing step described in Ueffing et al. (2007), our weighting mechanism also plays an important role in improving the final translation performance. As we will show in the paper, the overall iterative training process essentially adds a joint EM estimation over the monolingual data to the MLE estimation over bilingual data: the E-step tries to estimate the expectations of translations of the monolingual data, while the M-step updates model parameters with the smoothed translation probability estimation.

Our experiments are conducted on the NIST OpenMT Chinese-English translation task and the WMT English-German translation task. Experimental results demonstrate that our joint training method can significantly improve translation quality of both source-to-target and target-to-source models, compared with back-translation and other strong baselines.

Neural Machine Translation
In this section, we briefly introduce the NMT model used in our work. The NMT model follows the attention-based architecture proposed by Bahdanau, Cho, and Bengio (2014), and it is implemented as an encoder-decoder framework with recurrent neural networks (RNNs). The RNNs are usually implemented as Gated Recurrent Unit (GRU) networks (Cho et al. 2014) (adopted in our work) or Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997). The whole architecture can be divided into three components: encoder, decoder and attention mechanism.
Encoder
The encoder reads the source sentence $X = (x_1, x_2, \ldots, x_T)$ and transforms it into a sequence of hidden states $h = (h_1, h_2, \ldots, h_T)$ using a bi-directional RNN. At each time step $t$, the hidden state $h_t$ is defined as the concatenation of the forward and backward RNN hidden states $[\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $\overrightarrow{h}_t = \mathrm{RNN}(x_t, \overrightarrow{h}_{t-1})$ and $\overleftarrow{h}_t = \mathrm{RNN}(x_t, \overleftarrow{h}_{t+1})$.
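As a concrete illustration, the following is a minimal sketch of such a bi-directional GRU encoder in Python; the GRU cell parameterization and all shapes are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of a bi-directional GRU encoder (illustrative assumptions).
import numpy as np

def gru_step(x, h_prev, W, U, b):
    """One GRU step; W, U, b each pack update/reset/candidate parameters."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h_prev + b[0])))  # update gate
    r = 1 / (1 + np.exp(-(W[1] @ x + U[1] @ h_prev + b[1])))  # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h_prev) + b[2])  # candidate state
    return (1 - z) * h_prev + z * h_tilde

def encode(embeddings, fwd_params, bwd_params, hidden_size):
    """Run forward and backward GRUs and concatenate states per position."""
    T = len(embeddings)
    fwd, bwd = [None] * T, [None] * T
    h = np.zeros(hidden_size)
    for t in range(T):                        # left-to-right pass
        h = gru_step(embeddings[t], h, *fwd_params)
        fwd[t] = h
    h = np.zeros(hidden_size)
    for t in reversed(range(T)):              # right-to-left pass
        h = gru_step(embeddings[t], h, *bwd_params)
        bwd[t] = h
    # h_t = [forward h_t ; backward h_t]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```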
Decoder

The decoder uses another RNN to generate the translation $Y = (y_1, y_2, \ldots, y_{T'})$ based on the hidden states $h$ generated by the encoder. At each time step $i$, the conditional probability of each word $y_i$ from a target vocabulary $V_y$ is computed by
$$p(y_i \mid y_{<i}, X) = \mathrm{softmax}\big(g(y_{i-1}, z_i, c_i)\big) \quad (1)$$
where $g$ is a non-linear function and $z_i$ is the decoder RNN hidden state at step $i$, computed as
$$z_i = \mathrm{RNN}(y_{i-1}, z_{i-1}, c_i) \quad (2)$$
The context vector $c_i$ is a weighted sum of the hidden states $(h_1, h_2, \ldots, h_T)$ with the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_T$ computed by
$$\alpha_t = \frac{\exp\big(a(h_t, z_{i-1})\big)}{\sum_k \exp\big(a(h_k, z_{i-1})\big)} \quad (3)$$
where $a$ is a feed-forward neural network with a single hidden layer.
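The sketch below shows how Equation (3) and the context vector could be computed; the parameterization of the scoring network a(·) (matrices W_h, W_z and vector v) is an assumption for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of the attention weights in Equation (3).
import numpy as np

def attention(encoder_states, z_prev, W_h, W_z, v):
    """Compute alpha_t = softmax_t(a(h_t, z_{i-1})) and the context vector c_i."""
    scores = np.array([v @ np.tanh(W_h @ h_t + W_z @ z_prev)
                       for h_t in encoder_states])       # a(h_t, z_{i-1})
    scores -= scores.max()                               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()        # Equation (3)
    context = sum(a * h for a, h in zip(alpha, encoder_states))  # c_i
    return alpha, context
```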
MLE Training

NMT systems are usually trained to maximize the conditional log-probability of the correct translation given a source sentence with respect to the parameters $\theta$ of the model:
$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{i=1}^{|y^n|} \log p(y_i^n \mid y_{<i}^n, x^n) \quad (4)$$
where $N$ is the number of training sentence pairs.
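For concreteness, here is a minimal sketch of the objective in Equation (4), assuming a hypothetical callable `model(x, prefix)` that returns the next-token distribution p(· | y_<i, x); the helper is illustrative, not part of the paper.

```python
# Sketch of the MLE objective in Equation (4): summed token-level
# log-probabilities of the reference translation.
import math

def sentence_log_likelihood(model, x, y):
    """log p(y | x) = sum_i log p(y_i | y_<i, x)."""
    return sum(math.log(model(x, y[:i])[y[i]]) for i in range(len(y)))

def mle_objective(model, corpus):
    """Equation (4): sum of log-likelihoods over all parallel sentence pairs."""
    return sum(sentence_log_likelihood(model, x, y) for x, y in corpus)
```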
Joint Training for Paired NMT Models

Back-translation fills the gap between the need for parallel data and the availability of monolingual data in NMT model training with the help of machine translation systems. Specifically, given a set of sentences {y_i} in target language Y, a pre-constructed target-to-source machine translation system is used to automatically generate their translations {x_i} in source language X. Then the synthetic sentence pairs {(x_i, y_i)} are used as additional parallel data to train the source-to-target NMT model, together with the original bilingual data.

Our work follows this parallel data synthesis approach, but extends the task setting from solely improving the source-to-target NMT model training with target monolingual data to a paired one: we aim to jointly optimize a source-to-target NMT model M_{x→y} and a target-to-source NMT model M_{y→x} with the aid of monolingual data from both source language X and target language Y. Different from back-translation, in which both automatic translation and NMT model training are performed only once, our method runs the machine translation for monolingual data and updates the NMT models M_{x→y} and M_{y→x} through several iterations. At each iteration step, models M_{x→y} and M_{y→x} serve as each other's pseudo-training data generators: M_{y→x} is used to translate Y into X for M_{x→y}, while M_{x→y} is used to translate X into Y for M_{y→x}.

The joint training process is illustrated in Figure 1, in which the first two iterations are shown. Before the first iteration starts, two initial translation models M⁰_{x→y} and M⁰_{y→x} are pre-trained with the parallel data D = {(x^(n), y^(n))}. This step is denoted as iteration 0 for the sake of consistency.

Figure 1: Illustration of joint-EM training of NMT models in two directions (NMT_{x→y} and NMT_{y→x}) using both source (X) and target (Y) monolingual corpora, combined with bilingual data D. X′ is the synthetic data generated with probability p(y|x^(s)) by translating X using NMT_{x→y}, and Y′ is the synthetic data generated with probability p(x|y^(t)) by translating Y using NMT_{y→x}.

In iteration 1, the two NMT systems based on M_{x→y} and M_{y→x} are first used to translate the monolingual data X = {x^(s)} and Y = {y^(t)}, which forms two synthetic training data sets X′ = {(x^(s), y^(s)_0)} and Y′ = {(x^(t)_0, y^(t))}, where y^(s)_0 and x^(t)_0 denote translations generated by the iteration-0 models. Models M_{x→y} and M_{y→x} are then trained on the updated training data by combining Y′ and X′ respectively with the parallel data D. It is worth noting that we use the n-best translations from an NMT system, and the selected translations are weighted with the translation probabilities from the NMT model.

In iteration 2, the above process is repeated, but the synthetic training data are re-generated with the updated NMT models M_{x→y} and M_{y→x}, which are presumably more accurate. In turn, the learned NMT models M_{x→y} and M_{y→x} are also expected to improve over the first iteration.

The formal procedure is listed in Algorithm 1, which is divided into two major steps: pre-training and joint training. As we will show in the next section, the joint training step essentially adds an EM (Expectation-Maximization) process over the monolingual data in both source and target languages.

Algorithm 1: Joint Training Algorithm for NMT
procedure PRE-TRAINING
    Initialize M_{x→y} and M_{y→x} with random weights θ_{x→y} and θ_{y→x};
    Pre-train M_{x→y} and M_{y→x} on bilingual data D = {(x^(n), y^(n))}_{n=1}^{N} with Equation (4);
end procedure
procedure JOINT-TRAINING
    while not converged do
        Use NMT_{y→x} to generate back-translations x for Y = {y^(t)}_{t=1}^{T} and build the pseudo-parallel corpus Y′ = {(x, y^(t))}_{t=1}^{T};   ▷ E-step for NMT_{x→y}
        Use NMT_{x→y} to generate back-translations y for X = {x^(s)}_{s=1}^{S} and build the pseudo-parallel corpus X′ = {(x^(s), y)}_{s=1}^{S};   ▷ E-step for NMT_{y→x}
        Train M_{x→y} with Equation (10) given the weighted bilingual corpus D ∪ Y′;   ▷ M-step for NMT_{x→y}
        Train M_{y→x} with Equation (12) given the weighted bilingual corpus D ∪ X′;   ▷ M-step for NMT_{y→x}
    end while
end procedure
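To make the procedure concrete, the following is a minimal sketch of the loop in Algorithm 1, assuming hypothetical model objects with `train(weighted_pairs)` and `translate_nbest(sentence, k)` methods (the latter returning (translation, normalized probability) pairs); neither helper is part of the paper.

```python
# Minimal sketch of Algorithm 1 (joint EM training), under the stated assumptions.

def joint_training(m_xy, m_yx, parallel, mono_x, mono_y, iterations=4, k=2):
    bitext = [(x, y, 1.0) for x, y in parallel]   # true pairs get weight 1
    m_xy.train(bitext)                            # pre-training (iteration 0)
    m_yx.train(bitext)

    for _ in range(iterations):
        # E-step for M_xy: back-translate target monolingual data with M_yx,
        # keeping the k-best source translations weighted by p(x | y).
        y_pseudo = [(x, y, w) for y in mono_y
                    for x, w in m_yx.translate_nbest(y, k)]
        # E-step for M_yx: back-translate source monolingual data with M_xy.
        x_pseudo = [(x, y, w) for x in mono_x
                    for y, w in m_xy.translate_nbest(x, k)]
        # M-steps: retrain each model on the true bitext plus its pseudo data.
        m_xy.train(bitext + y_pseudo)             # D ∪ Y′
        m_yx.train(bitext + x_pseudo)             # D ∪ X′
    return m_xy, m_yx
```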
Training Objective

Next we show how to derive our new learning objective for joint training, starting with the case where only one NMT model is involved.

Given a parallel corpus $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$ and a monolingual corpus in the target language $Y = \{y^{(t)}\}_{t=1}^{T}$, the semi-supervised training objective is to maximize the likelihood of both bilingual data and monolingual data:
$$L^*(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p(y^{(t)}) \quad (5)$$
where the first term on the right side denotes the likelihood of bilingual data and the second term represents the likelihood of target-side monolingual data. (Note that the training criterion on the parallel data $D$ is still MLE, maximum likelihood estimation.) Next we introduce the source translations as hidden states for the target sentences and decompose $\log p(y^{(t)})$ as
$$\begin{aligned} \log p(y^{(t)}) &= \log \sum_{x} p(x, y^{(t)}) = \log \sum_{x} Q(x) \frac{p(x, y^{(t)})}{Q(x)} \\ &\geq \sum_{x} Q(x) \log \frac{p(x, y^{(t)})}{Q(x)} \quad \text{(Jensen's inequality)} \\ &= \sum_{x} Q(x) \log p(y^{(t)} \mid x) - \mathrm{KL}\big(Q(x) \,\|\, p(x)\big) \quad (6) \end{aligned}$$
where $x$ is a latent variable representing the source translation of the target sentence $y^{(t)}$, $Q(x)$ is an approximate probability distribution of $x$, $p(x)$ represents the marginal distribution of sentence $x$, and $\mathrm{KL}(Q(x) \| p(x))$ is the Kullback-Leibler divergence between the two distributions. For the equality in Equation (6) to hold, $Q(x)$ must satisfy the condition
$$\frac{p(x, y^{(t)})}{Q(x)} = c \quad (7)$$
where $c$ is a constant that does not depend on $x$. Given $\sum_{x} Q(x) = 1$, $Q(x)$ can be calculated as
$$Q(x) = \frac{p(x, y^{(t)})}{c} = \frac{p(x, y^{(t)})}{\sum_{x} p(x, y^{(t)})} = p^*(x \mid y^{(t)}) \quad (8)$$
where $p^*(x \mid y^{(t)})$ denotes the true target-to-source translation probability. Since it is usually not possible to calculate $p^*(x \mid y^{(t)})$ in practice, we use the translation probability $p(x \mid y^{(t)})$ given by a target-to-source NMT model as $Q(x)$. Combining Equations (5) and (6), we have
$$L^*(\theta_{x \to y}) \geq L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \Big[ \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) - \mathrm{KL}\big(p(x \mid y^{(t)}) \,\|\, p(x)\big) \Big] \quad (9)$$
This means $L(\theta_{x \to y})$ is a lower bound of the true likelihood function $L^*(\theta_{x \to y})$. Since $\mathrm{KL}(p(x \mid y^{(t)}) \| p(x))$ is irrelevant to the parameters $\theta_{x \to y}$, $L(\theta_{x \to y})$ can be simplified as
$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) \quad (10)$$
The first part of $L(\theta_{x \to y})$ is the same as the MLE training objective, while the second part can be optimized with the EM algorithm. We estimate the expectation of the source translation probability $p(x \mid y^{(t)})$ in the E-step, and maximize the second part in the M-step. The E-step uses the target-to-source translation model M_{y→x} to generate the source translations as hidden variables, which are paired with the target sentences to build a new distribution of training data together with the true parallel data D. Therefore, maximizing $L(\theta_{x \to y})$ can be approximated by maximizing the log-likelihood on the new training data.
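Since the sum over all possible translations x is intractable, the paper approximates it with n-best translations weighted by normalized model probabilities. Below is a minimal sketch of that renormalization, assuming the decoder returns a log-probability for each n-best candidate; how exactly the scores are normalized is an assumption here.

```python
# Sketch: approximate Q(x) = p(x | y_t) over the n-best back-translations by
# renormalizing the model scores so the weights sum to 1.
import math

def nbest_weights(scored_nbest):
    """scored_nbest: list of (translation, log_prob). Returns (translation, weight)."""
    m = max(lp for _, lp in scored_nbest)             # stabilize the softmax
    exp_scores = [(t, math.exp(lp - m)) for t, lp in scored_nbest]
    z = sum(s for _, s in exp_scores)
    return [(t, s / z) for t, s in exp_scores]        # weights sum to 1
```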
The translation probability $p(x \mid y^{(t)})$ is used as the weight of the pseudo sentence pairs, which helps to filter out bad translations. It is easy to verify that the back-translation approach (Sennrich, Haddow, and Birch 2016a) is a special case of this formulation of $L(\theta_{x \to y})$, in which $p(x \mid y^{(t)}) = 1$ because only the best translation $M_{y \to x}(y^{(t)})$ from the NMT model is used:
$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p\big(y^{(t)} \mid M_{y \to x}(y^{(t)})\big) \quad (11)$$
Similarly, the likelihood of the NMT model M_{y→x} can be derived as
$$L(\theta_{y \to x}) = \sum_{n=1}^{N} \log p(x^{(n)} \mid y^{(n)}) + \sum_{s=1}^{S} \sum_{y} p(y \mid x^{(s)}) \log p(x^{(s)} \mid y) \quad (12)$$
where $y$ is a target translation (hidden state) of the source sentence $x^{(s)}$. The overall training objective is the sum of the likelihoods in both directions:
$$L(\theta) = L(\theta_{x \to y}) + L(\theta_{y \to x})$$
During the derivation of $L(\theta_{x \to y})$, we use the translation probability $p(x \mid y^{(t)})$ from M_{y→x} as an approximation of the true distribution $p^*(x \mid y^{(t)})$. When $p(x \mid y^{(t)})$ gets closer to $p^*(x \mid y^{(t)})$, we obtain a tighter lower bound of $L^*(\theta_{x \to y})$, gaining more opportunity to improve M_{x→y}. Joint training of paired NMT models is designed to solve this problem when source monolingual data are also available.
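Putting the pieces together, here is a minimal sketch of the weighted semi-supervised loss in Equation (10) (and, symmetrically, Equation (12)), with a hypothetical `log_p(x, y)` returning log p(y|x); the helper is an assumption for illustration.

```python
# Sketch of the weighted objective in Equation (10): true bitext pairs count
# with weight 1, pseudo pairs with their normalized translation probability.

def joint_objective(log_p, bitext, pseudo_pairs):
    """bitext: [(x, y)]; pseudo_pairs: [(x, y, weight)] from n-best back-translation."""
    supervised = sum(log_p(x, y) for x, y in bitext)             # MLE term
    weighted = sum(w * log_p(x, y) for x, y, w in pseudo_pairs)  # EM term
    return supervised + weighted
```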
Experiments

Setup
We evaluate our proposed approach on two language pairs: Chinese↔English and English↔German. In all experiments, we use BLEU (Papineni et al. 2002) as the evaluation metric for translation quality.
Dataset
For Chinese↔English translation, we select our training data from LDC corpora, which consist of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words respectively. (The corpora include LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08, LDC2005T10.) We use 8M Chinese sentences and 8M English sentences randomly extracted from the Xinhua portion of the Gigaword corpus as the monolingual data sets. Any sentence longer than 60 words is removed from the training data (both the bilingual data and the pseudo-bilingual data). For Chinese-English, the NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, NIST 2005, NIST 2008 and NIST 2012 datasets as test sets. In both the validation and test sets, each Chinese sentence has four reference translations. For English-Chinese, we use the NIST datasets in the reverse direction: treating the first English sentence of the four reference translations as the source sentence and the Chinese sentence as the single reference. We limit the vocabulary to the 50K most frequent words on both the source and target sides, and convert the remaining words into the <unk> token.

Direction | System      | NIST2006 | NIST2003 | NIST2005 | NIST2008 | NIST2012 | Average
C→E       | RNNSearch   | 38.61    | 39.39    | 38.31    | 30.04    | 28.48    | 34.97
C→E       | RNNSearch+M | 40.66    | 43.26    | 41.61    | 32.48    | 31.16    | 37.83
C→E       | SS-NMT      | 41.53    | 44.03    | 42.24    | 33.40    | 31.58    | 38.56
C→E       | JT-NMT      | –        | –        | –        | –        | –        | 39.67
E→C       | RNNSearch   | 17.75    | 18.37    | 17.10    | 13.14    | 12.85    | 15.84
E→C       | RNNSearch+M | 21.28    | 21.19    | 19.53    | 16.47    | 15.86    | 18.87
E→C       | SS-NMT      | 21.62    | 22.00    | 19.70    | 17.06    | 16.48    | 19.37
E→C       | JT-NMT      | –        | –        | –        | –        | –        | 20.30

Table 1: Case-insensitive BLEU scores (%) on Chinese↔English translation. "Average" denotes the average BLEU score over all datasets in the same setting. "C" and "E" denote Chinese and English respectively.
Implementation Details
The RNNSearch model proposed by Bahdanau, Cho, and Bengio (2014) is adopted as our baseline, which uses a single-layer GRU-RNN for the encoder and another for the decoder. The size of the word embeddings (for both source and target words) is 256 and the size of the hidden layer is set to 1024. The parameters are initialized using a normal distribution with a mean of 0 and a variance of $\sqrt{2/(d_{row} + d_{col})}$, where $d_{row}$ and $d_{col}$ are the number of rows and columns in the parameter matrix (Glorot and Bengio 2010). Our models are optimized with the Adadelta (Zeiler 2012) algorithm with a mini-batch size of 128. We re-normalize the gradient if its norm is larger than 2.0 (Pascanu, Mikolov, and Bengio 2013). At test time, beam search with beam size 8 is employed to find the best translation, and translation probabilities are normalized by the length of the translation sentences. In a post-processing step, we follow the work of Luong et al. (2015) to handle <unk> replacement.
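For concreteness, a short sketch of the stated initialization and gradient re-normalization rules; treating the initialization scale as sqrt(2 / (d_row + d_col)) is an assumption in this reading, so consider the exact constant illustrative.

```python
# Sketch of the initialization and gradient re-normalization described above.
import numpy as np

def init_weight(d_row, d_col, rng=np.random.default_rng(0)):
    """Normal init with mean 0 and a Glorot-style scale."""
    scale = np.sqrt(2.0 / (d_row + d_col))
    return rng.normal(0.0, scale, size=(d_row, d_col))

def clip_gradient(grad, max_norm=2.0):
    """Re-normalize the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```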
Baseline
Our proposed joint-training approach is compared with three NMT baselines for all translation tasks:
• RNNSearch: Attention-based NMT system (Bahdanau, Cho, and Bengio 2014). Only bilingual corpora are used to train a standard attention-based NMT model.
• RNNSearch+M: Bilingual and target-side monolingual corpora are used to train RNNSearch. We follow Sennrich, Haddow, and Birch (2016a) to construct pseudo-parallel corpora by generating the source side with back-translation of the target-side monolingual data.
• SS-NMT: Semi-supervised NMT training proposed by Cheng et al. (2016). For a fair comparison, in all experiments their method adopts the same settings as our approach, including the same source and target monolingual data.
Chinese ↔ English Translation Result
Table 1 shows the evaluation results of different models on the NIST datasets, in which JT-NMT represents our joint training for NMT using monolingual data. All results are reported based on case-insensitive BLEU.

Compared with RNNSearch, we can see that RNNSearch+M, SS-NMT and JT-NMT all bring significant improvements across the different test sets. Our approach achieves the best results: 4.70 and 4.46 BLEU points of improvement over RNNSearch on average for Chinese-to-English and English-to-Chinese respectively. These results confirm that exploiting massive monolingual corpora improves translation performance.

From Table 1, we find that our JT-NMT achieves better performance than RNNSearch+M across the different test sets, with 1.84 and 1.43 BLEU points of improvement on average in the Chinese-to-English and English-to-Chinese directions respectively. Compared with RNNSearch+M, our joint training approach introduces data weights to better handle poor pseudo-training data, and the joint interactive training can boost the models of the two directions with the help of each other, instead of only using the target-to-source model to help the source-to-target model. Our approach also yields better translations than SS-NMT, with at least 0.93 BLEU points of improvement on average. This result shows that our method can make better use of both source and target monolingual corpora than Cheng et al. (2016)'s approach.

System                          | Architecture                                        | E→D   | D→E
Jean et al. (2015)              | Gated RNN with search + PosUnk                      | 18.97 | –
Jean et al. (2015)              | Gated RNN with search + PosUnk + 500K vocabs        | 19.40 | –
Shen et al. (2016)              | Gated RNN with search + PosUnk + MRT                | 20.45 | –
Luong, Pham, and Manning (2015) | LSTM with 4 layers + dropout + local att. + PosUnk  | 20.90 | –
RNNSearch                       | Gated RNN with search + BPE                         | 19.78 | 24.91
RNNSearch+M                     | Gated RNN with search + BPE + monolingual data      | 21.89 | 26.81
SS-NMT                          | Gated RNN with search + BPE + monolingual data      | 22.64 | 27.30
JT-NMT                          | Gated RNN with search + BPE + monolingual data      | 23.60 | 27.98

Table 2: Case-sensitive BLEU scores (%) on English↔German translation. "PosUnk" denotes Luong et al. (2015)'s technique for handling rare words. "MRT" denotes the minimum risk training proposed in Shen et al. (2016). "BPE" denotes the Byte Pair Encoding proposed by Sennrich, Haddow, and Birch (2016b) for word segmentation. "D" and "E" denote German and English respectively.
English ↔ German Translation Result
For the English↔German translation task, in addition to the baseline system, we also include results of other existing NMT systems, including Jean et al. (2015), Shen et al. (2016) and Luong, Pham, and Manning (2015). To be comparable with other work, all results are reported based on case-sensitive BLEU. Experimental results are shown in Table 2.

We can observe that the baseline RNNSearch with the BPE method achieves better results than Jean et al. (2015), even better than their result using a larger vocabulary of size 500K. Compared with RNNSearch, we observe that RNNSearch+M, SS-NMT and JT-NMT bring significant improvements in both the English-to-German and German-to-English directions, which confirms the effectiveness of leveraging monolingual corpora. Our approach outperforms RNNSearch+M and SS-NMT by a notable margin and obtains the best BLEU scores of 23.60 and 27.98 on the English-to-German and German-to-English test sets respectively. These experimental results further confirm the effectiveness of our joint training mechanism, similar to what is shown in the Chinese↔English translation tasks.
Effect of Joint Training
We further investigate the impact of our joint training approach JT-NMT during the whole training process. Figure 2 shows the BLEU scores on the Chinese↔English and English↔German validation and test sets at each iteration. We can see that more iterations lead to consistently better evaluation results, which verifies that the joint training of NMT models in two directions can boost their translation performance.

In Figure 2, "Iteration 0" shows the BLEU scores of the baseline RNNSearch, and clearly the first few iterations gain the most, especially "Iteration 1". After three iterations, we cannot obtain significant improvement anymore. As we said previously, as the target-to-source model approaches the ideal translation probability, the lower bound of the loss gets closer to the true loss. During training, the closer the lower bound is to the true loss, the smaller the potential gain. Since there is considerable uncertainty during training, the performance sometimes drops a little.

System                | C→E   | E→C   | D→E   | E→D
RNNSearch+M           | 37.83 | 18.87 | 26.81 | 21.89
JT-NMT (Iteration 1)  | –     | –     | –     | –

Table 3: BLEU scores (%) on the Chinese↔English and English↔German translation tasks. For Chinese↔English translation, we list the average results over all test sets. For English↔German translation, we list the results on newstest2014.

JT-NMT (Iteration 1) can be considered a general version of RNNSearch+M, which weights every pseudo sentence pair as 1. From Table 3, we can see that JT-NMT (Iteration 1) slightly surpasses RNNSearch+M on all test datasets, which shows that the weights introduced in our algorithm can filter poor synthetic data and lead to better performance. Our approach assigns low weights to synthetic sentence pairs with poor translations, so as to punish their effect on the model update. The translations are refined and improved in subsequent iterations, as shown in Table 4, which gives the translations of a Chinese sentence in different iterations.

Monolingual: 当终场哨声响起，意大利首都罗马沸腾了。(dang zhongchang shaosheng xiang qi, yidali shoudu luoma feiteng le.)
Reference: when the final whistle sounded , the italian capital of rome boiled .
Translation [Iteration 0]: the italian capital of rome was boiling with the rome .
Translation [Iteration 1]: the italian capital of rome was boiling with the sound of the end of the door .
Translation [Iteration 4]: when the final whistle sounded , the italian capital of rome was boiling .

Table 4: Example translations of a Chinese sentence in different iterations.
Related Work
Neural machine translation has drawn more and more attention in recent years (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015; Jean et al. 2015; Tu et al. 2016; Wu et al. 2016). For the original NMT system, only parallel corpora can be used for model training with the MLE method, so much research in the literature attempts to exploit massive monolingual corpora. Gulcehre et al. (2015) first investigated the integration of monolingual data for neural machine translation. They train monolingual language models independently, which are integrated into the NMT system through the proposed shallow and deep fusion methods. Sennrich, Haddow, and Birch (2016a) propose to generate synthetic bilingual data by translating target monolingual sentences into source-language sentences, and the mixture of the original bilingual data and the synthetic parallel data is used to retrain the NMT system. As an extension of their approach, our approach introduces translation probabilities from the target-to-source model as weights of synthetic parallel sentences to punish poor pseudo-parallel sentences, and further uses interactive training of NMT models in two directions to refine them.

Figure 2: BLEU scores (%) on Chinese↔English and English↔German validation and test sets for JT-NMT during the training process, with panels (a) Chinese-English, (b) English-Chinese, (c) German-English and (d) English-German. "Dev" denotes the results on validation datasets, while "Test" (or "Average" for the NIST tasks) denotes the results on test datasets.

Recently, Zhang and Zong (2016) proposed a multi-task learning framework to exploit source-side monolingual data, in which they jointly perform machine translation on synthetic bilingual data and sentence reordering with source-side monolingual data. Cheng et al. (2016) reconstruct monolingual data with an autoencoder, in which the source-to-target and target-to-source translation models form a closed loop and are jointly updated. Different from their method, our approach extends Sennrich, Haddow, and Birch (2016a) by directly introducing source-side monolingual data to improve reverse NMT models and adopts an EM algorithm to iteratively update the bidirectional NMT models. Our approach can better exploit both target and source monolingual data, while they report no improvement when using both target and source monolingual data compared with using just target monolingual data. He et al. (2016) treat the source-to-target and target-to-source models as the primal and dual tasks respectively; similar to the work of Cheng et al. (2016), they also employ round-trip translations of each monolingual sentence to obtain feedback signals. Ramachandran, Liu, and Le (2017) adopt pre-trained weights of two language models to initialize the encoder and decoder of a seq2seq model, and then fine-tune it with labeled data. Their approach is complementary to our mechanism: leveraging pre-trained language models to initialize the bidirectional NMT models may lead to additional gains.
Conclusion
In this paper, we propose a new semi-supervised training approach that integrates the training of a pair of translation models in a unified learning process, with the help of monolingual data from both the source and target sides. In our method, a joint-EM training algorithm is employed to optimize the two translation models cooperatively, so that the two models can mutually boost their translation performance. The translation probability of the other model is used as a weight to estimate translation accuracy and punish bad translations. Empirical evaluations on Chinese↔English and English↔German translation tasks demonstrate that our approach leads to significant improvements compared with strong baseline systems. In future work, we plan to extend this method to jointly train multiple NMT systems for three or more languages using massive monolingual data.
Acknowledgments
This research was partially supported by grants from the National Natural Science Foundation of China (Grants No. 61727809, 61325010 and U1605251). We appreciate Dongdong Zhang, Shuangzhi Wu, Wenhu Chen and Guanlin Li for the fruitful discussions. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments.

References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Semi-supervised learning for neural machine translation. In Proceedings of ACL 2016.
Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, 249–256.
Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
Jean, S.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL 2015.
Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In EMNLP, volume 3, 413.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In HLT-NAACL.
Luong, T.; Sutskever, I.; Le, Q.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL 2015.
Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 1310–1318.
Ramachandran, P.; Liu, P.; and Le, Q. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of EMNLP 2017.
Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of ACL 2016.
Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proceedings of ACL 2016.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proceedings of ACL 2016.
Ueffing, N.; Haffari, G.; and Sarkar, A. 2007. Transductive learning for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, 25.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, L.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G. S.; Hughes, M.; and Dean, J. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zhang, J., and Zong, C. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP 2016.