Joint Training for Neural Machine Translation Models with Monolingual Data
Zhirui Zhang†, Shujie Liu‡, Mu Li‡, Ming Zhou‡, Enhong Chen†∗
†University of Science and Technology of China, Hefei, China
‡Microsoft Research
†[email protected] †[email protected]
‡{shujliu,muli,mingzhou}@microsoft.com
∗Corresponding author

Abstract
Monolingual data have been demonstrated to be helpful in improving translation quality of both statistical machine translation (SMT) systems and neural machine translation (NMT) systems, especially in resource-poor or domain adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on training data. In each iteration step, both NMT models are first used to translate monolingual data from one language to the other, forming pseudo-training data for the other NMT model. Then two new NMT models are learned from parallel data together with the pseudo-training data. Both NMT models are expected to improve, and better pseudo-training data can be generated in the next step. Experimental results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve the translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems that are enhanced with monolingual data for model training, including back-translation.
Introduction
Neural machine translation (NMT) performs end-to-end translation based on an encoder-decoder framework (Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) and has obtained state-of-the-art performance on many language pairs (Luong, Pham, and Manning 2015; Sennrich, Haddow, and Birch 2016b; Tu et al. 2016; Wu et al. 2016). In the encoder-decoder framework, an encoder first transforms the source sequence into vector representations, based on which a decoder generates the target sequence. This framework brings appealing properties over traditional phrase-based statistical machine translation (SMT) systems (Koehn, Och, and Marcu 2003; Chiang 2007), such as little need for human feature engineering or prior domain knowledge. On the other hand, to train the large number of parameters in the encoder and decoder networks, most NMT systems rely heavily on high-quality parallel data and perform poorly in resource-poor or domain-specific tasks. Unlike bilingual data, monolingual data are usually much easier to collect and more diverse, and have been attractive resources for improving machine translation models since the 1990s, when data-driven machine translation systems were first built.

Monolingual data play a key role in training SMT systems. Additional target monolingual data are usually required to train a powerful language model, which is an important feature of an SMT system's log-linear model. Using source-side monolingual data in SMT has also been explored. Ueffing et al. (2007) introduced a transductive semi-supervised learning method, in which source monolingual sentences are translated and filtered to build pseudo-bilingual data, which are added to the original bilingual data to re-train the SMT model.

For NMT systems, Gulcehre et al. (2015) first tried both shallow and deep fusion methods to integrate an external RNN language model into the encoder-decoder framework. The shallow fusion method simply linearly combines the translation probability and the language model probability, while the deep fusion method connects the RNN language model with the decoder to form a new tightly coupled network. Instead of introducing an explicit language model, Cheng et al. (2016) proposed an autoencoder-based method which encodes and reconstructs monolingual sentences, in which source-to-target and target-to-source NMT models serve as the encoder and decoder respectively.

Sennrich, Haddow, and Birch (2016a) proposed back-translation for data augmentation as another way to leverage target monolingual data. In this method, both the NMT model and the training algorithm are kept unchanged; instead, a new approach is employed to construct the training data: target monolingual sentences are translated with a pre-constructed machine translation system into the source language, and the results are used as additional parallel data to re-train the source-to-target NMT model. Although back-translation has been proven to be robust and effective, one major problem for further improvement is the quality of the training data automatically generated from monolingual sentences. Due to the imperfection of the machine translation system, some of the
incorrect translations are very likely to hurt the performance of the source-to-target model.

In this paper, we present a novel method for making extended use of monolingual data from both the source side and the target side by jointly optimizing a source-to-target NMT model A and a target-to-source NMT model B through an iterative process. In each iteration, these two models serve as helper machine translation systems for each other, as in back-translation: B is used to generate pseudo-training data for model A with target-side monolingual data, and A is used to generate pseudo-training data for model B with source-side monolingual data. The key advantage of our new approach compared with existing work is that the training process can be repeated to obtain further improvements, because after each iteration both model A and model B are expected to be improved with the additional pseudo-training data. Therefore, in the next iteration, better pseudo-training data can be generated with these two improved models, resulting in even better models A and B, and so on.

To jointly optimize the two models in both directions, we design a new semi-supervised training objective, with which the generated training sentence pairs are weighted so that the negative impact of noisy translations can be minimized. Original bilingual sentence pairs are all weighted as 1, while the synthetic sentence pairs are weighted by the normalized model output probability. Similar to the post-processing step described in Ueffing et al. (2007), our weighting mechanism also plays an important role in improving the final translation performance. As we will show in the paper, the overall iterative training process essentially adds a joint EM estimation over the monolingual data to the MLE estimation over bilingual data: the E-step tries to estimate the expectations of translations of the monolingual data, while the M-step updates model parameters with the smoothed translation probability estimation.

Our experiments are conducted on the NIST OpenMT Chinese-English translation task and the WMT English-German translation task. Experimental results demonstrate that our joint training method can significantly improve translation quality of both source-to-target and target-to-source models, compared with back-translation and other strong baselines.

Neural Machine Translation
In this section, we briefly introduce the NMT model used in our work. The NMT model follows the attention-based architecture proposed by Bahdanau, Cho, and Bengio (2014), and it is implemented as an encoder-decoder framework with recurrent neural networks (RNNs). The RNNs are usually implemented as Gated Recurrent Unit (GRU) networks (Cho et al. 2014) (adopted in our work) or Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997). The whole architecture can be divided into three components: encoder, decoder and attention mechanism.
Encoder
The encoder reads the source sentence $X = (x_1, x_2, \ldots, x_T)$ and transforms it into a sequence of hidden states $h = (h_1, h_2, \ldots, h_T)$ using a bi-directional RNN. At each time step $t$, the hidden state $h_t$ is defined as the concatenation of the forward and backward RNN hidden states $[\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $\overrightarrow{h}_t = \mathrm{RNN}(x_t, \overrightarrow{h}_{t-1})$ and $\overleftarrow{h}_t = \mathrm{RNN}(x_t, \overleftarrow{h}_{t+1})$.
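As a concrete illustration, the following is a minimal sketch of such a bi-directional GRU encoder in Python; the GRU cell parameterization and all shapes are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of a bi-directional GRU encoder (illustrative assumptions).
import numpy as np

def gru_step(x, h_prev, W, U, b):
    """One GRU step; W, U, b each pack update/reset/candidate parameters."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h_prev + b[0])))  # update gate
    r = 1 / (1 + np.exp(-(W[1] @ x + U[1] @ h_prev + b[1])))  # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h_prev) + b[2])  # candidate state
    return (1 - z) * h_prev + z * h_tilde

def encode(embeddings, fwd_params, bwd_params, hidden_size):
    """Run forward and backward GRUs and concatenate states per position."""
    T = len(embeddings)
    fwd, bwd = [None] * T, [None] * T
    h = np.zeros(hidden_size)
    for t in range(T):                        # left-to-right pass
        h = gru_step(embeddings[t], h, *fwd_params)
        fwd[t] = h
    h = np.zeros(hidden_size)
    for t in reversed(range(T)):              # right-to-left pass
        h = gru_step(embeddings[t], h, *bwd_params)
        bwd[t] = h
    # h_t = [forward h_t ; backward h_t]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```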
Decoder

The decoder uses another RNN to generate the translation $Y = (y_1, y_2, \ldots, y_{T'})$ based on the hidden states $h$ generated by the encoder. At each time step $i$, the conditional probability of each word $y_i$ from a target vocabulary $V_y$ is computed by
$$p(y_i \mid y_{<i}, X) = \mathrm{softmax}\big(g(y_{i-1}, z_i, c_i)\big) \quad (1)$$
where $g$ is a non-linear function and $z_i$ is the decoder RNN hidden state at step $i$, computed as
$$z_i = \mathrm{RNN}(y_{i-1}, z_{i-1}, c_i) \quad (2)$$
The context vector $c_i$ is a weighted sum of the hidden states $(h_1, h_2, \ldots, h_T)$ with the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_T$ computed by
$$\alpha_t = \frac{\exp\big(a(h_t, z_{i-1})\big)}{\sum_k \exp\big(a(h_k, z_{i-1})\big)} \quad (3)$$
where $a$ is a feed-forward neural network with a single hidden layer.
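The sketch below shows how Equation (3) and the context vector could be computed; the parameterization of the scoring network a(·) (matrices W_h, W_z and vector v) is an assumption for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of the attention weights in Equation (3).
import numpy as np

def attention(encoder_states, z_prev, W_h, W_z, v):
    """Compute alpha_t = softmax_t(a(h_t, z_{i-1})) and the context vector c_i."""
    scores = np.array([v @ np.tanh(W_h @ h_t + W_z @ z_prev)
                       for h_t in encoder_states])       # a(h_t, z_{i-1})
    scores -= scores.max()                               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()        # Equation (3)
    context = sum(a * h for a, h in zip(alpha, encoder_states))  # c_i
    return alpha, context
```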
MLE Training

NMT systems are usually trained to maximize the conditional log-probability of the correct translation given a source sentence with respect to the parameters $\theta$ of the model:
$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{i=1}^{|y^n|} \log p(y_i^n \mid y_{<i}^n, x^n) \quad (4)$$
where $N$ is the number of training sentence pairs.
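For concreteness, here is a minimal sketch of the objective in Equation (4), assuming a hypothetical callable `model(x, prefix)` that returns the next-token distribution p(· | y_<i, x); the helper is illustrative, not part of the paper.

```python
# Sketch of the MLE objective in Equation (4): summed token-level
# log-probabilities of the reference translation.
import math

def sentence_log_likelihood(model, x, y):
    """log p(y | x) = sum_i log p(y_i | y_<i, x)."""
    return sum(math.log(model(x, y[:i])[y[i]]) for i in range(len(y)))

def mle_objective(model, corpus):
    """Equation (4): sum of log-likelihoods over all parallel sentence pairs."""
    return sum(sentence_log_likelihood(model, x, y) for x, y in corpus)
```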
Joint Training for Paired NMT Models

Back-translation fills the gap between the need for parallel data and the availability of monolingual data in NMT model training with the help of machine translation systems. Specifically, given a set of sentences {y_i} in target language Y, a pre-constructed target-to-source machine translation system is used to automatically generate their translations {x_i} in source language X. Then the synthetic sentence pairs {(x_i, y_i)} are used as additional parallel data to train the source-to-target NMT model, together with the original bilingual data.

Our work follows this parallel data synthesis approach, but extends the task setting from solely improving the source-to-target NMT model training with target monolingual data to a paired one: we aim to jointly optimize a source-to-target NMT model M_{x→y} and a target-to-source NMT model M_{y→x} with the aid of monolingual data from both source language X and target language Y. Different from back-translation, in which both automatic translation and NMT model training are performed only once, our method runs the machine translation for monolingual data and updates the NMT models M_{x→y} and M_{y→x} through several iterations. At each iteration step, models M_{x→y} and M_{y→x} serve as each other's pseudo-training data generators: M_{y→x} is used to translate Y into X for M_{x→y}, while M_{x→y} is used to translate X into Y for M_{y→x}.

The joint training process is illustrated in Figure 1, in which the first two iterations are shown. Before the first iteration starts, two initial translation models M⁰_{x→y} and M⁰_{y→x} are pre-trained with the parallel data D = {(x^(n), y^(n))}. This step is denoted as iteration 0 for the sake of consistency.

Figure 1: Illustration of joint-EM training of NMT models in two directions (NMT_{x→y} and NMT_{y→x}) using both source (X) and target (Y) monolingual corpora, combined with bilingual data D. X′ is the synthetic data generated with probability p(y|x^(s)) by translating X using NMT_{x→y}, and Y′ is the synthetic data generated with probability p(x|y^(t)) by translating Y using NMT_{y→x}.

In iteration 1, the two NMT systems based on M_{x→y} and M_{y→x} are first used to translate the monolingual data X = {x^(s)} and Y = {y^(t)}, which forms two synthetic training data sets X′ = {(x^(s), y^(s)_0)} and Y′ = {(x^(t)_0, y^(t))}, where y^(s)_0 and x^(t)_0 denote translations generated by the iteration-0 models. Models M_{x→y} and M_{y→x} are then trained on the updated training data by combining Y′ and X′ respectively with the parallel data D. It is worth noting that we use the n-best translations from an NMT system, and the selected translations are weighted with the translation probabilities from the NMT model.

In iteration 2, the above process is repeated, but the synthetic training data are re-generated with the updated NMT models M_{x→y} and M_{y→x}, which are presumably more accurate. In turn, the learned NMT models M_{x→y} and M_{y→x} are also expected to improve over the first iteration.

The formal procedure is listed in Algorithm 1, which is divided into two major steps: pre-training and joint training. As we will show in the next section, the joint training step essentially adds an EM (Expectation-Maximization) process over the monolingual data in both source and target languages.

Algorithm 1: Joint Training Algorithm for NMT
procedure PRE-TRAINING
    Initialize M_{x→y} and M_{y→x} with random weights θ_{x→y} and θ_{y→x};
    Pre-train M_{x→y} and M_{y→x} on bilingual data D = {(x^(n), y^(n))}_{n=1}^{N} with Equation (4);
end procedure
procedure JOINT-TRAINING
    while not converged do
        Use NMT_{y→x} to generate back-translations x for Y = {y^(t)}_{t=1}^{T} and build the pseudo-parallel corpus Y′ = {(x, y^(t))}_{t=1}^{T};   ▷ E-step for NMT_{x→y}
        Use NMT_{x→y} to generate back-translations y for X = {x^(s)}_{s=1}^{S} and build the pseudo-parallel corpus X′ = {(x^(s), y)}_{s=1}^{S};   ▷ E-step for NMT_{y→x}
        Train M_{x→y} with Equation (10) given the weighted bilingual corpus D ∪ Y′;   ▷ M-step for NMT_{x→y}
        Train M_{y→x} with Equation (12) given the weighted bilingual corpus D ∪ X′;   ▷ M-step for NMT_{y→x}
    end while
end procedure
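To make the procedure concrete, the following is a minimal sketch of the loop in Algorithm 1, assuming hypothetical model objects with `train(weighted_pairs)` and `translate_nbest(sentence, k)` methods (the latter returning (translation, normalized probability) pairs); neither helper is part of the paper.

```python
# Minimal sketch of Algorithm 1 (joint EM training), under the stated assumptions.

def joint_training(m_xy, m_yx, parallel, mono_x, mono_y, iterations=4, k=2):
    bitext = [(x, y, 1.0) for x, y in parallel]   # true pairs get weight 1
    m_xy.train(bitext)                            # pre-training (iteration 0)
    m_yx.train(bitext)

    for _ in range(iterations):
        # E-step for M_xy: back-translate target monolingual data with M_yx,
        # keeping the k-best source translations weighted by p(x | y).
        y_pseudo = [(x, y, w) for y in mono_y
                    for x, w in m_yx.translate_nbest(y, k)]
        # E-step for M_yx: back-translate source monolingual data with M_xy.
        x_pseudo = [(x, y, w) for x in mono_x
                    for y, w in m_xy.translate_nbest(x, k)]
        # M-steps: retrain each model on the true bitext plus its pseudo data.
        m_xy.train(bitext + y_pseudo)             # D ∪ Y′
        m_yx.train(bitext + x_pseudo)             # D ∪ X′
    return m_xy, m_yx
```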
Training Objective

Next we show how to derive our new learning objective for joint training, starting with the case where only one NMT model is involved.

Given a parallel corpus $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$ and a monolingual corpus in the target language $Y = \{y^{(t)}\}_{t=1}^{T}$, the semi-supervised training objective is to maximize the likelihood of both bilingual data and monolingual data:
$$L^*(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p(y^{(t)}) \quad (5)$$
where the first term on the right side denotes the likelihood of bilingual data and the second term represents the likelihood of target-side monolingual data. (Note that the training criterion on the parallel data $D$ is still MLE, maximum likelihood estimation.) Next we introduce the source translations as hidden states for the target sentences and decompose $\log p(y^{(t)})$ as
$$\begin{aligned} \log p(y^{(t)}) &= \log \sum_{x} p(x, y^{(t)}) = \log \sum_{x} Q(x) \frac{p(x, y^{(t)})}{Q(x)} \\ &\geq \sum_{x} Q(x) \log \frac{p(x, y^{(t)})}{Q(x)} \quad \text{(Jensen's inequality)} \\ &= \sum_{x} Q(x) \log p(y^{(t)} \mid x) - \mathrm{KL}\big(Q(x) \,\|\, p(x)\big) \quad (6) \end{aligned}$$
where $x$ is a latent variable representing the source translation of the target sentence $y^{(t)}$, $Q(x)$ is an approximate probability distribution of $x$, $p(x)$ represents the marginal distribution of sentence $x$, and $\mathrm{KL}(Q(x) \| p(x))$ is the Kullback-Leibler divergence between the two distributions. For the equality in Equation (6) to hold, $Q(x)$ must satisfy the condition
$$\frac{p(x, y^{(t)})}{Q(x)} = c \quad (7)$$
where $c$ is a constant that does not depend on $x$. Given $\sum_{x} Q(x) = 1$, $Q(x)$ can be calculated as
$$Q(x) = \frac{p(x, y^{(t)})}{c} = \frac{p(x, y^{(t)})}{\sum_{x} p(x, y^{(t)})} = p^*(x \mid y^{(t)}) \quad (8)$$
where $p^*(x \mid y^{(t)})$ denotes the true target-to-source translation probability. Since it is usually not possible to calculate $p^*(x \mid y^{(t)})$ in practice, we use the translation probability $p(x \mid y^{(t)})$ given by a target-to-source NMT model as $Q(x)$. Combining Equations (5) and (6), we have
$$L^*(\theta_{x \to y}) \geq L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \Big[ \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) - \mathrm{KL}\big(p(x \mid y^{(t)}) \,\|\, p(x)\big) \Big] \quad (9)$$
This means $L(\theta_{x \to y})$ is a lower bound of the true likelihood function $L^*(\theta_{x \to y})$. Since $\mathrm{KL}(p(x \mid y^{(t)}) \| p(x))$ is irrelevant to the parameters $\theta_{x \to y}$, $L(\theta_{x \to y})$ can be simplified as
$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) \quad (10)$$
The first part of $L(\theta_{x \to y})$ is the same as the MLE training objective, while the second part can be optimized with the EM algorithm. We estimate the expectation of the source translation probability $p(x \mid y^{(t)})$ in the E-step, and maximize the second part in the M-step. The E-step uses the target-to-source translation model M_{y→x} to generate the source translations as hidden variables, which are paired with the target sentences to build a new distribution of training data together with the true parallel data D. Therefore, maximizing $L(\theta_{x \to y})$ can be approximated by maximizing the log-likelihood on the new training data.
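Since the sum over all possible translations x is intractable, the paper approximates it with n-best translations weighted by normalized model probabilities. Below is a minimal sketch of that renormalization, assuming the decoder returns a log-probability for each n-best candidate; how exactly the scores are normalized is an assumption here.

```python
# Sketch: approximate Q(x) = p(x | y_t) over the n-best back-translations by
# renormalizing the model scores so the weights sum to 1.
import math

def nbest_weights(scored_nbest):
    """scored_nbest: list of (translation, log_prob). Returns (translation, weight)."""
    m = max(lp for _, lp in scored_nbest)             # stabilize the softmax
    exp_scores = [(t, math.exp(lp - m)) for t, lp in scored_nbest]
    z = sum(s for _, s in exp_scores)
    return [(t, s / z) for t, s in exp_scores]        # weights sum to 1
```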
The translation probability $p(x \mid y^{(t)})$ is used as the weight of the pseudo sentence pairs, which helps to filter out bad translations. It is easy to verify that the back-translation approach (Sennrich, Haddow, and Birch 2016a) is a special case of this formulation of $L(\theta_{x \to y})$, in which $p(x \mid y^{(t)}) = 1$ because only the best translation $M_{y \to x}(y^{(t)})$ from the NMT model is used:
$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p\big(y^{(t)} \mid M_{y \to x}(y^{(t)})\big) \quad (11)$$
Similarly, the likelihood of the NMT model M_{y→x} can be derived as
$$L(\theta_{y \to x}) = \sum_{n=1}^{N} \log p(x^{(n)} \mid y^{(n)}) + \sum_{s=1}^{S} \sum_{y} p(y \mid x^{(s)}) \log p(x^{(s)} \mid y) \quad (12)$$
where $y$ is a target translation (hidden state) of the source sentence $x^{(s)}$. The overall training objective is the sum of the likelihoods in both directions:
$$L(\theta) = L(\theta_{x \to y}) + L(\theta_{y \to x})$$
During the derivation of $L(\theta_{x \to y})$, we use the translation probability $p(x \mid y^{(t)})$ from M_{y→x} as an approximation of the true distribution $p^*(x \mid y^{(t)})$. When $p(x \mid y^{(t)})$ gets closer to $p^*(x \mid y^{(t)})$, we obtain a tighter lower bound of $L^*(\theta_{x \to y})$, gaining more opportunity to improve M_{x→y}. Joint training of paired NMT models is designed to solve this problem when source monolingual data are also available.
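Putting the pieces together, here is a minimal sketch of the weighted semi-supervised loss in Equation (10) (and, symmetrically, Equation (12)), with a hypothetical `log_p(x, y)` returning log p(y|x); the helper is an assumption for illustration.

```python
# Sketch of the weighted objective in Equation (10): true bitext pairs count
# with weight 1, pseudo pairs with their normalized translation probability.

def joint_objective(log_p, bitext, pseudo_pairs):
    """bitext: [(x, y)]; pseudo_pairs: [(x, y, weight)] from n-best back-translation."""
    supervised = sum(log_p(x, y) for x, y in bitext)             # MLE term
    weighted = sum(w * log_p(x, y) for x, y, w in pseudo_pairs)  # EM term
    return supervised + weighted
```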
Experiments

Setup
We evaluate our proposed approach on two language pairs: Chinese↔English and English↔German. In all experiments, we use BLEU (Papineni et al. 2002) as the evaluation metric for translation quality.
Dataset
For Chinese↔English translation, we select our training data from LDC corpora, which consist of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words respectively. (The corpora include LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08, LDC2005T10.) We use 8M Chinese sentences and 8M English sentences randomly extracted from the Xinhua portion of the Gigaword corpus as the monolingual data sets. Any sentence longer than 60 words is removed from the training data (both the bilingual data and the pseudo-bilingual data). For Chinese-English, the NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, NIST 2005, NIST 2008 and NIST 2012 datasets as test sets. In both the validation and test sets, each Chinese sentence has four reference translations. For English-Chinese, we use the NIST datasets in the reverse direction: treating the first English sentence of the four reference translations as the source sentence and the Chinese sentence as the single reference. We limit the vocabulary to the 50K most frequent words on both the source and target sides, and convert the remaining words into the <unk> token.

Direction | System      | NIST2006 | NIST2003 | NIST2005 | NIST2008 | NIST2012 | Average
C→E       | RNNSearch   | 38.61    | 39.39    | 38.31    | 30.04    | 28.48    | 34.97
C→E       | RNNSearch+M | 40.66    | 43.26    | 41.61    | 32.48    | 31.16    | 37.83
C→E       | SS-NMT      | 41.53    | 44.03    | 42.24    | 33.40    | 31.58    | 38.56
C→E       | JT-NMT      | –        | –        | –        | –        | –        | 39.67
E→C       | RNNSearch   | 17.75    | 18.37    | 17.10    | 13.14    | 12.85    | 15.84
E→C       | RNNSearch+M | 21.28    | 21.19    | 19.53    | 16.47    | 15.86    | 18.87
E→C       | SS-NMT      | 21.62    | 22.00    | 19.70    | 17.06    | 16.48    | 19.37
E→C       | JT-NMT      | –        | –        | –        | –        | –        | 20.30

Table 1: Case-insensitive BLEU scores (%) on Chinese↔English translation. "Average" denotes the average BLEU score over all datasets in the same setting. "C" and "E" denote Chinese and English respectively.
Implementation Details
The RNNSearch model proposed by Bahdanau, Cho, and Bengio (2014) is adopted as our baseline, which uses a single-layer GRU-RNN for the encoder and another for the decoder. The size of the word embeddings (for both source and target words) is 256 and the size of the hidden layer is set to 1024. The parameters are initialized using a normal distribution with a mean of 0 and a variance of $\sqrt{2/(d_{row} + d_{col})}$, where $d_{row}$ and $d_{col}$ are the number of rows and columns in the parameter matrix (Glorot and Bengio 2010). Our models are optimized with the Adadelta (Zeiler 2012) algorithm with a mini-batch size of 128. We re-normalize the gradient if its norm is larger than 2.0 (Pascanu, Mikolov, and Bengio 2013). At test time, beam search with beam size 8 is employed to find the best translation, and translation probabilities are normalized by the length of the translation sentences. In a post-processing step, we follow the work of Luong et al. (2015) to handle <unk> replacement.
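For concreteness, a short sketch of the stated initialization and gradient re-normalization rules; treating the initialization scale as sqrt(2 / (d_row + d_col)) is an assumption in this reading, so consider the exact constant illustrative.

```python
# Sketch of the initialization and gradient re-normalization described above.
import numpy as np

def init_weight(d_row, d_col, rng=np.random.default_rng(0)):
    """Normal init with mean 0 and a Glorot-style scale."""
    scale = np.sqrt(2.0 / (d_row + d_col))
    return rng.normal(0.0, scale, size=(d_row, d_col))

def clip_gradient(grad, max_norm=2.0):
    """Re-normalize the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```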
Baseline
Our proposed joint-training approach is compared with three NMT baselines for all translation tasks:
• RNNSearch: Attention-based NMT system (Bahdanau, Cho, and Bengio 2014). Only bilingual corpora are used to train a standard attention-based NMT model.
• RNNSearch+M: Bilingual and target-side monolingual corpora are used to train RNNSearch. We follow Sennrich, Haddow, and Birch (2016a) to construct pseudo-parallel corpora by generating the source side with back-translation of the target-side monolingual data.
• SS-NMT: Semi-supervised NMT training proposed by Cheng et al. (2016). For a fair comparison, in all experiments their method adopts the same settings as our approach, including the same source and target monolingual data.
Chinese ↔ English Translation Result
Table 1 shows the evaluation results of different models on the NIST datasets, in which JT-NMT represents our joint training for NMT using monolingual data. All results are reported based on case-insensitive BLEU.

Compared with RNNSearch, we can see that RNNSearch+M, SS-NMT and JT-NMT all bring significant improvements across the different test sets. Our approach achieves the best results: 4.70 and 4.46 BLEU points of improvement over RNNSearch on average for Chinese-to-English and English-to-Chinese respectively. These results confirm that exploiting massive monolingual corpora improves translation performance.

From Table 1, we find that our JT-NMT achieves better performance than RNNSearch+M across the different test sets, with 1.84 and 1.43 BLEU points of improvement on average in the Chinese-to-English and English-to-Chinese directions respectively. Compared with RNNSearch+M, our joint training approach introduces data weights to better handle poor pseudo-training data, and the joint interactive training can boost the models of the two directions with the help of each other, instead of only using the target-to-source model to help the source-to-target model. Our approach also yields better translations than SS-NMT, with at least 0.93 BLEU points of improvement on average. This result shows that our method can make better use of both source and target monolingual corpora than Cheng et al. (2016)'s approach.

System                          | Architecture                                        | E→D   | D→E
Jean et al. (2015)              | Gated RNN with search + PosUnk                      | 18.97 | –
Jean et al. (2015)              | Gated RNN with search + PosUnk + 500K vocabs        | 19.40 | –
Shen et al. (2016)              | Gated RNN with search + PosUnk + MRT                | 20.45 | –
Luong, Pham, and Manning (2015) | LSTM with 4 layers + dropout + local att. + PosUnk  | 20.90 | –
RNNSearch                       | Gated RNN with search + BPE                         | 19.78 | 24.91
RNNSearch+M                     | Gated RNN with search + BPE + monolingual data      | 21.89 | 26.81
SS-NMT                          | Gated RNN with search + BPE + monolingual data      | 22.64 | 27.30
JT-NMT                          | Gated RNN with search + BPE + monolingual data      | 23.60 | 27.98

Table 2: Case-sensitive BLEU scores (%) on English↔German translation. "PosUnk" denotes Luong et al. (2015)'s technique for handling rare words. "MRT" denotes the minimum risk training proposed in Shen et al. (2016). "BPE" denotes the Byte Pair Encoding proposed by Sennrich, Haddow, and Birch (2016b) for word segmentation. "D" and "E" denote German and English respectively.
English ↔ German Translation Result
For the English↔German translation task, in addition to the baseline system, we also include results of other existing NMT systems, including Jean et al. (2015), Shen et al. (2016) and Luong, Pham, and Manning (2015). To be comparable with other work, all results are reported based on case-sensitive BLEU. Experimental results are shown in Table 2.

We can observe that the baseline RNNSearch with the BPE method achieves better results than Jean et al. (2015), even better than their result using a larger vocabulary of size 500K. Compared with RNNSearch, we observe that RNNSearch+M, SS-NMT and JT-NMT bring significant improvements in both the English-to-German and German-to-English directions, which confirms the effectiveness of leveraging monolingual corpora. Our approach outperforms RNNSearch+M and SS-NMT by a notable margin and obtains the best BLEU scores of 23.60 and 27.98 on the English-to-German and German-to-English test sets respectively. These experimental results further confirm the effectiveness of our joint training mechanism, similar to what is shown in the Chinese↔English translation tasks.
Effect of Joint Training
We further investigate the impact of our joint training approach JT-NMT during the whole training process. Figure 2 shows the BLEU scores on the Chinese↔English and English↔German validation and test sets at each iteration. We can see that more iterations lead to consistently better evaluation results, which verifies that the joint training of NMT models in two directions can boost their translation performance.

In Figure 2, "Iteration 0" shows the BLEU scores of the baseline RNNSearch, and clearly the first few iterations gain the most, especially "Iteration 1". After three iterations, we cannot obtain significant improvement anymore. As we said previously, as the target-to-source model approaches the ideal translation probability, the lower bound of the loss gets closer to the true loss. During training, the closer the lower bound is to the true loss, the smaller the potential gain. Since there is considerable uncertainty during training, the performance sometimes drops a little.

System                | C→E   | E→C   | D→E   | E→D
RNNSearch+M           | 37.83 | 18.87 | 26.81 | 21.89
JT-NMT (Iteration 1)  | –     | –     | –     | –

Table 3: BLEU scores (%) on the Chinese↔English and English↔German translation tasks. For Chinese↔English translation, we list the average results over all test sets. For English↔German translation, we list the results on newstest2014.

JT-NMT (Iteration 1) can be considered a general version of RNNSearch+M, which weights every pseudo sentence pair as 1. From Table 3, we can see that JT-NMT (Iteration 1) slightly surpasses RNNSearch+M on all test datasets, which shows that the weights introduced in our algorithm can filter poor synthetic data and lead to better performance. Our approach assigns low weights to synthetic sentence pairs with poor translations, so as to punish their effect on the model update. The translations are refined and improved in subsequent iterations, as shown in Table 4, which gives the translations of a Chinese sentence in different iterations.

Monolingual: 当终场哨声响起，意大利首都罗马沸腾了。(dang zhongchang shaosheng xiang qi, yidali shoudu luoma feiteng le.)
Reference: when the final whistle sounded , the italian capital of rome boiled .
Translation [Iteration 0]: the italian capital of rome was boiling with the rome .
Translation [Iteration 1]: the italian capital of rome was boiling with the sound of the end of the door .
Translation [Iteration 4]: when the final whistle sounded , the italian capital of rome was boiling .

Table 4: Example translations of a Chinese sentence in different iterations.
Related Work
Neural machine translation has drawn more and more attention in recent years (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015; Jean et al. 2015; Tu et al. 2016; Wu et al. 2016). For the original NMT system, only parallel corpora can be used for model training with the MLE method, so much research in the literature attempts to exploit massive monolingual corpora. Gulcehre et al. (2015) first investigated the integration of monolingual data for neural machine translation. They train monolingual language models independently, which are integrated into the NMT system through the proposed shallow and deep fusion methods. Sennrich, Haddow, and Birch (2016a) propose to generate synthetic bilingual data by translating target monolingual sentences into source-language sentences, and the mixture of the original bilingual data and the synthetic parallel data is used to retrain the NMT system. As an extension of their approach, our approach introduces translation probabilities from the target-to-source model as weights of synthetic parallel sentences to punish poor pseudo-parallel sentences, and further uses interactive training of NMT models in two directions to refine them.

Figure 2: BLEU scores (%) on Chinese↔English and English↔German validation and test sets for JT-NMT during the training process, with panels (a) Chinese-English, (b) English-Chinese, (c) German-English and (d) English-German. "Dev" denotes the results on validation datasets, while "Test" (or "Average" for the NIST tasks) denotes the results on test datasets.

Recently, Zhang and Zong (2016) proposed a multi-task learning framework to exploit source-side monolingual data, in which they jointly perform machine translation on synthetic bilingual data and sentence reordering with source-side monolingual data. Cheng et al. (2016) reconstruct monolingual data with an autoencoder, in which the source-to-target and target-to-source translation models form a closed loop and are jointly updated. Different from their method, our approach extends Sennrich, Haddow, and Birch (2016a) by directly introducing source-side monolingual data to improve reverse NMT models and adopts an EM algorithm to iteratively update the bidirectional NMT models. Our approach can better exploit both target and source monolingual data, while they report no improvement when using both target and source monolingual data compared with using just target monolingual data. He et al. (2016) treat the source-to-target and target-to-source models as the primal and dual tasks respectively; similar to the work of Cheng et al. (2016), they also employ round-trip translations of each monolingual sentence to obtain feedback signals. Ramachandran, Liu, and Le (2017) adopt pre-trained weights of two language models to initialize the encoder and decoder of a seq2seq model, and then fine-tune it with labeled data. Their approach is complementary to our mechanism: leveraging pre-trained language models to initialize the bidirectional NMT models may lead to additional gains.
Conclusion
In this paper, we propose a new semi-supervised training approach that integrates the training of a pair of translation models in a unified learning process, with the help of monolingual data from both the source and target sides. In our method, a joint-EM training algorithm is employed to optimize the two translation models cooperatively, so that the two models can mutually boost their translation performance. The translation probability of the other model is used as a weight to estimate translation accuracy and punish bad translations. Empirical evaluations on Chinese↔English and English↔German translation tasks demonstrate that our approach leads to significant improvements compared with strong baseline systems. In future work, we plan to extend this method to jointly train multiple NMT systems for three or more languages using massive monolingual data.
Acknowledgments
This research was partially supported by grants from the National Natural Science Foundation of China (Grants No. 61727809, 61325010 and U1605251). We appreciate Dongdong Zhang, Shuangzhi Wu, Wenhu Chen and Guanlin Li for the fruitful discussions. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments.

References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Semi-supervised learning for neural machine translation. In Proceedings of ACL 2016.
Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, 249–256.
Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
Jean, S.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL 2015.
Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In EMNLP, volume 3, 413.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In HLT-NAACL.
Luong, T.; Sutskever, I.; Le, Q.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL 2015.
Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 1310–1318.
Ramachandran, P.; Liu, P.; and Le, Q. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of EMNLP 2017.
Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of ACL 2016.
Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proceedings of ACL 2016.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proceedings of ACL 2016.
Ueffing, N.; Haffari, G.; and Sarkar, A. 2007. Transductive learning for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, 25.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, L.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G. S.; Hughes, M.; and Dean, J. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zhang, J., and Zong, C. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP 2016.