Improving Domain Adaptation Translation with Domain Invariant and Specific Information
Shuhao Gu
Yang Feng
Qun Liu
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
University of Chinese Academy of Sciences
Huawei Noah's Ark Lab, Hong Kong, China
{gushuhao17g, fengyang}@ict.ac.cn, [email protected]
* Corresponding Author

Abstract
In domain adaptation for neural machine translation, translation performance can benefit from separating features into domain-specific features and common features. In this paper, we propose a method to explicitly model the two kinds of information in the encoder-decoder framework so as to exploit out-of-domain data in in-domain training. In our method, we maintain a private encoder and a private decoder for each domain, which are used to model domain-specific information. In the meantime, we introduce a common encoder and a common decoder shared by all the domains, through which only domain-independent information can flow. Besides, we add a discriminator to the shared encoder and employ adversarial training for the whole model to reinforce the performance of information separation and machine translation simultaneously. Experiment results show that our method can outperform competitive baselines greatly on multiple data sets.
Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014; Gehring et al., 2017) has made great progress and drawn much attention recently. Most NMT models are based on the encoder-decoder architecture, where all the sentence pairs share the same set of parameters for the encoder and decoder. This makes NMT models tend to overfit to frequent observations (e.g., words, word co-occurrences, translation patterns) while overlooking special cases that are not frequently observed. However, in practical applications, NMT models usually need to perform translation for some specific domain with only a small quantity of in-domain training data but a large amount of out-of-domain data. Simply combining in-domain training data with out-of-domain data will lead to overfitting to the out-of-domain data. Therefore, some domain adaptation technique should be adopted to improve in-domain translation.

Fortunately, out-of-domain data still embodies common knowledge shared between domains, and incorporating the common knowledge from out-of-domain data can help in-domain translation. Britz et al. (2017) have made this kind of attempt and managed to improve in-domain translation. The common architecture of this method is to share a single encoder and decoder among all the domains and add a discriminator to the encoder to distinguish the domains of the input sentences. The training is based on adversarial learning between the discriminator and the translation part, ensuring the encoder can learn common knowledge across domains that helps to generate target translations. Zeng et al. (2018) extend this line of work by introducing a private encoder to learn some domain-specific knowledge. They have proven that domain-specific knowledge is a complement to domain-invariant knowledge and indispensable for domain adaptation. Intuitively, besides the encoder, the knowledge inferred by the decoder can also be divided into domain-specific and domain-invariant parts, and further improvement can be achieved by employing private decoders.

In this paper, in order to produce in-domain translation with not only common knowledge but also in-domain knowledge, we employ a common encoder and decoder shared among all the domains and also a private encoder and decoder for each domain separately. The differences between our method and the above methods lie in two points: first, we employ multiple private encoders rather than a single private encoder shared by all the domains; second, we also introduce multiple private decoders, in contrast to no private decoder. This architecture is based on the consideration that out-of-domain data is far more plentiful than in-domain data, and only using one private encoder and/or decoder has the risk of overfitting. Under the framework of our method, the translation of each domain is predicted from the output of both the common decoder and its private decoder. In this way, the in-domain private decoder has direct influence on the generation of in-domain translation, and the out-of-domain decoder is used to help train the common encoder and decoder better, which can also help in-domain translation. We conducted experiments on English→Chinese and English↔German domain adaptation tasks for machine translation under the frameworks of RNNSearch (Bahdanau et al., 2014) and Transformer (Vaswani et al., 2017) and get consistently significant improvements over several strong baselines.
Related Work

The task of domain adaptation for NMT is to translate text in a domain for which only a small number of parallel sentences is available. The main idea of the work on domain adaptation is to introduce external information to help in-domain translation, which may include in-domain monolingual data, meta information or out-of-domain parallel data.

To exploit in-domain monolingual data, Gülçehre et al. (2015) train an RNNLM on the target-side monolingual data first and then use it in decoding. Domhan and Hieber (2017) further extend this work by training the RNNLM part and the translation part jointly. Sennrich et al. (2015a) propose to conduct back translation of the monolingual target data so as to generate the corresponding parallel data. Zhang and Zong (2016) employ a self-learning algorithm to generate synthetic large-scale parallel data for NMT training. To introduce meta information, Chen et al. (2016) use the topic or category information of the input text to assist the decoder, and Kobus et al. (2017) extend generic NMT models, which are trained on a diverse set of data, to specific domains with specialized terminology and style.

To make use of out-of-domain parallel data, Luong and Manning (2015) first train an NMT model with a large amount of out-of-domain data and then fine-tune the model with in-domain data. Wang et al. (2017a) select sentence pairs from the out-of-domain data set according to their similarity to the in-domain data and then add them to the in-domain training data. Chu et al. (2017) construct the training data set for the NMT model by combining out-of-domain data with over-sampled in-domain data. Wang et al. (2017b) combine the in-domain and out-of-domain data together as the training data but apply instance weighting to get a weight for each sentence pair in the out-of-domain data, which is used in the parameter updating during back propagation. Britz et al. (2017) employ a common encoder to encode the sentences from both the in-domain and out-of-domain data and meanwhile add a discriminator to the encoder to make sure that only domain-invariant information is transferred to the decoder. They focus on the situation where the quantity of out-of-domain data is almost the same as that of in-domain data, while our method can handle more generic situations and places no specific demand on the ratio between the quantity of in-domain and out-of-domain data. Besides, our method employs a private encoder-decoder for each domain which can hold the domain-specific features. In addition to the common encoder, Zeng et al. (2018) further introduce a domain-specific encoder for each domain together with a domain-specific classifier to ensure the features extracted by the domain-specific encoder are proper. Compared to our method, they focus on the encoder and do not distinguish the information in the decoder.

Adversarial networks have achieved great success in some areas (Ganin et al., 2016; Goodfellow et al., 2014). Inspired by these works, we also employ a domain discriminator to extract domain-invariant features, which has already shown its effectiveness in some related NLP tasks. Chen et al. (2017) use a classifier to exploit the shared information between different Chinese word segmentation criteria. Gui et al. (2017) try to learn common features of the out-of-domain and in-domain data through an adversarial discriminator for the part-of-speech tagging problem.
Kim et al. (2017) train a cross-lingual model with language-adversarial training to generate general information across different languages for the POS tagging problem. All these works try to utilize a discriminator to distinguish invariant features across the divergence.
Figure 1: The architecture of the attention-based NMT.
Background

Our method can be applied to both the RNN-based NMT model (Bahdanau et al., 2014) and the self-attention-based NMT model (Vaswani et al., 2017). In this paper, we introduce our method under the RNN-based framework; the application to the self-attention-based framework can be implemented in a similar way. Before introducing our method, we first briefly describe the RNN-based NMT model with attention shown in Figure 1.
The encoder uses two GRUs to go through the source words bidirectionally and get two hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ for the source word $x_i$, which are then concatenated to produce the final hidden state for $x_i$ as follows:

$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] \quad (1)$$

The attention layer aims to extract the source information which is most related to the generation of each target word. First it evaluates the correlation between the previous decoder hidden state $s_{j-1}$ and each source hidden state $h_i$ by

$$e_{ij} = v_\alpha^T \tanh(W_\alpha s_{j-1} + U_\alpha h_i), \quad (2)$$

then calculates $\alpha_{ij}$, the correlation degree with each source hidden state $h_i$, and finally gets the attention context $c_j$. The formulation is as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{i'=1}^{l_s} \exp(e_{i'j})}; \qquad c_j = \sum_{i=1}^{l_s} \alpha_{ij} h_i \quad (3)$$

The decoder also employs a GRU to get the hidden state $s_j$ for the target word $y_j$ as

$$s_j = g(y_{j-1}, s_{j-1}, c_j). \quad (4)$$
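As a concrete illustration of Equations 2 and 3, the following is a minimal PyTorch sketch of the additive attention; the module and variable names are our own and this is not the released implementation.

```python
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: e_ij = v^T tanh(W s_{j-1} + U h_i)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, hidden), enc_states: (batch, src_len, hidden)
        scores = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(enc_states)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)      # (batch, src_len), Eq. 3
        context = torch.bmm(alpha.unsqueeze(1), enc_states)    # weighted sum of h_i
        return context.squeeze(1), alpha
```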
Figure 2: The architecture of the proposed method. GRL means the gradient reversal layer, which multiplies the gradients by a negative constant during back-propagation.
Then the probability of the target word $y_j$ is defined as follows:

$$p(y_j \mid s_j, y_{j-1}, c_j) \propto \exp(y_j^T W_o t_j) \quad (5)$$

where $t_j$ is computed by

$$t_j = U_o s_{j-1} + V_o E y_{j-1} + C_o c_j \quad (6)$$

Method

Assume that we have two kinds of training data, out-of-domain and in-domain, and we want to get the translation for the in-domain input. The out-of-domain and in-domain data can be represented as

$$\text{out} = \{(x_k, y^*_k)\}_{k=1}^{N_{out}} \sim D_{out}; \qquad \text{in} = \{(x_k, y^*_k)\}_{k=1}^{N_{in}} \sim D_{in} \quad (7)$$

The main idea of our method is to extract domain-invariant information from the out-of-domain data to improve in-domain translation. To this end, we employ a common encoder and a common decoder shared by both of the domains, and a private encoder and a private decoder for each domain. The main architecture is given in Figure 2.

The working scenario of our method is as follows. When a sentence comes, it is input into the shared encoder and the private encoder of the corresponding domain simultaneously. Then the output of the shared encoder is fed into the shared decoder and the output of the private encoder into its corresponding private decoder. Finally, the shared decoder and the private decoder collaborate to generate the current target word, with a gate deciding the contribution ratio.

In addition, our method also introduces a discriminator to distinguish the domain of the input sentence based on the output of the shared encoder. When the discriminator cannot predict the domain of the input sentence, we can consider the knowledge encoded in the shared encoder domain invariant. This is achieved with a gradient reversal layer (GRL) so that the gradients are reversed during back-propagation. In this way, adversarial training is performed between the translation part and the discriminator.

The Encoders

Our model has a shared encoder, an in-domain private encoder and an out-of-domain private encoder, where the shared encoder accepts input from both domains. Given a sentence of domain $p$ ($p \in \{in, out\}$), the shared encoder and the private encoder of domain $p$ process the sentence as the encoder described in the Background section, and their outputs for word $x_j$ are represented as $h^c_j$ and $h^p_j$, respectively.
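To make this data flow concrete, here is a schematic PyTorch-style sketch of how a sentence from domain p passes through the shared and private paths and how a gate (Equation 8 below) merges the two decoder outputs. All names (DualPathNMT, shared_enc, private_encs, ...) are illustrative placeholders, not the authors' code, and the encoder/decoder modules are assumed to be supplied by the caller.

```python
import torch
import torch.nn as nn


class DualPathNMT(nn.Module):
    """Illustrative wiring of the shared and private encoder-decoder paths."""

    def __init__(self, shared_enc, shared_dec, private_encs, private_decs, hidden):
        super().__init__()
        self.shared_enc, self.shared_dec = shared_enc, shared_dec
        self.private_encs = nn.ModuleDict(private_encs)   # {"in": ..., "out": ...}
        self.private_decs = nn.ModuleDict(private_decs)
        self.W_z = nn.Linear(hidden, hidden, bias=False)   # gate weight for shared output
        self.U_z = nn.Linear(hidden, hidden, bias=False)   # gate weight for private output

    def forward(self, src, prev_tgt, domain):
        h_c = self.shared_enc(src)                   # shared representation
        h_p = self.private_encs[domain](src)         # domain-specific representation
        t_c = self.shared_dec(prev_tgt, h_c)         # shared decoder output
        t_p = self.private_decs[domain](prev_tgt, h_p)
        z = torch.sigmoid(self.W_z(t_c) + self.U_z(t_p))   # gate (Eq. 8 below)
        t = z * t_c + (1 - z) * t_p
        return t, h_c                                # h_c is also fed to the discriminator
```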
The Attention Layer

As the output of the shared encoder is only fed to the shared decoder and the output of the private encoder of domain $p$ only flows to the private decoder of domain $p$, we only need to calculate the attention of the shared decoder over the shared encoder and the attention of the private decoder of domain $p$ over the private encoder of domain $p$. We calculate these two attentions as in the Background section and denote them as $c^c_j$ and $c^p_j$ for the shared decoder and the private decoder, respectively.

The Decoder
We also maintain a shared decoder, an in-domain private decoder and an out-of-domain private decoder. For a sentence of domain $p$ ($p \in \{in, out\}$), the shared decoder and the private decoder of domain $p$ act in the same way as shown in Equation 4 and Equation 6 and produce the hidden states $s^c_j$ and $t^c_j$ for the shared decoder, and $s^p_j$ and $t^p_j$ for the private decoder.

To predict the target word $y_j$, $t^c_j$ and $t^p_j$ are combined with a weighted sum to get $t_j$:

$$z_j = \sigma(W_z t^c_j + U_z t^p_j); \qquad t_j = z_j \cdot t^c_j + (1 - z_j) \cdot t^p_j \quad (8)$$

where $\sigma(\cdot)$ is the sigmoid function and $W_z$ and $U_z$ are shared by the in-domain and out-of-domain parts. Finally, the probability of the target word $y_j$ is computed with

$$P(y_j \mid \ldots) \propto \exp(y_j^T W_o t_j) \quad (9)$$

The Domain Discriminator

The domain discriminator acts as a classifier to determine whether the knowledge encoded in the shared encoder comes from the in-domain or the out-of-domain data. When a well-trained discriminator cannot classify the domain properly, we can consider the knowledge in the shared encoder domain invariant (Ganin et al., 2016). As CNNs have shown their effectiveness in some related classification tasks (Zhang et al., 2015; Yu et al., 2017), we construct our discriminator with a CNN.

First, the input to the CNN is the representation of the whole source sentence, which is obtained by concatenating the sequence of hidden states generated by the shared encoder:

$$\Pi_I = h_1 \oplus h_2 \oplus \cdots \oplus h_I \quad (10)$$

where $I$ is the length of the source sentence, $h_1, \ldots, h_I$ are the hidden states of the corresponding source words, and $\oplus$ stands for the concatenation operation of the hidden states. We thus get the final source sentence representation $\Pi_I \in \mathbb{R}^{I \times m}$, where $m$ is the dimension of the hidden state.

We then employ a kernel $w \in \mathbb{R}^{l \times m}$ to apply a convolutional operation and produce a new feature map:

$$f = \rho(w \otimes \Pi_I + b_f) \quad (11)$$

where $\rho$ is the ReLU activation function, $\otimes$ stands for the convolutional operation of the kernel, and $b_f$ is the bias term. A number of kernels with different window sizes are used in our work to extract different features at different scales. Next, we apply a max-over-time pooling operation over the feature maps to get a new feature map. To further improve the performance of the discriminator, following Yu et al. (2017), we also add the highway architecture (Srivastava et al., 2015; Zhang et al., 2018) on top of the pooled feature maps, where we use a gate to control the information flow between the two layers. Finally, the combined feature map is fed into a fully connected network with a sigmoid activation function to make the final prediction:

$$p(d) \propto \exp(W_d \cdot f + b_d) \quad (12)$$

where $d$ is the domain label, in-domain or out-of-domain.
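The following is a minimal sketch of such a CNN discriminator, our reading of Equations 10-12 rather than the authors' code: parallel convolutions with several window widths, ReLU feature maps, max-over-time pooling, a single highway layer and a sigmoid output. The filter counts and widths are assumptions.

```python
import torch
import torch.nn as nn


class DomainDiscriminator(nn.Module):
    """CNN over the shared-encoder hidden states, in the spirit of Eqs. 10-12."""

    def __init__(self, hidden_size, num_filters=64, kernel_widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, num_filters, w) for w in kernel_widths]
        )
        feat = num_filters * len(kernel_widths)
        self.highway_t = nn.Linear(feat, feat)   # highway transform gate
        self.highway_h = nn.Linear(feat, feat)   # highway nonlinear path
        self.out = nn.Linear(feat, 1)            # in-domain vs. out-of-domain

    def forward(self, enc_states):
        # enc_states: (batch, src_len, hidden) -> (batch, hidden, src_len) for Conv1d
        x = enc_states.transpose(1, 2)
        # ReLU feature maps followed by max-over-time pooling for each kernel width
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        f = torch.cat(pooled, dim=-1)
        # Highway layer: gate * transform + (1 - gate) * identity
        g = torch.sigmoid(self.highway_t(f))
        f = g * torch.relu(self.highway_h(f)) + (1 - g) * f
        return torch.sigmoid(self.out(f))        # probability of the in-domain label
```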
Training

Our final loss considers the translation loss and the domain prediction loss. For the translation loss, we employ cross entropy to maximize the translation probability of the ground truth, so the training objective is to minimize

$$\mathcal{L}_{MT} = -\sum_{k=1}^{N_{in}+N_{out}} \sum_{j=1}^{J_k} \log p(y^*_{kj}) \quad (13)$$

where $N_{in}$ and $N_{out}$ are the numbers of training sentences of the in-domain and out-of-domain data respectively, $J_k$ is the length of the $k$-th ground truth sentence, and $p(y^*_{kj})$ is the predicted probability of the $j$-th word of the $k$-th ground truth sentence. Note that we have three different encoders and three different decoders in total, including the shared encoder and decoder, the in-domain private encoder and decoder, and the out-of-domain private encoder and decoder, and all of them have their own parameters.

For the domain prediction loss, we also use cross entropy and minimize

$$\mathcal{L}_D = -\sum_{k=1}^{N_{in}+N_{out}} \log p(d^*_k) \quad (14)$$

where $d^*_k$ is the ground truth domain label of the $k$-th input sequence. Then the final loss is defined as

$$\mathcal{L} = \mathcal{L}_{MT} + \lambda \mathcal{L}_D \quad (15)$$

where $\lambda$ is a hyper-parameter to balance the effects of the two parts of the loss. We gradually tried $\lambda$ from 0.1 to 2.5 and set it to 1.5 in our final experiments.

Borrowing ideas from Ganin et al. (2016), we introduce a special gradient reversal layer (GRL) between the shared encoder and the domain discriminator. During forward propagation, the GRL has no influence on the model, while during back-propagation it multiplies the gradients back propagated from the discriminator to the shared encoder by a certain negative constant. In this way, adversarial learning is applied between the translation part and the discriminator.

At the beginning of training, we use only $\mathcal{L}_{MT}$ to train the translation part on the combined data, including the shared encoder-decoder and the in-domain and out-of-domain private encoder-decoders. Then we use $\mathcal{L}_D$ to train only the domain discriminator until its precision reaches 90%, while the parameters of the shared encoder are kept fixed. Finally, we train the whole model with the complete loss $\mathcal{L}$ with all the parameters updated. During training, the sentences in each batch are sampled from the in-domain and out-of-domain data at the same rate. During testing, we only use the shared encoder-decoder and the private in-domain encoder-decoder to perform in-domain translation.
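Below is a sketch of the gradient reversal layer and the combined loss of Equation 15, in the style of Ganin et al. (2016). The discriminator is assumed to output a probability for the in-domain label, and λ = 1.5 follows the value reported above; the function names are ours.

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backwards."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def combined_loss(mt_loss, domain_probs, domain_labels, lambd=1.5):
    # L = L_MT + lambda * L_D (Eq. 15); domain_probs come from the discriminator
    # applied to GradReverse.apply(shared_encoder_states, 1.0).
    domain_loss = F.binary_cross_entropy(domain_probs, domain_labels)
    return mt_loss + lambd * domain_loss
```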
Experiments

We evaluated our method on the English→Chinese (En-Zh), German→English (De-En) and English→German (En-De) domain adaptation translation tasks.

English → Chinese

For this task, the out-of-domain data is from the LDC corpus, which contains 1.25M sentence pairs and is mainly related to the News domain. We chose the parallel sentences with the domain label Laws from the UM-Corpus (Tian et al., 2014) as our in-domain data and randomly selected 109K, 1K and 1K sentences from the UM-Corpus as our training, development and test data. We tokenized and lowercased the English sentences with the Moses scripts. For the Chinese data, we performed word segmentation using the Stanford Segmenter (https://nlp.stanford.edu/).
German → English

For this task, the training data is from the Europarl corpus distributed for the shared domain adaptation task of WMT 2007 (Callison-Burch et al., 2007). The out-of-domain data is mainly related to the News domain and contains about 1.25M sentence pairs, and the in-domain data is mainly related to the NewsCommentary domain, which is more informal compared to the news corpus, and contains about 59.1K sentences. We also used the development set of the domain adaptation shared task. Finally, we tested our method on the NC test sets of WMT 2006 and WMT 2007. We tokenized and lowercased the corpora.
English → German
For this task, the out-of-domain corpus is from the WMT 2015 En-De translation task (Bojar et al., 2015), consisting mainly of News texts and containing about 4.2M sentence pairs. For the in-domain corpus, we used the parallel training data from IWSLT 2015, which is mainly from TED talks and contains about 190K sentences. In addition, dev2012 and test2013/2014/2015 of IWSLT 2015 were selected as the development and test data, respectively. We tokenized and truecased the corpora.

Besides, 16K, 16K and 32K merge operations were performed to learn byte-pair encoding (BPE) (Sennrich et al., 2015b) on both sides of the parallel training data of the three tasks, and sentences longer than 50, 50 and 80 tokens were removed from the respective training data.
Settings

We implemented the baselines and our model with the PyTorch framework (http://pytorch.org). For the En-Zh and De-En translation tasks, the batch size was set to 80 and the vocabulary size was set to 25K, which covers all the words in the training set. The source and target embedding sizes were both set to 256 and the size of the hidden units in the shared encoder-decoder RNNs was also set to 256. During experiments, we found that the shared encoder-decoder played a major role in the model and the size of the private encoder-decoder did not influence the results much. Thus, considering training and decoding speed, we simply set the size of the private encoder-decoder to one quarter of that of the shared encoder-decoder.

For the En-De translation task, the batch size was set to 40 and the vocabulary size was set to 35K. The source and target embedding sizes were both set to 620 and the size of the hidden units in the shared encoder and decoder RNNs was set to 1000. As mentioned before, the size of the private encoder-decoder was one quarter of that of the shared encoder-decoder.

All the parameters were initialized using a uniform distribution. The Adadelta algorithm was employed to train the model. We reshuffled the training set between epochs. Besides, the beam size was set to 10.
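For illustration only, the settings above could be collected into a configuration such as the following; the variable names are ours, and the private size of 64 is simply one quarter of the 256-dimensional shared hidden size used for the En-Zh/De-En setup.

```python
import torch

# Hypothetical configuration mirroring the En-Zh / De-En settings described above.
config = {
    "batch_size": 80,
    "vocab_size": 25_000,
    "embed_size": 256,
    "shared_hidden_size": 256,
    "private_hidden_size": 64,   # one quarter of the shared encoder-decoder size
    "beam_size": 10,
    "lambda_domain": 1.5,        # weight of the domain loss (Eq. 15)
}

# model = ...  # e.g. assembled from the DualPathNMT sketch above with these sizes
# optimizer = torch.optim.Adadelta(model.parameters())
```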
Contrast Methods

We compared our model with the following models:

• In: This model was trained only with the in-domain data.
• Out + In: This model was trained with both the in-domain and out-of-domain data.
• Sampler (Chu et al., 2017): This method over-samples the in-domain data and concatenates it with the out-of-domain data.
• Fine Tune (Luong and Manning, 2015): This model was trained first on the out-of-domain data and then fine-tuned using the in-domain data.
• Domain Control (DC) (Kobus et al., 2017): This method extends the word embeddings with an arbitrary number of cells to encode domain information.
• Discriminative Mixing (DM) (Britz et al., 2017): This method adds a discriminator on top of the encoder which is trained to predict the correct class label of the input data. The discriminator is optimized jointly with the translation part.
• Target Token Mixing (TTM) (Britz et al., 2017): This method appends a domain token to the target sequence (see the sketch after this list).
• Adversarial Discriminative Mixing (ADM) (Britz et al., 2017): This method is similar to our model in that it also adds a discriminator to extract common features across domains. The biggest difference is that we add private parts to preserve the domain-specific features. Besides, we also apply a different training strategy, as described in the Training section, so that our method can handle more generic situations.

Note that our model has private encoder-decoders which bring extra parameters, so we slightly enlarged the hidden size of the contrast models to make sure that the total parameter number of each contrast model is equal to that of our model's translation part.
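As an example of how lightweight some of these baselines are, a hedged sketch of the Target Token Mixing preprocessing might look as follows; the tag format is our own choice, not that of Britz et al. (2017).

```python
def add_domain_token(target_tokens, domain):
    """Append a domain tag (e.g. <in> or <out>) to the target sequence (TTM baseline)."""
    return target_tokens + ["<{}>".format(domain)]


# add_domain_token(["this", "is", "an", "example"], "in")
# -> ['this', 'is', 'an', 'example', '<in>']
```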
En-Zh Experiments

Results are measured using the character-based 5-gram BLEU score (Papineni et al., 2002) computed with the multi-bleu.pl script. The main results are shown in Table 1. On both the development set and the test set, our model significantly outperforms the baseline models and the other contrast models. Furthermore, we draw the following conclusions from the results.
Figure 3: The shared encoder's hidden states of the two models: (a) without discriminator, (b) full model. Data from the out-of-domain set are shown as pink dots and data from the in-domain set as yellow dots. There is an obvious separation in the results of the model without the discriminator, while the hidden states of the shared encoder of our full model are well distributed.
En-Zh        dev      test     average
In           32.45    30.42    31.44
Out + In     30.37    28.76    29.57
Sampler      35.06    32.97    34.02
Fine Tune    35.02    33.36    34.19
DC           31.08    29.59    30.34
DM           30.98    29.73    30.36
TTM          31.77    30.11    30.94
ADM          31.23    29.88    30.56
our method   36.55**  34.84**  35.70
Table 1: Results of the En-Zh translation experiments. The marks indicate whether the proposed method was significantly better than the best performing contrast models (**: better at significance level α = 0.01, *: α = 0.05) (Collins et al., 2005).

First, the baseline model 'In' surpasses the 'Out + In' model, which shows that the NMT model tends to fit the out-of-domain features if we directly include the out-of-domain data in the in-domain training data, as the domain-specific features embodied in the out-of-domain data greatly outnumber those in the in-domain data. However, we also found that the model overfits quickly if we only use the in-domain data, so it is necessary to make use of the out-of-domain data to improve translation performance.

Second, we found that when the in-domain data is much smaller than the out-of-domain data, some contrast methods for domain adaptation, such as DC, DM, TTM and ADM, did not perform well. They were worse than the baseline model 'In' and only slightly better than 'Out + In'. These methods all try to take domain information into account in their own ways, which indeed brings improvement compared with the 'Out + In' model. However, as the out-of-domain data is much larger than the in-domain data, the model still tends to fit the out-of-domain data and ignore the in-domain information, which degrades the final performance. Therefore, it is necessary to handle the in-domain data separately in some way. 'Sampler' and 'Fine Tune' perform better because they receive much more information from the in-domain data than the other methods, but they do not make use of domain information when translating.

Last, our model achieves the best performance among all the contrast models. The shared encoder extracts the domain-invariant features of the two domains with the help of the discriminator, so that the shared part is well trained on all the in-domain and out-of-domain data. In the meantime, we also consider the domain-specific features, and the private encoder-decoders receive enough information from the in-domain data to prevent the whole model from overfitting the out-of-domain features.
De-En        test06   test07   average
In           23.36    25.00    24.18
Out + In     20.69    22.43    21.56
Sampler      26.83    29.01    27.92
Fine Tune    27.02    29.19    28.11
our method   27.97*   30.67**  29.32

Table 2: Results of the WMT 07 De-En translation experiments.

En-De                     test13   test14   test15
In                        25.83    21.97    24.64
Out + In                  26.45    23.21    25.85
Sampler                   29.70    25.71    28.29
Fine Tune                 30.48    26.55    28.62
Sennrich et al. (2015a)   28.20    24.40    26.70
Wang et al. (2017b)       28.58    24.12    -
our method                30.99    26.94    29.30*
Table 3: Results of the IWSLT 15 En-De experiments. The results in the second part were taken directly from the corresponding papers.
De-En and En-De Experiments

The results of the De-En and En-De experiments are shown in Table 2 and Table 3, respectively. Results are measured using the word-based 4-gram BLEU score (Papineni et al., 2002) computed with the multi-bleu.pl script. In these two experiments, we only compared our method with the baseline models and the competitive contrast methods 'Sampler' (Chu et al., 2017) and 'Fine Tune' (Luong and Manning, 2015). Similar to the previous results, our method still achieves the best performance among all the contrast models, which demonstrates again that our model is effective and general across different language pairs and different domains.
Analysis

We made some detailed analysis to empirically show the effectiveness of our model, based on the En-Zh translation task.
Ablation Study

In order to further understand the impact of the components of the proposed model, we performed further studies by training multiple versions of our model with some components removed: the first model removed the domain discriminator but preserved the private parts; the second one removed the private encoder-decoders but kept the domain discriminator; the last one removed both of these two parts.

Results are shown in Table 4. As expected, the best performance is obtained with the simultaneous use of all the tested elements.

DCN    Private    dev    test    average
√      √
√      ×
×      √
×      ×
Table 4: Results of the ablation study. "DCN" means the discriminator and "Private" means the private encoder-decoder.
              Out-of-Domain   In-Domain        distance
- DCN         (12.9, 5.8)     (-38.9, -17.5)   56.8
Full Model    (-4.4, 3.0)     (13.2, -9.0)     21.3
Table 5: The average coordinate values of the hidden states of each domain and the distance between them. '- DCN' is the model without the domain discriminator.

When we removed the private encoder-decoders, the score was reduced by 0.79, which indicates that our private parts can preserve some useful domain-specific information that is abandoned by the shared encoder. When we removed the discriminator, the score was reduced by 0.76. This result supports our idea that modeling common features from out-of-domain data can benefit in-domain translation. When we removed both components, we got the lowest score. Overall, the results show that every component plays an important role in our model.
Visualization of the Shared Encoder

To verify whether the shared encoder has learned domain-invariant knowledge, we did the following experiments using the model without the discriminator and our full model with the domain discriminator from the former subsection. We sampled 3000 sentences randomly from the out-of-domain and 1000 sentences from the in-domain En-Zh parallel data as the test data. They were then fed into the shared encoders of the two models to get the reshaped feature maps as Equation 11 describes. Next, we used the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique (http://lvdmaaten.github.io/tsne/) to reduce the dimensionality of the hidden states. Results are shown in Figure 3. We also calculated the average coordinate values of each domain's hidden states; the results are shown in Table 5.
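A sketch of this visualization step with scikit-learn's t-SNE is given below; pooling the encoder states into one vector per sentence is our simplification, and the plotting details are not the authors' script.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_domain_separation(out_states, in_states):
    """Project per-sentence shared-encoder representations to 2-D and plot them.

    out_states / in_states: arrays of shape (num_sentences, feature_dim),
    e.g. mean-pooled hidden states of the shared encoder for each sampled sentence.
    """
    feats = np.concatenate([out_states, in_states], axis=0)
    proj = TSNE(n_components=2).fit_transform(feats)
    n_out = len(out_states)
    plt.scatter(proj[:n_out, 0], proj[:n_out, 1], c="pink", s=5, label="out-of-domain")
    plt.scatter(proj[n_out:, 0], proj[n_out:, 1], c="gold", s=5, label="in-domain")
    plt.legend()
    plt.show()
```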
From the figure, we can see that there is an obvious separation in the results of the model without the discriminator, and the numerical analysis also supports this point, which indicates that the shared encoder, without the help of the discriminator, treats the data from different domains differently: the domain-shared features and domain-specific features are simply mixed together. On the contrary, the output of the shared encoder of our full model is well distributed. This proves that the discriminator can help the shared encoder extract domain-invariant features, which then helps to improve the in-domain translation performance.
Out-of-Domain Translation

Although the purpose of our work is to improve the in-domain translation performance, the domain-invariant features extracted from the training data are also beneficial to the out-of-domain translation performance. To prove this, we use the NIST 03, 04 and 05 test sets, which are mainly related to the News domain, as our out-of-domain test sets.
News domain as ourout-of-domain test set. Noting that the origin setwas designed for the Zh-En translation task andeach sentence has four English references, we justchose the first reference as the source side sen-tence for our En-Zh translation task. The resultsare shown in the Table 6 We can conclude fromthe results that the "Fine Tune" method suffered acatastrophic forgetting caused by parameter shiftduring the training process. On the contrary, ourmethod can achieve a mild improvement on theout-of-domain compared to the baseline system.
Experiments Based on Transformer

Transformer (Vaswani et al., 2017) is an efficient NMT architecture. To test the generality of our method, we also conducted relevant experiments based on the Transformer model, implemented on top of the Fairseq code (https://fairseq.readthedocs.io/en/latest/index.html). The implementation on this translation framework is similar to that on the RNN-based models. The encoder and decoder of our final model consist of 3 sublayers, the number of attention heads was set to 4, and the embedding dimension was set to 256. We also compared with the 'Sampler' and 'Fine Tune' methods based on Transformer. The results are shown in Table 7. According to the table, our method still outperforms the other models, which shows that our method generalizes well across different translation architectures.
En-Zh        dev      test     average
In           32.61    30.33    31.47
Sampler      35.84    33.68    34.76
Fine Tune    36.01    34.03    35.02
our method   37.26**  35.39**  36.33
Table 7: Results of the En-Zh experiments based on the Transformer model.

Conclusion

In this paper, we present a method to make use of out-of-domain data to help in-domain translation. The key idea is to divide the knowledge into domain-invariant and domain-specific parts. We realize this by employing a shared encoder-decoder to process domain-invariant knowledge and a private encoder-decoder for each domain to process the knowledge of the corresponding domain. In addition, a discriminator is added to the shared encoder and adversarial learning is applied to make sure the shared encoder learns domain-invariant knowledge. We conducted experiments on multiple data sets and got consistent and significant improvements. We also verified via experiments that the shared encoder, the domain-specific private encoder-decoders and the discriminator all contribute to the performance improvements.
Acknowledgements
We thank the three anonymous reviewers for their comments and Jinchao Zhang and Wen Zhang for their suggestions. This work was supported by the National Natural Science Foundation of China (NSFC) under projects NO.61876174, NO.61662077 and NO.61472428.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In WMT@EMNLP.

Denny Britz, Quoc Le, and Reid Pryzant. 2017. Effective domain mixing for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 118–126.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158. Association for Computational Linguistics.

Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. arXiv preprint arXiv:1607.01628.

Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-criteria learning for Chinese word segmentation. arXiv preprint arXiv:1704.07556.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of simple domain adaptation methods for neural machine translation. arXiv preprint arXiv:1701.03214.

Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540. Association for Computational Linguistics.

Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural machine translation through multi-task learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1500–1505.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng, and Xuanjing Huang. 2017. Part-of-speech tagging for Twitter with adversarial neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2411–2420.

Caglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, and Lu Yi. 2014. UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation. In LREC, pages 1837–1842.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017a. Sentence embedding for neural machine translation domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 560–566.

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017b. Instance weighting for neural machine translation domain adaptation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1482–1488.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.

Jiali Zeng, Jinsong Su, Huating Wen, Yang Liu, Jun Xie, Yongjing Yin, and Jianqiang Zhao. 2018. Multi-domain neural machine translation with word-level domain context discrimination. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 447–457.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545.

Wen Zhang, Jiawei Hu, Yang Feng, and Qun Liu. 2018. Refining source representations with relation networks for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1292–1303.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems.