Revisiting Simple Domain Adaptation Methods in Unsupervised Neural Machine Translation
Haipeng Sun*, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao, and Chenhui Chu
Harbin Institute of Technology, Harbin, China
National Institute of Information and Communications Technology (NICT), Kyoto, Japan
Osaka University, Osaka, Japan
[email protected], [email protected], [email protected]
{wangrui, khchen, mutiyama, eiichiro.sumita}@nict.go.jp
* Haipeng Sun was an internship research fellow at NICT when conducting this work.

Abstract
Domain adaptation has been well-studied in supervised neural machine translation (SNMT). However, it has not been well-studied for unsupervised neural machine translation (UNMT), although UNMT has recently achieved remarkable results in several domain-specific language pairs. Besides the domain inconsistency between parallel training data and test data in SNMT, there sometimes exists domain inconsistency between the two monolingual training corpora in UNMT. In this work, we empirically categorize different domain adaptation scenarios for UNMT. Based on these scenarios, we revisit the effect of existing representative domain adaptation methods, including batch weighting and fine tuning, in UNMT. Finally, we propose modified methods to improve the performance of domain-specific UNMT systems.
Introduction

Neural machine translation (NMT) has set several new state-of-the-art benchmarks (Bojar et al., 2018; Barrault et al., 2019). Recently, unsupervised NMT (UNMT) has attracted great interest in the machine translation community (Artetxe et al., 2018; Lample et al., 2018a; Yang et al., 2018; Lample et al., 2018b; Sun et al., 2019). Typically, UNMT relies solely on monolingual corpora from similar domains, rather than the bilingual parallel data required by supervised NMT (SNMT), to model translation between the source and target languages, and it has achieved remarkable results on several translation tasks (Lample and Conneau, 2019).

The available training data is ever increasing; however, only related-domain corpora, also called in-domain corpora, are able to improve NMT performance (Koehn and Knowles, 2017). Additional unrelated corpora, also called out-of-domain corpora, are unable to improve, or even harm, NMT performance for some domains such as TED talks and some tasks such as IWSLT (Wang et al., 2017b). Domain adaptation methods have been well-studied in SNMT (Chu et al., 2017; Chen et al., 2017; Wang et al., 2017a; Wang et al., 2017b; van der Wees et al., 2017; Farajian et al., 2017; Chu and Wang, 2018), while they have not been well-studied in UNMT. For UNMT, in addition to the inconsistent domains between training data and test data that also affect SNMT, there can be inconsistent domains between the monolingual training data of the two languages. In practice, it is difficult for some language pairs to obtain sufficient source and target monolingual corpora from the same domain in real-world scenarios. In this paper, we first define and analyze several scenarios for domain-specific UNMT. On the basis of the characteristics of these scenarios, we revisit existing domain adaptation methods, including batch weighting and fine tuning, in UNMT. Finally, we propose modified domain adaptation methods to improve the performance of UNMT in these scenarios. To the best of our knowledge, this paper is the first work to explore the domain adaptation problem in UNMT.
In SNMT, all the corpora are parallel and the domains of the source and target corpora are the same. Therefore, domain adaptation techniques focus on the domain shift between the training and test corpora. In UNMT, there are only monolingual corpora, and the domains of the source and target corpora are sometimes different. Therefore, there are more scenarios for UNMT domain adaptation. Given two different languages L1 and L2, we define two main scenarios according to the domains of the two languages in the training set: monolingual training corpora from the same domain, and monolingual training corpora from different domains, as shown in Table 1.

Scenarios                                     Abbreviation  L1 in-domain  L2 in-domain  L1 out-of-domain  L2 out-of-domain
Monolingual corpora from same domains         II            X             X             ×                 ×
                                              OO            ×             ×             X                 X
                                              IIOO          X             X             X                 X
Monolingual corpora from different domains    IOO           ×             X             X                 X
                                                            X             ×             X                 X
                                              IIO           X             X             X                 ×
                                                            X             X             ×                 X
                                              IO            ×             X             X                 ×
                                                            X             ×             ×                 X

Table 1: The statistics of monolingual training corpora for the different scenarios. X denotes that this monolingual corpus is available in a scenario; × denotes that it is not.

Taking monolingual corpora from different domains as an example, we further divide this scenario into three sub-scenarios: IOO, IIO, and IO, where "I" denotes in-domain data for one language and "O" denotes out-of-domain data for one language. For instance, IOO denotes that there are resource-rich out-of-domain monolingual corpora for both languages and a resource-poor in-domain monolingual corpus for language L2. In particular, we regard "L1 in-domain + L2 out-of-domain" and "L2 in-domain + L1 out-of-domain" as the same scenario IO. Note that the scenarios II and OO serve only as baselines for evaluating the other four scenarios; in this paper, we consider those four scenarios and aim to improve their translation performance. According to these scenarios, we revisit two simple domain adaptation methods, namely batch weighting and fine tuning.
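For readers who prefer a programmatic view, the taxonomy of Table 1 can be summarized as a simple availability map. This is purely illustrative (the names below are ours, not part of any toolkit), and only one of the two symmetric variants of IOO, IIO, and IO is listed:

```python
# Which monolingual corpora are available in each scenario (cf. Table 1).
# Only one of the two symmetric variants is shown for IOO, IIO, and IO.
SCENARIOS = {
    #        L1 in-domain   L2 in-domain  L1 out-of-domain  L2 out-of-domain
    "II":   dict(l1_in=True,  l2_in=True,  l1_out=False, l2_out=False),
    "OO":   dict(l1_in=False, l2_in=False, l1_out=True,  l2_out=True),
    "IIOO": dict(l1_in=True,  l2_in=True,  l1_out=True,  l2_out=True),
    "IOO":  dict(l1_in=False, l2_in=True,  l1_out=True,  l2_out=True),
    "IIO":  dict(l1_in=True,  l2_in=True,  l1_out=True,  l2_out=False),
    "IO":   dict(l1_in=False, l2_in=True,  l1_out=True,  l2_out=False),
}
```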
The batch weighting method (Wang et al., 2017b) for SNMT is difficult to transfer directly to UNMT training because the training data of the source and target languages are sometimes unbalanced (as in the IO and IIO scenarios). Whether training the cross-lingual language model or the UNMT model, the model over-fits the language with the smaller amount of in-domain monolingual data. In other words, the large-scale out-of-domain monolingual corpus of the other language is not fully utilized.
Modified:
To address this issue, we propose a batch weighting method for UNMT domain adaptation that makes full use of the out-of-domain corpus to build a robust UNMT model when only one large-scale out-of-domain monolingual corpus exists in a scenario. Specifically, we adjust the weight of out-of-domain sentences to increase the amount of out-of-domain sentences, rather than increasing that of in-domain sentences (Wang et al., 2018), in every training batch. In our batch weighting method, the out-of-domain sentence ratio is estimated as

R_out = N_out / (N_out + N_in),    (1)

where N_in is the number of mini-batches loaded from the in-domain monolingual corpora in intervals of N_out mini-batches loaded from the out-of-domain monolingual corpora.

For the IO and IIO scenarios, we apply the proposed batch weighting method to train the cross-lingual language model and the UNMT model in turn, since the quantity of training data in the two languages is quite different in these scenarios. For the IOO and IIOO scenarios, there are two large-scale out-of-domain monolingual corpora of similar size, so the batch weighting method is not as necessary.
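To make the interleaving concrete: with the modified setting N_in = 1 and N_out = 30 used in our experiments, Eq. (1) gives R_out = 30/31 ≈ 0.97, i.e., roughly 97% of the mini-batches come from the out-of-domain corpus. The sketch below shows one way such a weighted batch stream could be produced; it is not the actual XLM data loader, and all function and variable names are illustrative.

```python
import itertools

def batch_weighted_stream(in_domain_batches, out_domain_batches, n_in=1, n_out=30):
    """Yield mini-batches so that n_out out-of-domain batches are loaded for
    every n_in in-domain batches, i.e. a fraction R_out = n_out / (n_out + n_in)
    of all training batches are out-of-domain (Eq. 1)."""
    in_iter = itertools.cycle(in_domain_batches)    # small in-domain corpus, reused
    out_iter = itertools.cycle(out_domain_batches)  # large out-of-domain corpus
    while True:
        for _ in range(n_out):
            yield next(out_iter)
        for _ in range(n_in):
            yield next(in_iter)

# Toy usage: each "mini-batch" is just a list of sentences here.
in_batches = [["in-domain sentence A"], ["in-domain sentence B"]]
out_batches = [[f"out-of-domain sentence {i}"] for i in range(100)]
stream = batch_weighted_stream(in_batches, out_batches, n_in=1, n_out=30)
window = list(itertools.islice(stream, 62))
print(sum("out-of-domain" in b[0] for b in window) / len(window))  # ~0.97
```

Cycling over the small in-domain corpus keeps it from being exhausted early, while the ratio keeps the model from over-fitting it.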
For the IIOO and IIO scenarios, we first train the UNMT model on the corresponding corpora until convergence. Then we further fine tune the parameters of the UNMT model on the resource-poor in-domain monolingual corpora of both languages. However, the original fine tuning method is difficult to transfer directly to UNMT training under the IOO and IO scenarios, since in-domain data only exist for language L2 under these scenarios, as shown in Table 1.

Modified:
We propose a modified data selection method (Moore and Lewis, 2010; Axelrod et al., 2011) to select pseudo in-domain data from the out-of-domain data of the other language L1. The traditional data selection for SNMT domain adaptation (Wang et al., 2017a; Wang et al., 2018) is not suitable for UNMT because an in-domain language model cannot be trained when no in-domain corpus exists for language L1.

To address this issue, we back-translate the language L2 in-domain data into language L1 pseudo in-domain data using a UNMT baseline system. Then, we use these corpora to train a cross-lingual language model as the in-domain language model. For the IO scenario, where only an out-of-domain corpus exists for language L1 as shown in Table 1, we randomly select a language L1 out-of-domain corpus that is similar in size to the language L2 in-domain corpus and take the same approach to train a cross-lingual language model as the out-of-domain language model. For the IOO scenario, where out-of-domain corpora exist for both languages, we randomly select out-of-domain corpora for each language that are similar in size to the language L2 in-domain corpus and then train a cross-lingual out-of-domain language model on them.

In practice, we adopt the data selection method (Moore and Lewis, 2010; Axelrod et al., 2011) and rank an out-of-domain sentence s using:

CE_I(s) − CE_O(s),    (2)

where CE_I(s) denotes the cross-entropy of the sentence s computed by the in-domain language model and CE_O(s) denotes the cross-entropy of the sentence s computed by the out-of-domain language model. This measure biases the selection towards sentences that are both like the in-domain corpus and unlike the out-of-domain corpus. We then select the lowest-scoring sentences as the pseudo in-domain corpus.

Finally, after applying the modified data selection method to obtain the pseudo in-domain corpus for language L1, we further fine tune the parameters of the UNMT model on the resource-poor in-domain monolingual corpus of language L2 and the pseudo in-domain corpus of language L1.
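The following is a minimal, self-contained sketch of the selection criterion in Eq. (2). In our system the two language models are the cross-lingual (XLM) language models described above; here a toy unigram language model stands in for them, and all names and example sentences are illustrative.

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Toy add-one-smoothed unigram LM standing in for the cross-lingual LMs."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    """Per-token cross-entropy of a sentence under a unigram LM."""
    toks = sentence.split()
    return -sum(math.log(lm(t)) for t in toks) / max(len(toks), 1)

def select_pseudo_in_domain(out_domain_sents, in_lm, out_lm, k):
    """Rank out-of-domain sentences by CE_I(s) - CE_O(s) (Eq. 2) and
    keep the k lowest-scoring ones as the pseudo in-domain corpus."""
    scored = sorted(out_domain_sents,
                    key=lambda s: cross_entropy(in_lm, s) - cross_entropy(out_lm, s))
    return scored[:k]

# Toy usage with made-up sentences.
in_domain = ["the talk about science", "a talk on education"]
out_domain = ["stocks fell sharply today",
              "a talk about climate science",
              "the central bank raised rates"]
in_lm = train_unigram_lm(in_domain)
out_lm = train_unigram_lm(out_domain)
print(select_pseudo_in_domain(out_domain, in_lm, out_lm, k=1))
```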
Scenarios   Batch weighting   Fine tuning
IIOO        −                 X
IOO         −                 X
IIO         X                 X
IO          X                 X

Table 2: The suitability of the proposed methods for the different scenarios. X denotes that the method is used in this scenario; − denotes that it is not.

Overall, the batch weighting method is used when there is no out-of-domain monolingual corpus for one language, i.e., in the IIO and IO scenarios, while the fine tuning method is suitable for all the scenarios we consider, i.e., IIOO, IOO, IIO, and IO, as shown in Table 2.

Experiments
We considered two language pairs and conducted simulated experiments on the French (Fr)↔English (En) and German (De)↔En translation tasks. For out-of-domain corpora, we used 50M sentences from the WMT monolingual news crawl datasets for each language. For in-domain corpora, we used 200k sentences from the shuffled IWSLT TED-talk training corpora for each language. To make our experiments comparable with previous work (Wang et al., 2018), we report results on IWSLT test2010 and test2011 for Fr↔En and on IWSLT test2012 and test2013 for De↔En.

For preprocessing, we followed the same method as Lample et al. (2018b). That is, we used a shared vocabulary for both languages with 60k subword tokens based on BPE (Sennrich et al., 2016b). We used the same vocabulary, built over both the in-domain and out-of-domain corpora, for the different scenarios. For uniform comparison, if only one in-domain monolingual corpus exists in a scenario, we chose the Fr/De in-domain monolingual corpus; if only one out-of-domain monolingual corpus exists in a scenario, we chose the En out-of-domain monolingual corpus.
We used the XLM UNMT toolkit (https://github.com/facebookresearch/XLM) and followed the settings of Lample and Conneau (2019). We first trained a cross-lingual language model with 6 layers for the encoder. The dimension of the hidden layers was set to 1024. The Adam optimizer (Kingma and Ba, 2015) was used to optimize the model parameters, with an initial learning rate of 0.0001 and the β1 and β2 values following Lample and Conneau (2019). We trained a specific cross-lingual language model for each scenario. The cross-lingual language model was used to initialize the encoder and decoder of the whole UNMT model and to select the pseudo in-domain monolingual corpus.

The UNMT model consisted of 6 layers for the encoder and 6 layers for the decoder. The other parameters were the same as those of the language model. We used the case-sensitive 4-gram BLEU score computed by the multi-bleu.perl script from Moses (Koehn et al., 2007) to evaluate the test sets. The baselines in the different scenarios are UNMT systems trained on the mixed monolingual corpora, including the in-domain and out-of-domain data of the corresponding scenario. Table 3 shows the detailed BLEU scores of all UNMT systems on the De↔En and Fr↔En test sets.
The results for the IIOO, IOO, IIO, and IO scenarios are presented in Table 3. In the IIOO scenario, the fine tuning method could further improve UNMT performance, achieving an average improvement of 4.8 BLEU points over all test sets. In the scenarios where the monolingual training corpora come from different domains (the scenarios unique to UNMT domain adaptation), our modified methods achieved average improvements of 4.4, 11.9, and 6.6 BLEU points in the IOO, IIO, and IO scenarios, respectively. Our modified batch weighting method improved UNMT performance when there is no out-of-domain monolingual corpus for one language, i.e., in the IIO and IO scenarios. Our modified fine tuning method could further improve translation performance when there is no in-domain monolingual corpus for one language, i.e., in the IOO and IO scenarios.

No.  Scenario  Supervision  Method              De-En                 En-De                 Fr-En                 En-Fr
                                                test2012  test2013    test2012  test2013    test2010  test2011    test2010  test2011
1    II        Yes          Wang et al. (2018)  n/a       n/a         23.07     25.40       n/a       n/a         32.11     35.22
2    II        Yes          Base                33.68     35.41       28.09     30.48       36.13     40.07       36.43     37.58
3    II        No           Base                24.42     25.65       21.99     22.72       25.94     29.73       25.32     27.06
4    OO        No           Base                21.21     21.66       10.25     9.90        24.28     28.77       23.08     26.08
5    IIOO      No           Base                24.87     26.00       21.64     22.57       26.05     30.18       26.35     30.12
6    IIOO      No           FT                  29.82     31.57       26.48     28.18       31.23     35.94       29.08     33.67
7    IOO       No           Base                20.94     21.52       16.53     16.80       25.16     29.88       25.18     28.73
8    IOO       No           FT (original)       22.75     23.14       21.09     21.78       28.37     33.57       26.16     30.14
9    IOO       No           FT (modified)       24.33     24.77       24.43     25.59       29.13     34.38       26.45     30.69
10   IIO       No           Base                11.11     10.30       11.54     11.95       17.88     20.32       17.02     18.16
11   IIO       No           FT+BW (original)    19.91     20.19       17.05     17.23       26.84     29.61       23.18     25.18
12   IIO       No           FT+BW (modified)    26.12     27.33       22.63     23.72       27.88     32.16       25.42     28.05
13   IO        No           Base                10.79     10.77       11.44     11.82       18.00     20.91       16.19     16.84
14   IO        No           BW (original)       8.15      7.05        9.28      9.70        18.00     19.52       16.39     17.72
15   IO        No           FT+BW (modified)    19.76     20.22       18.32     18.99       22.59     26.55       20.61     22.79

Table 3: BLEU scores in the different scenarios for the En-De and En-Fr language pairs. Base denotes the baseline in each scenario; FT denotes the fine tuning method; BW denotes the batch weighting method. Original denotes the original method for SNMT; modified denotes our modified method for UNMT. N_in = 10 and N_out = 1 in the original batch weighting method; N_in = 1 and N_out = 30 in the modified batch weighting method; the selected pseudo in-domain corpus size is set to 20K for the fine tuning method in the IO and IOO scenarios. Note that the L2 in-domain data and all the L1 out-of-domain data were used in the original fine tuning method for the IOO scenario.

We now further analyze the batch weighting and fine tuning methods and perform an ablation analysis in the scenarios unique to UNMT domain adaptation.
In Figure 1, we empirically investigate how the out-of-domain ratio R_out in Eq. (1) affects UNMT performance on the En↔De task in the IO scenario. N_in was set to 1. The choice of N_out controls the weight of out-of-domain sentences in every batch across the entire UNMT training process: larger values of N_out allow more out-of-domain sentences to be used in UNMT training, while smaller values of N_out give more importance to in-domain sentences. As Figure 1 shows, every N_out from 10 to 100 enhanced UNMT performance, and a balanced N_out = 30 achieved the best performance.

Figure 1: Effect of the mini-batch size N_out on UNMT performance after introducing the batch weighting method on the En↔De dataset in the IO scenario.

Moreover, we explored the performance of two batch weighting methods: the existing batch weighting method (Wang et al., 2018) used in NMT domain adaptation and our modified batch weighting method designed for UNMT domain adaptation. As shown in Table 4, +BW (Wang et al., 2018) (N_in = 10, N_out = 1) performed worse than the baseline, whereas our modified batch weighting method outperformed the baseline by at least 4.6 BLEU points in the IO scenario on the En-De language pairs.

In addition, we also investigated the training time of our batch weighting method and of the baseline in the IO scenario. As shown in Figure 2, both our batch weighting method and the baseline take about 30 hours for the whole training process in the IO scenario. The BLEU score of the baseline decreased rapidly after a certain number of epochs due to over-fitting, while our proposed batch weighting method continuously improved translation performance throughout training. Over the course of training, our proposed batch weighting method performed significantly better than the baseline. This demonstrates that our proposed batch weighting method is robust and effective.
Figure 2: Learning curves of the baseline and the batch weighting model on the En↔De test2012 set in the IO scenario.

As shown in Figure 3, we empirically investigate how the size of the pseudo in-domain corpus selected for fine tuning affects the performance of the fine-tuned UNMT model on the En↔De task in the IO scenario. A larger corpus brings more pseudo in-domain sentences into further UNMT training; a smaller corpus makes the pseudo in-domain data more precise. Every corpus size from 5k to 10M enhanced UNMT performance, and the UNMT model achieved its best performance when the corpus size was set to 20K, as shown in Figure 3. This indicates that our modified fine tuning method is robust and effective.

Figure 3: Effect of the selected in-domain corpus size on the performance of the fine-tuned UNMT model on the En↔De dataset in the IO scenario. Corpus size "0" indicates the result of the UNMT model with only the batch weighting method.

Moreover, we evaluated different data selection criteria before fine tuning the UNMT system on the En↔De task in the IO scenario. CED outperformed CE by approximately 1 BLEU point, as shown in Table 5. This demonstrates that the pseudo in-domain corpus selected by CED is more precise for improving UNMT performance.

Method   De-En test2012   De-En test2013   En-De test2012   En-De test2013
CED      19.76            20.22            18.32            18.99
CE       18.53            18.87            17.19            17.81

Table 5: Different data selection criteria on the En-De language pairs in the IO scenario. CED denotes the cross-entropy difference criterion CE_I(s) − CE_O(s); CE denotes the cross-entropy criterion CE_I(s). The pseudo in-domain corpus size is set to 20K.

We also investigated the necessity of the denoising auto-encoder during the fine tuning process in the IIOO scenario on the En-De language pairs. As shown in Table 6, the fine-tuned model with denoising performed slightly better than the one without denoising. This demonstrates that the denoising auto-encoder can further enhance the model's learning ability during fine-tuning on in-domain data.

Method          De-En test2012   De-En test2013   En-De test2012   En-De test2013
w/o denoising   29.80            30.99            26.39            27.84
w/ denoising    29.82            31.57            26.48            28.18

Table 6: Denoising analysis on the En-De language pairs in the IIOO scenario.
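For reference, the denoising auto-encoder objective corrupts the in-domain input and trains the model to reconstruct it. Below is a minimal sketch of the usual word-level noise (random word dropping plus local shuffling, in the spirit of Lample et al. (2018a)); the exact noise parameters used by the XLM toolkit may differ, and the names here are illustrative.

```python
import random

def add_noise(tokens, drop_prob=0.1, max_shuffle_distance=3):
    """Corrupt a sentence for the denoising auto-encoder: randomly drop words
    and lightly shuffle the remaining ones within a bounded distance."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    keys = [i + random.uniform(0, max_shuffle_distance) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

random.seed(0)
print(add_noise("this is a clean in-domain sentence".split()))
```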
We performed an ablation analysis to understand the importance of our proposed methods in the IO and IIO scenarios (the scenarios unique to UNMT domain adaptation).

As shown in Table 7, both +FT and +BW outperformed the Base in the IO scenario, and +BW was more suitable for this scenario, achieving a much larger improvement in BLEU score. Moreover, +FT+BW combines the two methods so that they complement each other and further improve UNMT performance, achieving the best result in the IO scenario.

As shown in Table 8, both +FT and +BW outperformed the Base in the IIO scenario. In particular, +FT+BW was better than both +FT and +BW. This means that our modified batch weighting and fine tuning methods each improve UNMT performance in the IIO scenario and, in particular, complement each other to further improve translation performance.
Method    De-En test2012   De-En test2013   En-De test2012   En-De test2013
Base      10.79            10.77            11.44            11.82
+FT       12.63            12.36            12.22            13.32
+BW       17.78            18.00            16.01            16.60
+FT+BW    19.76            20.22            18.32            18.99

Table 7: Ablation analysis on the En↔De dataset in the IO scenario. +BW denotes that a UNMT system was trained with the batch weighting method; +FT denotes that fine tuning was applied to a UNMT baseline system; +FT+BW denotes that fine tuning was applied to a UNMT system trained with batch weighting.

Method    De-En test2012   De-En test2013   En-De test2012   En-De test2013
Base      11.11            10.30            11.54            11.95
+BW       18.96            18.87            20.23            20.81
+FT       19.78            20.70            17.24            18.02
+FT+BW    26.12            27.33            22.63            23.72

Table 8: Ablation analysis on the En↔De dataset in the IIO scenario.

Related Work

Recently, UNMT (Artetxe et al., 2018; Lample et al., 2018a; Yang et al., 2018), which is trained via bilingual word embedding initialization, denoising auto-encoders, back-translation, and shared latent representation mechanisms, has attracted great interest in the machine translation community. Lample et al. (2018b) achieved remarkable results on some similar language pairs by concatenating two bilingual corpora as one monolingual corpus and using monolingual embeddings to initialize the embedding layer of UNMT. Wu et al. (2019) proposed an extract-edit approach to extract and then edit real sentences from the target monolingual corpora instead of back-translation. Sun et al. (2019) proposed bilingual word embedding agreement mechanisms to improve UNMT performance. More recently, Lample and Conneau (2019) achieved state-of-the-art UNMT performance by introducing a pretrained cross-lingual language model. However, previous work only focuses on how to build state-of-the-art UNMT systems for a specific domain and ignores the behavior of UNMT across different domains. Research on domain adaptation for UNMT has been limited, whereas domain adaptation methods have been well-studied in SNMT.

Chu and Wang (2018) gave a survey of domain adaptation techniques for SNMT. Domain adaptation for SNMT can be categorized into two main categories: data optimization and model optimization. Data optimization methods include synthetic parallel corpora generation using in-domain monolingual corpora (Sennrich et al., 2016a; Hu et al., 2019) and data selection from out-of-domain parallel corpora (Wang et al., 2017a; van der Wees et al., 2017; Zhang et al., 2019). Training objective optimization, including instance weighting (Wang et al., 2017b; Chen et al., 2017) and fine tuning (Luong and Manning, 2015; Sennrich et al., 2016a; Freitag and Al-Onaizan, 2016; Servan et al., 2016; Chu et al., 2017), architecture optimization (Kobus et al., 2017; Britz et al., 2017; Gu et al., 2019), and decoding optimization (Freitag and Al-Onaizan, 2016; Khayrallah et al., 2017; Saunders et al., 2019) are common model optimization methods for domain adaptation.
Conclusion

In this paper, we raise the issue of UNMT domain adaptation, since domain adaptation methods for UNMT have not previously been proposed. We empirically describe different scenarios for domain-specific UNMT. Based on these scenarios, we revisit the effect of existing domain adaptation methods, including batch weighting and fine tuning, in UNMT. Experimental results show that our modified methods improve the performance of UNMT in these scenarios. In the future, we will investigate other unsupervised domain adaptation methods to further improve domain-specific UNMT performance.

References
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In ICLR, Vancouver, Canada, April.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In EMNLP, Edinburgh, Scotland, UK, July.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In WMT, Florence, Italy, August.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In WMT, Brussels, Belgium, October.

Denny Britz, Quoc Le, and Reid Pryzant. 2017. Effective domain mixing for neural machine translation. In WMT, Copenhagen, Denmark, September.

Boxing Chen, Colin Cherry, George F. Foster, and Samuel Larkin. 2017. Cost weighting for neural machine translation domain adaptation. In NMT@ACL, Vancouver, Canada.

Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In COLING, Santa Fe, New Mexico, USA.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In ACL, Vancouver, Canada, July.

M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In WMT, Copenhagen, Denmark.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. CoRR, abs/1612.06897.

Shuhao Gu, Yang Feng, and Qun Liu. 2019. Improving domain adaptation translation with domain invariant and specific information. In NAACL, Minneapolis, Minnesota, June.

Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime Carbonell. 2019. Domain adaptation of neural machine translation by lexicon induction. In ACL, Florence, Italy, July.

Huda Khayrallah, Gaurav Kumar, Kevin Duh, Matt Post, and Philipp Koehn. 2017. Neural lattice search for domain adaptation in machine translation. In IJCNLP, Taipei, Taiwan, November.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR, San Diego, California, USA.

Catherine Kobus, Josep Maria Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In RANLP, pages 372–378, Varna, Bulgaria.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In WMT, Vancouver, Canada, August.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, Prague, Czech Republic, June.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In ICLR, Vancouver, Canada.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In EMNLP, Brussels, Belgium.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In IWSLT, Da Nang, Vietnam.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In ACL, Uppsala, Sweden, July.

Danielle Saunders, Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2019. Domain adaptive inference for neural machine translation. In ACL, Florence, Italy, July.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In ACL, Berlin, Germany, August.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In ACL, Berlin, Germany.

Christophe Servan, Josep Maria Crego, and Jean Senellart. 2016. Domain specialization: a post-training domain adaptation for neural machine translation. CoRR, abs/1612.06141.

Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2019. Unsupervised bilingual word embedding agreement for unsupervised neural machine translation. In ACL, Florence, Italy, July.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. In EMNLP, Copenhagen, Denmark.

Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017a. Sentence embedding for neural machine translation domain adaptation. In ACL, Vancouver, Canada, July.

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017b. Instance weighting for neural machine translation domain adaptation. In EMNLP, Copenhagen, Denmark, September.

Rui Wang, Masao Utiyama, Andrew Finch, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2018. Sentence selection and weighting for neural machine translation domain adaptation. TASLP, 26, October.

Jiawei Wu, Xin Wang, and William Yang Wang. 2019. Extract and edit: An alternative to back-translation for unsupervised neural machine translation. In NAACL, Minneapolis, Minnesota, June.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In ACL, Melbourne, Australia.

Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. 2019. Curriculum learning for domain adaptation in neural machine translation. In