CSP: Code-Switching Pre-training for Neural Machine Translation
Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang and Qi Ju*
Tencent Minority-Mandarin Translation
{zieenyang, bojiehu, ambyera, springhuang, damonju}@tencent.com
* indicates corresponding author.

Abstract
This paper proposes a new pre-training method for Neural Machine Translation (NMT), called Code-Switching Pre-training (CSP for short). Unlike traditional pre-training methods, which randomly mask some fragments of the input sentence, the proposed CSP randomly replaces some words in the source sentence with their translation words in the target language. Specifically, we first perform lexicon induction with unsupervised word embedding mapping between the source and target languages, and then randomly replace some words in the input sentence with their translation words according to the extracted translation lexicons. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragment of the input sentence. In this way, CSP is able to pre-train the NMT model by explicitly making the most of the cross-lingual alignment information extracted from the source and target monolingual corpora. Additionally, we relieve the pretrain-finetune discrepancy caused by artificial symbols like [mask]. To verify the effectiveness of the proposed method, we conduct extensive experiments on unsupervised and supervised NMT. Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods.
1 Introduction

Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), which typically follows the encoder-decoder framework, directly applies a single neural network to transform the source sentence into the target sentence. With tens of millions of trainable parameters in the NMT model, translation tasks are usually data-hungry, and many of them are low-resource or even zero-resource in terms of training data. Following the idea of unsupervised and self-supervised pre-training methods in the NLP area (Peters et al., 2018; Radford et al., 2018, 2019; Devlin et al., 2019; Yang et al., 2019), some works have been proposed to improve the NMT model with pre-training, making full use of the widely available monolingual corpora (Lample and Conneau, 2019; Song et al., 2019b; Edunov et al., 2019; Huang et al., 2019; Wang et al., 2019; Rothe et al., 2019; Clinchant et al., 2019). Typically, two different branches of pre-training approaches have been proposed for NMT: model-fusion and parameter-initialization.

The model-fusion approaches seek to incorporate the sentence representation provided by a pre-trained model, such as BERT, into the NMT model (Yang et al., 2019b; Clinchant et al., 2019; Weng et al., 2019; Zhu et al., 2020; Lewis et al., 2019; Liu et al., 2020). These approaches are able to leverage publicly available pre-trained checkpoints, but they need to change the NMT model to fuse the sentence embedding calculated by the pre-trained model. The large number of parameters of the pre-trained model significantly increases the storage cost and inference time, which makes it hard for this branch of approaches to be used directly in production. As opposed to model-fusion approaches, the parameter-initialization approaches aim to directly pre-train the whole or part of the NMT model with tailored objectives, and then initialize the NMT model with the pre-trained parameters (Lample and Conneau, 2019; Song et al., 2019b). These approaches are more production-ready since they keep the size and structure of the model the same as standard NMT systems.

While achieving substantial improvements, these pre-training approaches have two main drawbacks. Firstly, as pointed out by Yang et al. (2019), the artificial symbols like [mask] used by these approaches during pre-training are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. Secondly, since each pre-training step only involves sentences from the same language, these approaches are unable to make use of the cross-lingual alignment information contained in the source and target monolingual corpora. We argue that, as a cross-lingual sequence generation task, NMT requires a tailored pre-training objective which is capable of making explicit use of cross-lingual alignment signals, e.g., word-pair information extracted from the source and target monolingual corpora, to improve the performance.

To address the limitations mentioned above, we propose Code-Switching Pre-training (CSP) for NMT. We extract the word-pair alignment information from the source and target monolingual corpora automatically, and then apply the extracted alignment information to enhance the pre-training performance.
The detailed training process of CSP can be presented in two steps: 1) perform lexicon induction to get translation lexicons by unsupervised word embedding mapping (Artetxe et al., 2018a; Conneau et al., 2018); 2) randomly replace some words in the input sentence with their translation words from the extracted translation lexicons and train the NMT model to predict the replaced words. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragments based on the context calculated by the encoder. By predicting the sentence fragment which is replaced on the encoder side, CSP is able to attend either to the remaining words in the source language or to the translation words of the replaced fragment in the target language. Therefore, CSP trains the NMT model to: 1) learn how to build the sentence representation for the input sentence, as the traditional pre-training methods do; 2) learn how to perform cross-lingual translation with the extracted word-pair alignment information. In summary, we mainly make the following contributions:

• We propose code-switching pre-training for NMT, which makes full use of the cross-lingual alignment information contained in the source and target monolingual corpora to improve the pre-training for NMT.

• We conduct extensive experiments on supervised and unsupervised translation tasks. Experimental results show that the proposed approach consistently achieves substantial improvements.

• Last but not least, we find that CSP can successfully handle code-switching inputs.
2 Related Work

Several approaches have been proposed to improve NMT with pre-training. Edunov et al. (2019) proposed to feed the last layer of ELMo to the encoder of NMT and investigated several different ways to add pre-trained language model representations to the NMT model. Weng et al. (2019) proposed a bi-directional self-attention language model to get the sentence representation and introduced two individual methods, namely a weighted-fusion mechanism and a knowledge transfer paradigm, to enhance the encoder and decoder. Yang et al. (2019b) proposed a concerted training framework to make the most of BERT in NMT. Zhu et al. (2020) proposed to fuse the representations from BERT with each layer of the encoder and decoder of the NMT model through attention mechanisms. The large number of parameters of the pre-trained model in the approaches discussed above significantly increases the storage cost and inference time, which keeps these approaches somewhat far from production (to be used in production easily, these models need to be distilled into a student model with the same structure and size as standard NMT systems). The other branch of approaches aims to keep the structure and size the same as the standard NMT system and designs pre-training objectives tailored for NMT. Lample and Conneau (2019) proposed the Cross-lingual Language Model (XLM) objective and built a universal cross-lingual encoder. To improve the cross-lingual pre-training, they introduced a supervised translation language modeling objective relying on available parallel data. Song et al. (2019b) proposed the MASS objective to pre-train the whole NMT model instead of only pre-training the encoder as in XLM. CSP builds on top of Lample and Conneau (2019) and Song et al. (2019b), and it explicitly makes full use of the alignment information extracted from the source and target monolingual corpora to enhance pre-training.

There have also been works on applying pre-specified translation lexicons to improve the performance of NMT. Hokamp and Liu (2017) and Post and Vilar (2018) proposed altered beam search algorithms, which take target-side pre-specified translations as lexical constraints during beam search. Song et al. (2019a) investigated a data augmentation method, making code-switched training data by replacing source phrases with their target translations according to pre-specified translation lexicons. Recently, motivated by the success of unsupervised cross-lingual embeddings, Artetxe et al. (2018b), Lample et al. (2018a) and Yang et al. (2018) applied pre-trained translation lexicons to initialize the word embeddings of the unsupervised NMT model. Sun et al. (2019) applied translation lexicons to unsupervised domain adaptation in NMT. In this paper, we apply the translation lexicons automatically extracted from the monolingual corpora to improve the pre-training of NMT.
3 Approach

In this section, we first describe how to build the shared vocabulary for the NMT model; then we present the way we extract the probabilistic translation lexicons; finally, we introduce the detailed training process of CSP.
3.1 Shared vocabulary

This paper processes the source and target languages with the same shared vocabulary created through sub-word toolkits, such as SentencePiece (SP) and Byte-Pair Encoding (BPE) (Sennrich et al., 2016b). We learn the sub-word splits on the concatenation of sentences equally sampled from the source and target corpora. The motivation is two-fold. Firstly, by processing the source and target languages with the shared vocabulary, the encoder of the NMT model is able to share the same vocabulary with the decoder. Sharing the vocabulary between the encoder and decoder makes it possible for CSP to replace the source words in the input sentence with their translation words in the target language. Secondly, as pointed out by Lample and Conneau (2019), the shared vocabulary greatly improves the alignment of embedding spaces.
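To make the vocabulary construction concrete, the following is a minimal sketch of learning a joint sub-word model with SentencePiece on equally sampled source and target sentences. It is only an illustration under our own assumptions (the file names, sample sizes and the 32k vocabulary size are placeholders), not the authors' released pipeline; the experiments below use BPE with the same shared-vocabulary setup.

```python
# Minimal sketch: learn a joint sub-word vocabulary on equally sampled
# source/target sentences (file names and sizes here are hypothetical).
import random
import sentencepiece as spm

def sample_lines(path, n):
    """Reservoir-sample n lines from a monolingual corpus file."""
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n:
                sample.append(line)
            else:
                j = random.randint(0, i)
                if j < n:
                    sample[j] = line
    return sample

# Equal numbers of sentences from each language, concatenated into one file.
with open("joint_corpus.txt", "w", encoding="utf-8") as out:
    out.writelines(sample_lines("mono.src", 1_000_000))
    out.writelines(sample_lines("mono.tgt", 1_000_000))

# One shared sub-word model/vocabulary used by both the encoder and decoder.
spm.SentencePieceTrainer.train(
    input="joint_corpus.txt",
    model_prefix="shared_bpe",
    vocab_size=32000,
    model_type="bpe",
)
```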
3.2 Probabilistic translation lexicons

Recently, some works have successfully learned translation equivalences between word pairs from two monolingual corpora and extracted translation lexicons (Artetxe et al., 2018a; Conneau et al., 2018). Following Artetxe et al. (2018a), we utilize unsupervised word embedding mapping to extract probabilistic translation lexicons with monolingual corpora only. The probabilistic translation lexicons in this paper are defined as one-to-many source-target word translations. Specifically, given separate source and target word embeddings, i.e., $X_e$ and $Y_e$, trained on the source and target monolingual corpora $X$ and $Y$, unsupervised word embedding mapping utilizes self-learning or adversarial training to learn a mapping function $f(X_e) = W X_e$, which transforms the source and target word embeddings into a shared embedding space. With the word embeddings in the same latent space, we measure the similarities between source and target words with the cosine distance of their word embeddings. Then, we extract the probabilistic translation lexicons by selecting the top $k$ nearest neighbors in the shared embedding space. Formally, considering the word $x_i$ in the source language, its top $k$ nearest neighbor words in the target language, denoted as $y'_{i1}, y'_{i2}, \dots, y'_{ik}$, are extracted as its translation words, and the corresponding normalized similarities $s'_{i1}, s'_{i2}, \dots, s'_{ik}$ are defined as the translation probabilities.
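As an illustration of this lexicon-extraction step, the sketch below selects the top-k nearest target neighbours of each source word by cosine similarity and normalises the similarities into translation probabilities. It assumes the embeddings have already been mapped into a shared space (e.g., with the method of Artetxe et al., 2018a); all names are hypothetical and this is not the authors' implementation.

```python
# Minimal sketch of extracting probabilistic translation lexicons from
# source/target embeddings that are already mapped into a shared space.
import numpy as np

def build_lexicon(src_vecs, tgt_vecs, src_words, tgt_words, k=3):
    """Return {source word: [(target word, probability), ...]} where the
    k nearest target neighbours (cosine similarity) are the translation
    words and the normalised similarities are the translation probabilities."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                         # cosine similarity matrix
    lexicon = {}
    for i, word in enumerate(src_words):
        top = np.argsort(-sims[i])[:k]         # indices of the k nearest targets
        scores = np.clip(sims[i, top], 1e-9, None)  # clip negatives for normalisation
        probs = scores / scores.sum()          # normalised similarities as probabilities
        lexicon[word] = [(tgt_words[j], float(p)) for j, p in zip(top, probs)]
    return lexicon
```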
3.3 CSP: Code-Switching Pre-training

CSP only requires monolingual data to pre-train the NMT model. Given an unpaired source sentence $x \in X$, where $x = (x_1, x_2, \dots, x_m)$ is a source sentence with $m$ tokens, we denote $x^{u:v}$ as the sentence fragment of $x$ from position $u$ to $v$, where $0 < u < v < m$, and $x^{\setminus u:v}$ as the modified sequence in which the fragment $x^{u:v}$ has been replaced.

Similar to Song et al. (2019b), CSP pre-trains a sequence-to-sequence model by predicting the sentence fragment $x^{u:v}$ with the modified sequence $x^{\setminus u:v}$ as input. With the log-likelihood as the objective function, CSP trains the NMT model on the monolingual corpus $X$ as:

$$L(\theta; X) = \frac{1}{|X|} \sum_{x \in X} \log P\!\left(x^{u:v} \mid x^{\setminus u:v}; \theta\right) = \frac{1}{|X|} \sum_{x \in X} \log \prod_{t=u}^{v} P\!\left(x_t \mid x^{u:t-1}, x^{\setminus u:v}; \theta\right)$$

4 Experiments

4.1 Experimental setup

We choose the Transformer as the basic model structure. Following the base model in Vaswani et al. (2017), we set the dimension of the word embeddings to 512, the dropout rate to 0.1 and the number of attention heads to 8. To be comparable with previous works, we use a 4-layer encoder and 4-layer decoder for unsupervised NMT, and a 6-layer encoder and 6-layer decoder for supervised NMT. The encoder and decoder share the same word embeddings.

Datasets and pre-processing Following the work of Song et al. (2019b), we use the monolingual data sampled from the WMT News Crawl datasets for English, German and French, with 50M sentences for each language. For Chinese, we choose 10M sentences from the combination of the LDC and WMT2018 corpora. For each translation task, the source and target languages are jointly tokenized into sub-word units with BPE (Sennrich et al., 2016b). In this paper, we lower-case all of the case-sensitive languages by default, such as English, German and French. The vocabulary is extracted from the tokenized corpora and shared by the source and target languages. For the English-German and English-French translation tasks, we set the vocabulary size to 32k. For Chinese-English, the vocabulary size is set to 60k since few tokens are shared by Chinese and English. To extract the probabilistic translation lexicons, we utilize the monolingual corpora described above to train the embeddings for each language independently using word2vec (Mikolov et al., 2013). We then apply the public implementation of the method proposed by Artetxe et al. (2017) to map the source and target word embeddings to a shared latent space.

Training details We replace consecutive tokens in the source input with their translation words sampled from the probabilistic translation lexicons, starting from a random position u. Following Song et al. (2019b), the length of the replaced fragment is empirically set to roughly 50% of the total number of tokens in the sentence (we test different lengths of the replaced segment and report the results in Appendix B; we find similar results to Song et al. (2019b)). The replaced tokens in the encoder will be the translation tokens 80% of the time, a random token 10% of the time and an unchanged token 10% of the time. In the extracted probabilistic translation lexicons, we only keep the top three translation words for each source word, and we also investigate how the number of translation words affects the training process. All of the models are implemented in PyTorch and trained on 8 P40 GPU cards; the code we used can be found in the attached file. We use the Adam optimizer with a learning rate of 0.0005 for pre-training.
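To make the replacement procedure concrete, the following sketch builds a single code-switched pre-training example under the settings described above (a fragment of roughly 50% of the tokens, replaced by a translation token 80% of the time, a random token 10% of the time, and left unchanged 10% of the time). Function and variable names are illustrative; this is a simplified sketch under our own assumptions, not the released code.

```python
# Minimal sketch of constructing one CSP pre-training example.
# `lexicon` maps a source token to a list of (target token, probability)
# pairs (the probabilistic translation lexicon); `vocab` is the shared
# token list used for the 10% random replacements.
import random

def make_csp_example(tokens, lexicon, vocab):
    m = len(tokens)
    length = max(1, m // 2)                   # ~50% of the sentence
    u = random.randint(0, m - length)         # random start position
    v = u + length                            # fragment is tokens[u:v]

    encoder_input = list(tokens)
    for t in range(u, v):
        r = random.random()
        if r < 0.8 and tokens[t] in lexicon:  # translation token 80% of the time
            words, probs = zip(*lexicon[tokens[t]])
            encoder_input[t] = random.choices(words, weights=probs, k=1)[0]
        elif r < 0.9:                         # random token 10% of the time
            encoder_input[t] = random.choice(vocab)
        # else: keep the original token (10% of the time)

    decoder_target = tokens[u:v]              # the decoder predicts the replaced fragment
    return encoder_input, decoder_target
```

Note that, unlike masking-based objectives, the encoder input here never contains artificial symbols such as [mask], which is what relieves the pretrain-finetune discrepancy discussed in the introduction.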
4.2 Unsupervised NMT

In this section, we describe the experiments on unsupervised NMT, where we only utilize monolingual data to fine-tune the NMT model based on the pre-trained model.

Experimental settings For the unsupervised English-German and English-French translation tasks, we take similar experimental settings to Lample and Conneau (2019) and Song et al. (2019b). Specifically, we randomly sample 5M monolingual sentences from the monolingual data used during pre-training and report BLEU scores on WMT14 English-French and WMT16 English-German. For fine-tuning on the unsupervised Chinese-to-English translation task, we randomly sample 1.6M monolingual sentences for Chinese and English respectively, similar to Yang et al. (2018). We take one NIST dataset as the development set and report the BLEU score averaged over three NIST test sets. To be consistent with the baseline systems, we apply the script multi-bleu.pl to evaluate the translation performance for all of the translation tasks.

Baseline systems We take the following four strong baseline systems. Lample et al. (2018b) achieved state-of-the-art (SOTA) translation performance on the unsupervised English-German and English-French translation tasks by utilizing a cross-lingual vocabulary, denoising auto-encoding and back-translation. Yang et al. (2018) proposed the weight-sharing architecture for unsupervised NMT and achieved SOTA results on the unsupervised Chinese-to-English translation task. Lample and Conneau (2019) and Song et al. (2019b) are among the first endeavors to apply pre-training to unsupervised NMT, and both of them achieved substantial improvements compared to the methods without pre-training.

Results Table 1 shows the experimental results on unsupervised NMT. From Table 1, we can find that the proposed CSP outperforms all of the previous works on the English-to-German, German-to-English, English-to-French and Chinese-to-English unsupervised translation tasks, with as much as +0.7 BLEU points improvement on the German-to-English translation task. In the French-to-English translation direction, CSP also achieves comparable results with the SOTA baseline of Song et al. (2019b). On the Chinese-to-English translation task, CSP even achieves +1.1 BLEU points improvement compared to the reproduced result of Song et al. (2019b). These results indicate that fine-tuning unsupervised NMT on the model pre-trained by CSP consistently outperforms the previous unsupervised NMT systems with or without pre-training.

System | en-de | de-en | en-fr | fr-en | zh-en
Yang et al. (2018) | 10.86 | 14.62 | 16.97 | 15.58 | 14.52
Lample et al. (2018b) | 17.16 | 21.0 | 25.14 | 24.18 | -
Lample and Conneau (2019) | 27.0 | 34.3 | 33.4 | 33.3 | -
Song et al. (2019b) | 28.1 | 35.0 | 37.5 | | -
Lample and Conneau (2019) (our reproduction) | 27.3 | 33.8 | 32.9 | 33.5 | 22.1
Song et al. (2019b) (our reproduction) | 27.9 | 34.7 | 37.3 | 34.1 | 22.8
CSP and fine-tuning (ours) | 28.7 | 35.7 | 37.9 | |

Table 1: The translation performance of the fine-tuned unsupervised NMT models. To reproduce the results of Lample and Conneau (2019) and Song et al. (2019b), we directly run their released codes (https://github.com/facebookresearch/XLM and https://github.com/microsoft/MASS); the configuration we used to run these open-source toolkits can be found in Appendix A.

4.3 Supervised NMT

This section describes our experiments on supervised NMT, where we fine-tune the pre-trained model with bilingual data.
Experimental settings For supervised NMT, we conduct experiments on publicly available data sets, i.e., the WMT14 English-French, WMT14 English-German and LDC Chinese-to-English corpora, which are used extensively as benchmarks for NMT systems. We use the full WMT14 English-German and WMT14 English-French corpora as our training sets, which contain 4.5M and 36M sentence pairs respectively. For the Chinese-to-English translation task, our training data consists of 1.6M sentence pairs randomly extracted from the LDC corpora (LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2004T08, LDC2004E12, LDC2005T10). All of the sentences are encoded with the same BPE codes utilized in pre-training.

Baseline systems For supervised NMT, we consider the following three baseline systems (since model-fusion approaches incorporate many extra parameters, it would not be fair to take them as baselines here; we leave the comparison between CSP and model-fusion approaches to Appendix C). The first one is the work of Vaswani et al. (2017), which achieves SOTA results on the WMT14 English-German and English-French translation tasks. The other two baseline systems are proposed by Lample and Conneau (2019) and Song et al. (2019b), both of which fine-tune the supervised NMT tasks on the pre-trained models. Furthermore, we compare with the back-translation method, which has shown great effectiveness in improving the NMT model with monolingual data (Sennrich et al., 2016a). Specifically, for each baseline system, we translate the target monolingual data used during pre-training (we randomly select target monolingual data of the same size as the bilingual data) back to the source language with a reversely-trained model, and obtain a pseudo-parallel corpus by combining each translation with its original data. At last, the training data, which includes pseudo and parallel sentence pairs, is shuffled and used to train the NMT system.

Results The experimental results on supervised NMT are presented in Table 2. We report the BLEU scores on the English-to-German, English-to-French and Chinese-to-English translation directions. For each translation task, we report the BLEU scores for the standard NMT model and for the model trained with back-translation respectively. As shown in Table 2, compared to the baseline system without pre-training (Vaswani et al., 2017), the proposed model achieves +1.6 and +0.7 BLEU points improvements on the English-to-German and English-to-French translation directions respectively. Even compared to the stronger baseline system with pre-training (Song et al., 2019b), we also achieve +0.5 and +0.4 BLEU points improvements respectively on these two translation directions. On the Chinese-to-English translation task, the proposed model achieves +0.7 BLEU points improvement compared to the baseline system of Song et al. (2019b). With back-translation, the proposed model still outperforms all of the baseline systems.

System | en-de | en-fr | zh-en
Vaswani et al. (2017) | 27.3 | 38.1 | -
Vaswani et al. (2017) (our reproduction) / + BT | 27.0 / 28.6 | 37.9 / 39.3 | 42.1 / 43.7
Lample and Conneau (2019) (our reproduction) / + BT | 28.1 / 29.4 | 38.3 / 39.6 | 42.0 / 43.7
Song et al. (2019b) (our reproduction) / + BT | 28.4 / 29.6 | 38.4 / 39.6 | 42.5 / 44.1
CSP and fine-tuning (ours) / + BT | | |

Table 2: The translation performance of supervised NMT on the English-German, English-French and Chinese-to-English test sets. (+ BT: trains the model with the back-translation method.)

The experimental results above show that fine-tuning the supervised NMT on the model pre-trained by CSP achieves substantial improvements over previous supervised NMT systems with or without pre-training.
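For clarity, the back-translation comparison described in the baseline setup above can be sketched as follows. This is a minimal illustration under our own assumptions: `reverse_model.translate` stands in for whatever reversely-trained NMT system is used, and the data structures are hypothetical.

```python
# Minimal sketch of building the pseudo-parallel corpus for the
# back-translation baselines: target-side monolingual sentences are
# translated back into the source language by a reversely-trained
# (target-to-source) model, then mixed with the real parallel data.
import random

def build_bt_training_data(parallel_pairs, target_mono, reverse_model):
    """parallel_pairs: list of (src, tgt); target_mono: list of tgt sentences."""
    pseudo_pairs = [(reverse_model.translate(t), t) for t in target_mono]
    training_data = parallel_pairs + pseudo_pairs
    random.shuffle(training_data)             # mix pseudo and real pairs
    return training_data
```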
Additionally, it has been verified that CSP is able to work together with back-translation.

4.4 Effect of the number of translation words

In CSP, the probabilistic translation lexicons only keep the top k translation words for each source word. The number of translation words k kept for each word is an important hyper-parameter and should be set carefully for pre-training. A natural question is: how many translation words do we need to keep for each source word? Intuitively, if k is set to a small number, the model may lose generality, since each source word can be replaced with only a few translation words, which severely limits the diversity of the context. Conversely, the accuracy of the extracted probabilistic translation lexicons may be significantly diminished, which would introduce too much noise into pre-training. Therefore, there is a trade-off between generality and accuracy. We investigate this problem by studying the translation performance of unsupervised NMT with different k, where we vary k from 1 to 10 with an interval of 2. We observe both the performance of CSP after pre-training and the translation performance after fine-tuning on the unsupervised NMT tasks, including the English-to-German and English-to-French translation directions. For each translation direction, we first present the perplexity (PPL) score of the pre-trained model averaged on the monolingual validation sets of the source and target languages (for English-German translation, the monolingual validation set for English is built by including all English sentences in the bilingual English-German validation set, and the monolingual validation set for German is built in the same way). We then show the BLEU score of the fine-tuned model on the bilingual validation set. Figure 2 (a) and (c) illustrate the PPL score of the pre-trained model and the BLEU score of the fine-tuned unsupervised NMT model respectively for English-to-German translation. Figure 2 (b) and (d) present the PPL and BLEU scores respectively for English-to-French translation. From Figure 2, it can be seen that, when k is set around 3, the pre-trained model achieves the best validation PPL scores on both the English-to-German and English-to-French translation directions. Similarly, CSP also achieves the best BLEU scores on the unsupervised translation tasks when k is set around 3.

Figure 2: The performance of CSP with the probabilistic translation lexicons keeping different numbers of translation words for each source word: (a) the PPL score of the pre-trained English-to-German model; (b) the PPL score of the pre-trained English-to-French model; (c) the BLEU score of the fine-tuned unsupervised English-to-German NMT model; (d) the BLEU score of the fine-tuned unsupervised English-to-French NMT model.

4.5 Ablation study

To understand the importance of different components of the model pre-trained by CSP, we perform an ablation study by training multiple versions of the supervised NMT model with some components initialized randomly: the word embeddings, the encoder, the attention module between the encoder and decoder, and the decoder. Experiments are conducted on the English-to-German and English-to-French translation tasks. All models are trained without back-translation and the results are reported in Table 3. We find that the two most critical components are the pre-trained encoder and the attention module. This shows that CSP enhances NMT not only in its ability to build sentence representations for the input sentence, but also in its ability to align the source and target languages with the help of word-pair alignment information. Additionally, the experimental results indicate that the pre-trained decoder has little effect on the translation performance.
This is mainly because the decoder only predicts source-side words during pre-training but predicts target-side words during fine-tuning. This pretrain-finetune mismatch makes the pre-trained decoder less helpful for performance improvement.

System | en-de | en-fr
No pre-trained embeddings | 28.4 | 38.5
No pre-trained encoder | 27.9 | 38.2
No pre-trained attention module | 28.1 | 38.3
No pre-trained decoder | 28.8 | 38.8
Full model pre-trained by CSP | 28.9 | 38.8

Table 3: Ablation study on the English-German and English-French translation tasks. The embeddings include the source-side and target-side word embeddings.

4.6 Handling code-switching inputs

Code-switching, in which a single input contains words from different languages, has attracted more and more attention in NMT (Johnson et al., 2017; Menacer et al., 2019). In this section, we show that the proposed CSP is able to enhance the ability of the fine-tuned NMT model to handle code-switching inputs. To present quantitative results, we build two test sets for the supervised Chinese-to-English translation task to evaluate the performance of the translation model on code-switching inputs (the two in-house code-switching test sets can be found in the attached files). We randomly select 200 Chinese-English sentence pairs from the NIST data, based on which we build two code-switching test sets. The first test set, referred to as test A, is built by randomly replacing some phrases in each Chinese sentence with their counterpart English phrases, where the English phrase is the translation result obtained by feeding the corresponding Chinese phrase to the Google Chinese-to-English translator. The second test set, referred to as test B, is constructed by randomly replacing some of the words in each Chinese sentence with their nearest target words in the shared latent embedding space (the same way used by CSP in Section 3.2). Table 4 shows the translation performance of NMT systems on the two code-switching test sets. Besides the baseline systems mentioned in Section 4.3, we also train a Chinese-English multi-lingual system (Johnson et al., 2017) based on the Transformer, which has shown the ability to handle code-switching inputs. From Table 4, we can find that the proposed approach achieves significant improvements over previous works. Compared to the multi-lingual system, we achieve +2.3 and +3.0 BLEU points improvements on test A and test B respectively. A case study can be found in Appendix D.

System | test A | test B
Vaswani et al. (2017) | 28.17 | 32.51
Lample and Conneau (2019) | 28.82 | 32.90
Song et al. (2019b) | 28.70 | 33.21
Multi-lingual system | 30.51 | 35.10
CSP and fine-tuning | 32.84 | 38.17

Table 4: The performance of Chinese-to-English translation on the in-house code-switching test sets.

5 Conclusion

This work proposes a simple yet effective pre-training approach for NMT, i.e., CSP, which randomly replaces some words in the source sentence with their translation words from the probabilistic translation lexicons extracted from monolingual corpora only. To verify the effectiveness of CSP, we investigate two downstream tasks, supervised and unsupervised NMT, on the English-German, English-French and Chinese-to-English translation tasks. Experimental results show that the proposed approach consistently achieves substantial improvements over strong baselines. Additionally, we show that CSP is able to enhance the ability of NMT to handle code-switching inputs. There are two promising directions for future work. Firstly, we are interested in applying CSP to other related NLP areas for code-switching problems.
Secondly, we plan to investigate pre-training objectives that are more effective in utilizing the cross-lingual alignment information for NMT.

Acknowledgments

We sincerely thank the anonymous reviewers for their thorough reviewing and valuable suggestions.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In International Conference on Learning Representations.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Stéphane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT 2019), pages 108–117.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4052–4059, Minneapolis, Minnesota. Association for Computational Linguistics.

Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3509–3514, Hong Kong, China. Association for Computational Linguistics.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Mohamed Amine Menacer, David Langlois, Denis Jouvet, Dominique Fohr, Odile Mella, and Kamel Smaïli. 2019. Machine translation on a parallel code-switched corpus. In Canadian Conference on Artificial Intelligence, pages 426–432. Springer.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Kai Song, Yue Zhang, Heng Yu, Weihua Luo, Kun Wang, and Min Zhang. 2019a. Code-switching for enhancing NMT with pre-specified translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 449–459, Minneapolis, Minnesota. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019b. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2019. Unsupervised bilingual word embedding agreement for unsupervised neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1235–1245, Florence, Italy. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Liang Wang, Wei Zhao, Ruoyu Jia, Sujian Li, and Jingming Liu. 2019. Denoising based sequence-to-sequence pre-training for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4003–4015, Hong Kong, China. Association for Computational Linguistics.

Rongxiang Weng, Heng Yu, Shujian Huang, Weihua Luo, and Jiajun Chen. 2019. Improving neural machine translation with pre-trained representation. arXiv preprint arXiv:1908.07688.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. 2019b. Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46–55, Melbourne, Australia. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823.