Improving Zero-shot Neural Machine Translation on Language-specific Encoders-Decoders

Junwei Liao, Department of Computer Science, University of Electronic Science and Technology of China, Chengdu ([email protected])
Yu Shi, Cognitive Services Research Group, Microsoft, Seattle ([email protected])
Ming Gong, STCA NLP Group, Microsoft, Beijing ([email protected])
Linjun Shou, STCA NLP Group, Microsoft, Beijing ([email protected])
Hong Qu, Department of Computer Science, University of Electronic Science and Technology of China, Chengdu ([email protected])
Michael Zeng, Cognitive Services Research Group, Microsoft, Seattle ([email protected])
Abstract—Recently, universal neural machine translation (NMT) with a shared encoder-decoder has achieved good performance on zero-shot translation. Unlike universal NMT, jointly trained language-specific encoders-decoders aim to achieve a universal representation across non-shared modules, each of which serves one language or language family. The non-shared architecture has the advantage of mitigating internal language competition, especially when the shared vocabulary and model parameters are restricted in size. However, the performance of multiple encoders and decoders on zero-shot translation still lags behind universal NMT. In this work, we study zero-shot translation using language-specific encoders-decoders. We propose to generalize the non-shared architecture and universal NMT by differentiating the Transformer layers between language-specific and interlingua. By selectively sharing parameters and applying cross-attentions, we explore maximizing the representation universality and realizing the best alignment of language-agnostic information. We also introduce a denoising auto-encoding (DAE) objective to jointly train the model with the translation task in a multi-task manner. Experiments on two public multilingual parallel datasets show that our proposed model achieves competitive or better results than universal NMT and a strong pivot baseline. Moreover, we experiment with incrementally adding a new language to a trained model by updating only the new model parameters. With this little effort, zero-shot translation between the newly added language and the existing languages achieves results comparable to a model trained jointly from scratch on all languages.
Index Terms—multilingual neural machine translation, zero-shot, denoising auto-encoding, language-specific encoders-decoders
I. INTRODUCTION
Universal neural machine translation (NMT) has drawn much attention from the machine translation community in recent years, especially for zero-shot translation, which was first demonstrated by [1]. Many subsequent works have been proposed to further improve the zero-shot performance of universal NMT [2]–[7]. However, this shared architecture (Fig. 1a) has several shortcomings: (1) Since all languages share the same vocabulary and weights, the entire model needs to be retrained from scratch when adapting to a new group of languages; (2) The shared vocabulary grows dramatically when adding many languages, especially those that do not share an alphabet, such as English and Chinese. The unnecessarily huge vocabulary makes decoding computationally costly and hinders the deployment of commercial products; (3) This structure can only take text as input and cannot directly accept raw signals from other modalities such as image or speech.

[Fig. 1: Two typical architectures of multilingual NMT. (a) Universal encoder-decoder. (b) Language-specific encoders-decoders.]

Another research direction in machine translation is jointly trained language-specific encoders-decoders [8]–[13], as shown in Fig. 1b. Due to the unshared architecture, the main goal of these works is to improve multilingual translation rather than zero-shot transfer. Hence the zero-shot performance of language-specific encoders-decoders still lags behind the universal model. To address this problem, previous works add extra network layers as an "interlingua" that is shared by all encoders and decoders [8]–[10].

In this paper, we focus on improving zero-shot translation with language-specific encoders-decoders. We propose a novel interlingua mechanism that leverages the advantages of both architectures. Specifically, we fully exploit the characteristics of the Transformer architecture [14] without introducing extra network layers. By selectively sharing parameters and applying cross-attentions, we explore maximizing the representation universality and realizing the best alignment of language-agnostic information. Our method can provide a universal representation good enough for zero-shot translation, making explicit representation alignment unnecessary.

Previous works have shown that monolingual data is helpful to zero-shot translation [9], [10], [15], usually via a reconstruction objective. However, we found that jointly training the model with a denoising auto-encoding (DAE) task together with the translation task is much more beneficial to zero-shot transfer. We argue that a denoising objective can exploit monolingual data more efficiently than a reconstruction objective, which introduces a spurious correlation between source and target sentences.

We verify our method on two public multilingual datasets. Europarl contains four closely related languages (En, De, Fr, Es). MultiUN contains a group of distant languages (En, Ar, Ru, Zh). The results show that our approach achieves competitive or better results than the pivot-based baseline and the universal architecture on zero-shot translation. Furthermore, we show that our model can add new languages via incremental training without retraining the existing modules.

Our main contributions can be summarized as follows:
• We focus on improving zero-shot translation of language-specific encoders-decoders. Our model keeps the advantage of adding new languages without retraining the system and at the same time achieves comparable or better performance than the universal encoder-decoder counterpart and pivot-based methods.
• We propose a novel interlingua mechanism within the Transformer architecture without introducing extra network layers. We also propose multi-task training of DAE and translation to exploit monolingual data more efficiently. These methods bring a significant improvement in the zero-shot performance of language-specific encoders-decoders.
• We empirically explore several important aspects of the proposed methods and give a detailed analysis.

II. RELATED WORKS
A. Zero-shot Neural Machine Translation
Zero-shot NMT has received increasing interest in recent years. [2] used decoder pretraining and back-translation to ignore spurious correlations in zero-shot translation; [3] proposed cross-lingual pretraining of the encoder before training the whole model with parallel data; [4] introduced a consistent agreement-based training method that encourages the model to produce equivalent translations of parallel sentences into auxiliary languages; [5] exploited an explicit alignment loss to align the sentence representations of different languages with the same meaning in a high-level latent space; [6] proposed a fine-tuning technique that uses dual learning [7]. These works all use a universal encoder-decoder, while our approach adopts language-specific encoders-decoders.
B. Language-specific Encoders-Decoders
Most works on multilingual NMT use a universal encoder-decoder. Few works have studied language-specific encoders-decoders, which are more flexible than the universal encoder-decoder. [8] proposed extending the bilingual recurrent NMT architecture to the multilingual case by designing a shared attention-based mechanism between the language-specific encoders and decoders. [9] introduced a neural interlingua into the language-specific encoders-decoders NMT architecture that captures language-independent semantic information in its sentence representation. [10] incorporated a self-attention layer, shared among all language pairs, that serves as a neural interlingua. A series of works [11]–[13] adopted different training strategies to improve multilingual NMT performance on language-specific encoders-decoders without any shared parameters. Although these works make progress on improving multilingual translation at different levels, they still lag behind the universal encoder-decoder on zero-shot translation.
C. Parameter Sharing Methods for Multilingual NMT
Several proposals promote cross-lingual transfer, the key to zero-shot translation, by modifying the model's architecture and selectively sharing parameters. [16] proposed sharing all parameters except the attention mechanism. [17] utilized recurrent units with multiple blocks together with a trainable routing network. [18] developed a contextual parameter generator that can generate the encoder-decoder parameters for any source-target language pair. Our work uses parameter sharing based on the Transformer [14]. The work most similar to ours is [19], but with some major differences: (1) They study parameter sharing methods only on one-to-many translation, while we focus on zero-shot translation in a many-to-many scenario; (2) They share partial self-attention weights in all decoder layers, while we share selected layers of the encoders; (3) They use one shared vocabulary, while we use a separate vocabulary for each language, which makes it more flexible to add new languages without retraining the whole system.

III. APPROACH
In this section, we first introduce the language-specific encoders-decoders adopted in our method. Then we present our approach, which combines an interlingua via parameter sharing with a DAE task jointly trained with the translation task, to improve zero-shot translation on language-specific encoders-decoders.
A. Background: Language-specific Encoders-Decoders for Multilingual NMT
The language-specific encoders-decoders architecture for multilingual NMT is based on the sequence-to-sequence model except that, as Fig. 1b illustrates, each language has its own encoder and decoder. We denote the encoder and the decoder for the i-th language in the system as enc_i and dec_i, respectively. In the language-specific scenario, both the encoder and decoder are considered independent modules that can be freely interchanged to work in all translation directions. We use (x^i, y^j), where i, j \in \{1, \dots, K\}, to represent a pair of sentences translating from a source language i to a target language j, with K languages in total. Our model is trained by maximizing the likelihood over the training sets D_{i,j} of all available language pairs \mathcal{S}. Formally, we aim to maximize \mathcal{L}_{mt}:

\mathcal{L}_{mt}(\theta) = \sum_{(x^i, y^j) \in D_{i,j},\ (i,j) \in \mathcal{S}} \log p(y^j \mid x^i; \theta),   (1)

where the probability p(y^j \mid x^i) is modeled as

p(y^j \mid x^i) = dec_j(enc_i(x^i)).   (2)

[1] showed that a trained multilingual NMT system could automatically translate between unseen pairs without any direct supervision, as long as both source and target languages were included in training. In other words, a model trained, for instance, on Spanish→English and English→French can directly translate from Spanish to French. Such an emergent property of a multilingual system is called zero-shot translation. It is conjectured that zero-shot NMT is possible because the optimization encourages different languages to be encoded into a shared space, so that the decoder is detached from the source languages. A universal encoder-decoder naturally possesses this property because all language pairs share the same encoder and decoder. But for language-specific encoders-decoders, there is no shared part among languages, which makes it hard for transfer learning to carry the knowledge learned on high-resource languages over to low-resource languages. This explains why language-specific encoders-decoders underperform the universal encoder-decoder on zero-shot translation. [12] also attributes this problem to limited shared information. To improve zero-shot translation on language-specific encoders-decoders, we propose the interlingua via parameter sharing and the DAE task.
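To make the objective concrete, the following is a minimal PyTorch sketch of Eq. (1)-(2): one non-shared encoder and decoder per language, selected according to the language pair of each batch. The module layout, layer counts, and helper names are illustrative assumptions, not the paper's exact implementation (which builds on Huggingface Transformers); positional encodings and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class LanguageSpecificNMT(nn.Module):
    """One (non-shared) Transformer encoder and decoder per language."""

    def __init__(self, languages, vocab_sizes, d_model=512, nhead=8, ff=2048, layers=3):
        super().__init__()
        self.embeds = nn.ModuleDict(
            {l: nn.Embedding(vocab_sizes[l], d_model) for l in languages})
        self.encoders = nn.ModuleDict(
            {l: nn.TransformerEncoder(
                 nn.TransformerEncoderLayer(d_model, nhead, ff, batch_first=True), layers)
             for l in languages})
        self.decoders = nn.ModuleDict(
            {l: nn.TransformerDecoder(
                 nn.TransformerDecoderLayer(d_model, nhead, ff, batch_first=True), layers)
             for l in languages})
        self.out_proj = nn.ModuleDict(
            {l: nn.Linear(d_model, vocab_sizes[l]) for l in languages})

    def forward(self, src_lang, tgt_lang, src_ids, tgt_ids):
        # Eq. (2): p(y^j | x^i) = dec_j(enc_i(x^i)); modules are picked per language.
        memory = self.encoders[src_lang](self.embeds[src_lang](src_ids))
        tgt_in = self.embeds[tgt_lang](tgt_ids[:, :-1])
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
        hidden = self.decoders[tgt_lang](tgt_in, memory, tgt_mask=causal)
        logits = self.out_proj[tgt_lang](hidden)
        # Eq. (1): maximize the log-likelihood of the shifted target tokens.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
```

Training sums this loss over all available (source, target) directions in \mathcal{S}.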
B. Interlingua via Parameter Sharing and Selective Cross-attention
Previous works [8]–[10] propose using an interlingua layer to bring a shared space to language-specific encoders-decoders. The interlingua is a shared component between encoders and decoders, which maps the output of the language-specific encoders into a common space and produces an intermediate universal representation as the input to the decoders. They implemented the interlingua as extra network layers, which makes the model complicated and training inefficient. Unlike their methods, we propose implementing the interlingua by sharing Transformer [14] layers of the language-specific encoders. Fig. 2 gives a schematic diagram of our proposed model.

[Fig. 2: Interlingua via parameter sharing and selective cross-attention. We use English and French as two example languages for illustration.]

TABLE I: Overall dataset statistics, where each pair has a similar number of training samples. All the remaining directions are used to evaluate zero-shot performance.

Dataset     Parallel pairs          Size/pair
Europarl    Es-En, De-En, Fr-En     0.6M
MultiUN     Ar-En, Ru-En, Zh-En     2M

[20], [21] show that transfer is possible even when there is no shared vocabulary across the monolingual corpora. The only requirement is that there are some shared parameters in the top layers of the multilingual encoder. Inspired by their work, we share the top layers of the language-specific encoders while keeping the embedding and low layers intact. In this way, the model gains transfer-learning ability via parameter sharing while maintaining the flexibility of multiple encoders via independent vocabularies. On the language-specific decoders' side, we keep the decoders separate without sharing parameters.

Intuitively, to generate zero-shot language sentences from the interlingua representation in a common shared space, the cross-attention between the interlingua and the decoder layers should capture high-level information, such as semantics, which is language-agnostic. [22] shows that for multilingual NMT, the top layers of the decoder capture more language-specific information. Based on their conclusion, we can infer that the low layers of the decoder capture more language-agnostic information. Combining the two points above, we only conduct cross-attention between the interlingua and the low layers of the decoder to capture language-agnostic knowledge more effectively. We denote by enc'_i the language-specific part of encoder i, and by dec'_j the decoder j with modified cross-attention. Equation (2) is updated to

p(y^j \mid x^i) = dec'_j(\mathrm{interlingua}(enc'_i(x^i))),   (3)

where interlingua denotes the interlingua realized via shared Transformer layers.
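As a minimal sketch of the parameter-sharing idea, the snippet below builds language-specific bottom encoder layers and a stack of top layers whose parameters are shared by every language; the split of 3 private plus 6 shared layers follows the main setup, while the helper names (make_layer, private_encoders) are illustrative assumptions rather than the paper's code.

```python
import torch.nn as nn

def make_layer(d_model=512, nhead=8, ff=2048):
    return nn.TransformerEncoderLayer(d_model, nhead, ff, batch_first=True)

languages = ["en", "de", "fr", "es"]

# Top layers are shared by every language and act as the interlingua.
interlingua = nn.ModuleList([make_layer() for _ in range(6)])

# Embeddings and low layers stay language-specific (enc'_i in Eq. (3)).
private_encoders = nn.ModuleDict(
    {lang: nn.ModuleList([make_layer() for _ in range(3)]) for lang in languages})

def encode(lang, x):
    """x: embedded source tokens, shape (batch, seq, d_model)."""
    for layer in private_encoders[lang]:   # language-specific bottom layers
        x = layer(x)
    for layer in interlingua:              # shared interlingua layers
        x = layer(x)
    return x
```

On the decoder side the layers remain unshared, and only the low decoder layers receive cross-attention from this interlingua output; the top decoder layers keep self-attention only.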
C. Denoising Auto-encoder (DAE) Task
We propose to use a DAE task and train it jointly with the translation task. Our training data cover K languages, and each D_k is a collection of monolingual sentences in language k. We assume access to a noising function g, defined below, that corrupts text, and we train the model to predict the original text x^k given g(x^k). More formally, we aim to maximize \mathcal{L}_{dae}:

\mathcal{L}_{dae}(\theta) = \sum_{x^k \in D_k,\ k \in \{1, \dots, K\}} \log p(x^k \mid g(x^k); \theta),   (4)

where x^k is an instance in language k and the probability p(x^k \mid g(x^k)) is defined by

p(x^k \mid g(x^k)) = dec'_k(\mathrm{interlingua}(enc'_k(g(x^k)))).   (5)

The noising function g injects three types of noise to obtain randomly perturbed text: first, we randomly drop tokens of the sentence with some probability; second, we substitute tokens with a special masking token with another probability; third, the token order is locally shuffled. The candidate tokens can be subword or whole-word units and can span N-grams. Finally, the objective function of our learning algorithm is

\mathcal{L} = \mathcal{L}_{mt} + \mathcal{L}_{dae}.   (6)

We jointly train the multiple encoders and decoders by randomly selecting between the two tasks, multilingual translation and DAE, with equal probability.
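Below is a minimal sketch of the noising function g. The deletion and masking rates follow the values reported in Section IV (0.2 and 0.1); the local-shuffle window of 3 and the overall function layout are assumptions for illustration.

```python
import random

def noise(tokens, mask_token="[MASK]", p_drop=0.2, p_mask=0.1, window=3):
    """Corrupt a token sequence with deletion, masking, and local shuffling."""
    out = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                      # randomly drop the token
        out.append(mask_token if r < p_drop + p_mask else tok)
    # Local shuffle: each surviving token moves at most (window - 1) positions.
    keys = [i + random.uniform(0, window) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out), key=lambda kv: kv[0])]

# During training, each batch is drawn from either the translation task or the
# DAE task with equal probability, jointly optimizing L = L_mt + L_dae (Eq. (6)).
```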
IV. EXPERIMENTS

A. Setup
We evaluate the proposed approaches against several strong baselines on two public multilingual datasets covering a variety of languages, Europarl [23] and MultiUN [24]. In all experiments, we use BLEU [25] as the automatic metric for translation evaluation.
1) Datasets:
The detailed statistics of the Europarl and MultiUN datasets are given in Table I. To simulate the zero-shot setting, the training set only allows parallel sentences from/to English, where English acts as the pivot language. For the Europarl corpus, to compare with previous work fairly, we follow [4]'s method to preprocess Europarl so as to avoid multi-parallel sentences in the training data and ensure the zero-shot setting. We use dev2006 as the validation set and test2006 as the test set, which each contain 2,000 multi-parallel sentences. For the vocabulary, we use SentencePiece [27] to encode text as WordPiece tokens [28]. Due to the language-specific architecture of our model, each language has its own vocabulary. We choose a vocabulary of 32K wordpieces for each language.

For the MultiUN corpus (http://opus.nlpl.eu/MultiUN.php), we randomly sub-sample 2M sentences per language pair for the training set and 2,000 sentences per language pair for the validation and test sets. The vocabulary size for each language is also 32K.

For the DAE task, we only use monolingual data extracted from the parallel training data. No extra monolingual data are used.

We calculate BLEU scores using the SacreBLEU toolkit [26] with the default tokenizer, except for Chinese, which uses the Zh tokenizer.
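As a sketch, a separate 32K-subword SentencePiece model can be trained per language roughly as follows; the file paths and the unigram model type are assumptions for illustration.

```python
import sentencepiece as spm

for lang in ["en", "de", "fr", "es"]:
    spm.SentencePieceTrainer.train(
        input=f"train.{lang}.txt",     # one side of the parallel training data
        model_prefix=f"spm_{lang}",    # produces spm_<lang>.model / .vocab
        vocab_size=32000,              # 32K wordpieces per language
        model_type="unigram",          # assumption; any subword model works here
    )
```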
2) Model:
Our code was implemented using PyTorch on top of the Huggingface Transformers library (https://github.com/huggingface/Transformers). To decouple the output representation of an encoder from the task, we use the target language code as the initial token of the decoder input, following the practice of [2]. The Transformer layers in our model use d_model = 512, d_ff = 2048, and n_heads = 8. The encoders have 9 layers, with the 6 top layers sharing parameters as the interlingua. The decoders have 12 layers, with the 6 low layers conducting cross-attention with the interlingua. The output softmax layer is tied with the input embeddings [29].
3) Training:
We train our models with the RAdam optimizer [30] and the inverse square root learning rate scheduler of [14], with a 5e-4 learning rate and 64K linear warmup steps. Each model is trained on 8 NVIDIA V100 GPUs with 32GB of memory and mixed precision; it takes around three days to train one model. We use a batch size of 32 per GPU. We stop training at optimization step 200K and select the best model based on the validation set. We search the following hyperparameters for zero-shot translation: batch size {32, 96}; learning rate {…}; linear warmup steps {…}. We use dropout 0.1 throughout the whole model. For decoding, we use beam search with beam size 1 for all directions. The noise function uses a token deletion rate of 0.2, a token masking rate of 0.1, WordPiece tokens as the token unit, and unigrams as the span width.
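The snippet below sketches the learning-rate schedule described above (linear warmup to the peak rate 5e-4 over 64K steps, then inverse square-root decay), paired with RAdam; the exact scaling convention is an assumption, since only the shape of the curve matters here.

```python
import torch

PEAK_LR, WARMUP = 5e-4, 64_000

def inv_sqrt_lr(step):
    step = max(step, 1)
    if step < WARMUP:
        return PEAK_LR * step / WARMUP            # linear warmup
    return PEAK_LR * (WARMUP / step) ** 0.5       # inverse square-root decay

model = torch.nn.Linear(512, 512)                 # stand-in for the NMT model
optimizer = torch.optim.RAdam(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: inv_sqrt_lr(step) / PEAK_LR)
```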
4) Evaluation:
We focus our evaluation mainly on the zero-shot performance of the following methods:
• Univ., which stands for directly evaluating a multilingual universal encoder-decoder model after standard training [1].
• Pivot, which performs pivot-based translation using a multilingual universal encoder-decoder model (after standard training) and is often regarded as the gold standard.
• Ours, which represents language-specific encoders-decoders with a shared interlingua, jointly trained with the multilingual translation and DAE tasks.
To ensure a fair comparison in terms of model capacity, all the techniques above use the same Transformer architecture described above, i.e., 9 encoder layers and 12 decoder layers. All other results provided in the tables are as reported in the literature.
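For clarity, the Pivot baseline decodes in two passes through English, as in the following sketch; translate_fn is a hypothetical helper that wraps one supervised decoding direction of the model.

```python
def pivot_translate(translate_fn, src_lang, tgt_lang, text, pivot="en"):
    intermediate = translate_fn(src_lang, pivot, text)   # source -> pivot (English)
    return translate_fn(pivot, tgt_lang, intermediate)   # pivot -> target
```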
TABLE II: Zero-shot results on Europarl corpus. Columns: BRLM-SA† and Agree‡ (previous work); Univ., Pivot, and Ours (our baselines).
En→De / De→En: — 29.07 — 31.81
En→Fr / Fr→En: — 33.30 35.02 —
En→Es / Es→En: — 34.98 — 34.53 36.34 —
Supervised (avg.): — 30.95 33.18 —
De→Fr / Fr→De: — 19.15 20.05 25.02
De→Es: — 22.45 28.28 30.10
Es→De: — 20.70 20.55 25.15
Fr→Es / Es→Fr: — 30.94 34.24 34.90
Zero-shot (avg.): — 24.60 27.33 29.88
† BRLM-SA [3]. ‡ Agree [4].
TABLE III: Zero-shot results on MultiUN corpus. Columns: ZS+LM† (previous work); Univ., Pivot, and Ours (our baselines).
En→Ar: — 37.07 —
Ar→En: — 51.82 —
En→Ru: — 40.83 —
Ru→En: — 47.52 —
En→Zh: — 52.51 —
Zh→En: — 47.74 —
Supervised (avg.): 45.80 46.25 —
Zero-shot directions: Ar↔Ru, Ar↔Zh, Ru↔Zh
† ZS+LM [2].
B. Main Results
The supervised and zero-shot translation results on the Europarl and MultiUN corpora are shown in Table II and Table III.
1) Results on Europarl Dataset:
For zero-shot translation, our model outperforms all other baselines, except on De→Fr against BRLM-SA [3], which uses large-scale monolingual data to pretrain the encoder. In particular, our model outperforms pivot-based translation in all directions. For the Fr↔Es pairs, our model improves about 3.4 BLEU points over pivoting. Pivot-based translation is a strong baseline in the zero-shot scenario that often beats other multilingual NMT baselines ([1], [4]). Pivoting translates source to pivot and then to target in two steps, causing an inefficient translation process; our approach translates source to target directly in any zero-shot direction, which is more efficient than pivoting. For supervised translation, our model obtains the same averaged score as universal NMT and a better score in most directions (En→De, Fr→En, Es→En). The baseline Agree [4], which uses an agreement loss, shows the worst performance. We conjecture that this is because their model is based on the LSTM [31], which has been shown to be inferior to the Transformer on multilingual NMT [32]. As for BRLM-SA [3], which uses Transformer-big and cross-lingual pretraining with large-scale monolingual data, our model uses only the parallel data and still obtains a better score on most supervised and zero-shot directions.
2) Results on MultiUN Dataset:
Our model performs better on supervised and zero-shot translation than universal NMT and the enhanced universal NMT with language-model pretraining for the decoder [2], denoted as ZS+LM in Table III. However, our model still lags behind pivoting by about 2.2 BLEU points. Similar results were observed in previous works [2], [3], whose methods based on the universal encoder-decoder also underperform pivoting on the MultiUN dataset. To surpass pivoting, they introduced back-translation [33] to generate pseudo-parallel sentences for all zero-shot directions based on pretrained models, and further trained their models with these pseudo data. Strictly speaking, their methods perform "zero-resource" translation rather than "zero-shot" translation, since the model has seen explicit examples of the zero-shot language pairs during training, which zero-shot translation should not do [1]. For this reason, although our model could also benefit from training data augmented by back-translation, we decide not to use it merely to beat the pivoting baseline.

V. DISCUSSION
In this section, we discuss some important aspects of the proposed approach.
A. Incremental Language Expansion
First, we explore our model's zero-shot transfer capability when incrementally adding new languages. The incremental training is similar to that in [11]. For illustration, assume we have obtained a language-specific encoders-decoders model with initial training on En↔Es,Fr parallel data, and we now want to add a new language, say De, to the existing modules via incremental training only on En↔De parallel data. In the initial training, we jointly train three encoders and three decoders for the languages En, Es, and Fr, respectively. In the incremental training step, we add both a new encoder and a new decoder for De to the initial model, share the interlingua layers with the initial modules, and randomly initialize the non-shared layers. The weight update is only conducted on the newly added non-shared layers using the En↔De parallel data, while the weights of the six initial modules are frozen. Both steps use the proposed multi-task training of the DAE and translation objectives.

Following the same process, we also experimented with adding Fr or Es to a model initially trained on the other three languages, respectively. We use the same experimental settings as in the main experiment unless stated otherwise. We compare the zero-shot translation results of incremental training with joint training on the Europarl dataset (En↔De,Fr,Es). The results are shown in Table IV.

TABLE IV: Zero-shot results when incrementally adding a language on Europarl (De, Es, Fr ↔ En). Columns: Fr-De, Es-De, Es-Fr (← and →), with zero-shot and parallel averages. Joint, Init., Incr. represent joint training, initial training, and incremental training, respectively.
Initial training on En↔Es,Fr, then incremental training on En↔De. Init.: — — — — 37.99 37.40 — —
Initial training on En↔De,Es, then incremental training on En↔Fr. Init.: — — 30.58 25.18 — — — —
Initial training on En↔De,Fr, then incremental training on En↔Es.

Table IV shows that the universal representation learned in the initial model is easily transferred to the new language. The gap between the incremental training and the joint training is within 0.2∼… best for incremental training. In addition, incremental training followed by continued joint training could further improve the performance. We leave those experiments to future work.

Besides the good zero-shot transferability, incremental training is also lightweight. Due to the interlingua mechanism, we only update the model weights of the non-interlingua layers of the newly added modules. This dramatically reduces the number of trainable parameters. In contrast with the joint training that takes three days, incremental training only takes half a day to reach a comparable result.
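A minimal sketch of this incremental training step is given below: all initially trained modules (including the shared interlingua) are frozen, and only the newly added language's non-shared layers are optimized. The attribute names (private_encoders, decoders, embeds) are illustrative assumptions, not the paper's code.

```python
import torch

def prepare_incremental_training(model, new_lang):
    # Freeze every parameter learned during the initial training.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only the non-shared layers of the newly added encoder and decoder.
    trainable = []
    for module in (model.private_encoders[new_lang],
                   model.decoders[new_lang],
                   model.embeds[new_lang]):
        for p in module.parameters():
            p.requires_grad = True
            trainable.append(p)
    # The optimizer then sees only the new, non-interlingua parameters.
    return torch.optim.RAdam(trainable, lr=5e-4)
```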
1) Conclusion:
The proposed incremental training approach is a lightweight language expansion method that can fully leverage the initial model's interlingua layers and the universal representation learned during the initial training, for both parallel and zero-shot languages.
B. Interlingua Structure Analysis
In this section, we empirically examine the following questions: 1) How important is it to share the interlingua layers of the encoders? 2) How should the language-specific and interlingua layers in the encoders be partitioned, given a fixed total number of layers? 3) Which layers in the decoders should conduct cross-attention to the encoder output? 4) Can sharing interlingua layers among the decoders benefit zero-shot translation?

Before the discussion, we first define some shorthands. We denote the encoder interlingua layers between layer i and layer j as Ei-j, the decoder interlingua layers as Di-j, and the decoder layers that have cross-attention to the interlingua output as Ci-j. Both i and j are inclusive and 1-based.

We use the original Europarl corpus without preprocessing. To accelerate the experiments, we use a small version of the proposed model. It has an architecture similar to that in our main experiment, except that the language-specific encoders have 8 layers and the language-specific decoders have 10 layers. All the experiments use the same random seeds to ensure the results are comparable.

TABLE V: Dissecting the interlingua structure based on zero-shot performance. The best score in each group is bold. E/D/C represent encoder/decoder/cross-attention; the numbers after the letter denote the range of shared encoder layers, shared decoder layers, and cross-attention layers, respectively. Directions: Fr-De, Es-De, Es-Fr (← and →), with zero-shot and parallel averages, on Europarl De, Es, Fr ↔ En.
1. Ablation of sharing encoder layers: E3-8, C1-10; C1-10
2. Changing the number of shared encoder layers: E1-8, DC3-8; E3-8, DC3-8; E5-8, DC3-8; E7-8, DC3-8; DC3-8
3. Changing the decoder cross-attention layers: E3-8, C5-10; E3-8, C1-6
4. Ablation of sharing decoder layers: E3-8, DC3-8; E3-8, C3-8
We design four groups of experiments to answer these questions. Because there are exponentially many combinations over all feasible configurations, we select a subset of them guided by intuition. The results are reported in Table V.

The first group in Table V conducts an ablation study of the interlingua layer sharing among the encoders. Without the interlingua layers, there is a gap of as large as 18.2 BLEU points on zero-shot translation compared with the setup with interlingua layers. This clearly shows that encoder interlingua layers are essential to the zero-shot transfer of translation.

The second group in Table V shows zero-shot translation performance when adjusting the language-specific and interlingua partition of the encoder layers while fixing the other settings. Because the low layers of encoders carry more syntactic information, we treat the top layers as the interlingua. The average zero-shot translation score decreases from 30.1 to 26.2 as the number of interlingua layers decreases. This trend is consistent with the observation of [20] on zero-shot cross-lingual transfer learning with a pretrained language model, which is equivalent to a Transformer encoder. However, the zero-shot difference between the first two rows (E1-8,DC3-8 and E3-8,DC3-8) is as small as 0.2 (30.1 vs. 29.9), meaning that certain language-specific layers can be as important as interlingua layers. As a result, in the main experiment and the incremental language expansion, we choose to share the top 6 layers among the encoders and leave the bottom 3 unshared, intentionally giving the model more language-specific capacity without losing transfer-learning ability in zero-shot translation.

The third group in Table V shows the impact of cross-attention on zero-shot translation. We compare three cases in which the cross-attentions happen at the top, middle, and bottom of the decoder stack, respectively. The average zero-shot score increases from 30.9 to 32.1 when the cross-attention layers change from top (C5-10) to bottom (C1-6). Intuitively, the middle layers of the decoders capture more language-agnostic information than the top and the bottom, so cross-attentions should only happen in the middle of the stack; otherwise, the universal representation from the encoders could be polluted by language-specific information. However, the experimental result does not match this intuition. [22] also points out that the bottom layers of decoders capture more language-agnostic information. One possible explanation is that, compared with language understanding, which most likely happens in the bottom layers, language generation in the top of the stack requires much more capacity to model complex language phenomena. Furthermore, conducting cross-attentions in the bottom layers is also better than in all layers (E3-8,C1-10 in the first group), which supports the view that the universal representation should not be polluted by language-specific information.

The last group is an ablation test on whether to share decoder layers. The result shows that not sharing decoder layers achieves higher BLEU scores on both zero-shot (31.5 vs. 29.0) and supervised (31.7 vs. 31.5) translation. Interestingly, when comparing C1-10 in group 1 and DC3-8 in group 2, and assuming that the impact of different cross-attentions is not more than 2 points (based on the results in group 3), we find that decoder sharing is very beneficial when there is no encoder sharing. Hence, we suspect that the decoder takes responsibility for trading off between keeping the language characteristics and ensuring representation universality. As a result, we do not share decoder layers in our experiments. However, for applications that depend on finetuning to transfer the decoder to other language generation tasks, it is worthwhile to further explore the sharing mechanism of the decoder interlingua layers.
1) Conclusion:
Summarizing Table V, the recommended configuration for the proposed interlingua structure is to share the topmost layers of the encoders, not share any parameters of the decoders, and conduct cross-attentions in the bottom layers of the decoders. Following this principle, in our main experiment we use the configuration E4-9,C1-6 for the model with 9 encoder layers and 12 decoder layers.
C. Denoising Auto-encoder (DAE) Task Analysis
In this section, we conduct an ablation test on the DAE task and compare it with the reconstruction task used by previous works [9], [10], which we denote as the AE (auto-encoder) task. We use the same experiment settings as in the main experiment unless stated otherwise. The results are shown in Table VI.

TABLE VI: Comparison of different auxiliary tasks based on zero-shot translation performance. The metric is BLEU; the best scores are in bold. Directions: Fr-De, Es-De, Es-Fr (← and →), with zero-shot and parallel averages, on Europarl De, Es, Fr ↔ En. Rows: MT, AE, DAE, DAE+Align.

When trained only with the MT task, the zero-shot BLEU score was below 1.0 for all zero-shot directions. Similar results were also observed by previous works [9], [10] using language-specific encoders-decoders. To encourage the model to share the encoder representations across English and non-English source sentences, they added an extra identity language pair (the AE task) to the joint training. The identity pair forces the non-English source embeddings to be close to the English source embeddings. We also experiment with the AE task, and the result is much better than training only with the MT task: the average BLEU score on zero-shot translation is 12.12. When training with the DAE task, we get 30.58 BLEU points on zero-shot translation, which is close to supervised translation (33.29). We conjecture that the AE task is too simple for the model to learn to align the non-English sentence representations with the English ones. To better understand DAE's ability to align the sentence embeddings, we also add an explicit alignment loss on the encoder representations, as used by previous works [5], [11], [15] to improve zero-shot translation. The result (DAE+Align) shows that, compared with the DAE task alone, explicitly aligning the latent representations only increases zero-shot translation by 0.3 BLEU points. This shows that DAE can provide a universal representation good enough for zero-shot translation, making explicit representation alignment unnecessary.

VI. CONCLUSION
In this paper, to improve zero-shot translation on language-specific encoders-decoders, we first introduce an interlingua into language-specific encoders-decoders via parameter sharing of Transformer layers and selective cross-attention between the interlingua and the decoder. We then use a denoising auto-encoding task to better align the semantic representations of different languages. Experiments on the Europarl and MultiUN corpora show that our proposed methods significantly improve zero-shot translation on language-specific encoders-decoders and achieve competitive or better results than the universal encoder-decoder counterparts and pivot-based methods, while keeping the advantage of adding a new language without retraining the existing modules.

REFERENCES
[1] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado et al., "Google's multilingual neural machine translation system: Enabling zero-shot translation," Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017.
[2] J. Gu, Y. Wang, K. Cho, and V. O. Li, "Improved zero-shot neural machine translation via ignoring spurious correlations," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1258–1268.
[3] B. Ji, Z. Zhang, X. Duan, M. Zhang, B. Chen, and W. Luo, "Cross-lingual pre-training based transfer for zero-shot neural machine translation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 115–122.
[4] M. Al-Shedivat and A. Parikh, "Consistency by agreement in zero-shot neural machine translation," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1184–1197.
[5] N. Arivazhagan, A. Bapna, O. Firat, R. Aharoni, M. Johnson, and W. Macherey, "The missing ingredient in zero-shot neural machine translation," arXiv preprint arXiv:1903.07091, 2019.
[6] L. Sestorain, M. Ciaramita, C. Buck, and T. Hofmann, "Zero-shot dual machine translation," arXiv preprint arXiv:1805.10338, 2018.
[7] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma, "Dual learning for machine translation," in Advances in Neural Information Processing Systems, 2016, pp. 820–828.
[8] O. Firat, K. Cho, and Y. Bengio, "Multi-way, multilingual neural machine translation with a shared attention mechanism," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 866–875.
[9] Y. Lu, P. Keung, F. Ladhak, V. Bhardwaj, S. Zhang, and J. Sun, "A neural interlingua for multilingual machine translation," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 84–92.
[10] R. Vázquez, A. Raganato, J. Tiedemann, and M. Creutz, "Multilingual NMT with a language-independent attention bridge," in Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019, pp. 33–39.
[11] C. Escolano, M. R. Costa-jussà, and J. A. Fonollosa, "From bilingual to multilingual neural machine translation by incremental training," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 236–242.
[12] C. Escolano, M. R. Costa-jussà, J. A. Fonollosa, and M. Artetxe, "Multilingual machine translation: Closing the gap between shared and language-specific encoder-decoders," arXiv preprint arXiv:2004.06575, 2020.
[13] ——, "Training multilingual machine translation by alternately freezing language-specific encoders-decoders," arXiv preprint arXiv:2006.01594, 2020.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[15] C. Zhu, H. Yu, S. Cheng, and W. Luo, "Language-aware interlingua for multilingual neural machine translation," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1650–1655.
[16] G. Blackwood, M. Ballesteros, and T. Ward, "Multilingual neural machine translation with task-specific attention," in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3112–3122.
[17] P. Zaremoodi, W. Buntine, and G. Haffari, "Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 656–661.
[18] E. A. Platanios, M. Sachan, G. Neubig, and T. Mitchell, "Contextual parameter generation for universal neural machine translation," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 425–435.
[19] D. Sachan and G. Neubig, "Parameter sharing methods for multilingual self-attentional translation models," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 261–271.
[20] A. Conneau, S. Wu, H. Li, L. Zettlemoyer, and V. Stoyanov, "Emerging cross-lingual structure in pretrained language models," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6022–6034.
[21] M. Artetxe, S. Ruder, and D. Yogatama, "On the cross-lingual transferability of monolingual representations," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4623–4637.
[22] X. Tan, Y. Leng, J. Chen, Y. Ren, T. Qin, and T.-Y. Liu, "A study of multilingual neural machine translation," arXiv preprint arXiv:1912.11625, 2019.
[23] P. Koehn, "Europarl: A parallel corpus for statistical machine translation," in MT Summit, vol. 5, 2005, pp. 79–86.
[24] A. Eisele and Y. Chen, "MultiUN: A multilingual corpus from United Nation documents," in LREC, 2010.
[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[26] M. Post, "A call for clarity in reporting BLEU scores," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 186–191.
[27] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.
[28] T. Kudo, "Subword regularization: Improving neural network translation models with multiple subword candidates," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 66–75.
[29] O. Press and L. Wolf, "Using the output embedding to improve language models," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 157–163.
[30] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," in International Conference on Learning Representations, 2019.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] S. M. Lakew, M. Cettolo, and M. Federico, "A comparison of transformer and recurrent neural networks on multilingual neural machine translation," in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 641–652.
[33] R. Sennrich, B. Haddow, and A. Birch, "Improving neural machine translation models with monolingual data," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 86–96.