M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan Wang, Jianfeng Gao, Dongdong Zhang, Nan Duan
Abstract
This paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning and weight sharing. The model learns universal representations that can map objects occurring in different modalities or expressed in different languages to vectors in a common semantic space. To verify the generalization capability of M3P, we fine-tune the pre-trained model for different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference and multilingual text generation. Evaluation shows that M3P can (i) achieve comparable results on multilingual tasks and English multimodal tasks, compared to the state-of-the-art models pre-trained for these two types of tasks separately, and (ii) obtain new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. We also build a new Multilingual Image-Language Dataset (MILD) by collecting large amounts of (text-query, image, context) triplets in 8 languages from the logs of a commercial search engine.
Introduction

Recently, we have witnessed the rise of a new paradigm of natural language processing (NLP), where general knowledge is learned from raw texts by self-supervised pre-training and then applied to downstream tasks by task-specific fine-tuning. State-of-the-art monolingual pre-trained language models such as BERT [1] have since been extended to multilingual scenarios, such as Multilingual BERT [1], XLM/XLM-R [2, 3] and Unicoder [4], and to multimodal scenarios, such as ViLBERT [5], Unicoder-VL [6], UNITER [7], VLP [8] and Oscar [9]. However, it is still challenging to extend these pre-trained models to multilingual-multimodal scenarios due to the lack of large amounts of aligned multimodal corpora in multiple languages for multilingual-multimodal pre-training. As a result, many multilingual pre-trained models cannot handle vision data (e.g. images and videos), whereas many multimodal pre-trained models, which are trained on texts mainly in English, cannot handle multiple languages.

To address this challenge, this paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P), which aims to learn universal representations that can map objects occurring in different modalities or expressed in different languages to vectors in a common semantic space. This goal is achieved by (i) learning to represent multilingual data using multilingual corpora (i.e. sentences from Wikipedia covering 100 languages) by multilingual-monomodal pre-training, (ii) learning to represent multimodal data using multimodal corpora (i.e. image-caption pairs labeled in English) by monolingual-multimodal pre-training, and (iii) generalizing these representations to multilingual-multimodal tasks by multitask learning and weight sharing.

To verify the generalization capability of M3P, we fine-tune the pre-trained model on different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference and multilingual text generation. Evaluation shows that M3P (i) achieves comparable results on multilingual tasks and English multimodal tasks, compared to the state-of-the-art models pre-trained for these two types of tasks separately, and (ii) obtains new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. To further evaluate the learned multilingual multimodal representations in more languages, we also build a new Multilingual Image-Language Dataset (MILD), which includes (text-query, image, context) triplets in 8 languages, collected from the logs of a commercial search engine. Different from other widely-used image-language datasets such as MSCOCO and Flickr30K, the texts in MILD are shorter and contain more entities, which makes the image-language tasks defined on this dataset (such as image-text retrieval) much more challenging. We will release MILD as a new benchmark to facilitate multilingual multimodal research.
Related Work

Multilingual Pre-trained Models
Multilingual BERT (M-BERT) [1] demonstrates that by performing masked language modeling on a multilingual corpus with a shared vocabulary and shared weights for 102 languages, surprisingly good results can be achieved on the cross-lingual natural language inference (XNLI) [10] task in 15 languages. XLM [2] and Unicoder [4] further improve multilingual BERT by introducing new pre-training tasks based on a bilingual corpus. XLM-R [3] shows that by performing masked language modeling on a large-scale multilingual corpus, new state-of-the-art results on XNLI, MLQA and NER can be obtained. mBART [11] and the Unicoder described in XGLUE [12] extend multilingual models to multilingual text generation tasks based on the encoder-decoder framework and use different denoising auto-encoding pre-training tasks. However, all such models work for NLP tasks only and cannot be applied to multimodal tasks such as image captioning.
Multimodal Pre-trained Models
Recently, a large number of multimodal pre-trained models, such as ViLBERT [5], Unicoder-VL [6], UNITER [7], VLP [8] and Oscar [9], have been developed for vision-language tasks using a multi-layer Transformer as the backbone. These models are pre-trained using similar visual-linguistic tasks and achieve comparable results on many vision-language tasks, such as visual question answering, visual commonsense reasoning, image-text retrieval and image captioning. However, as it is not easy to collect well-aligned visual-linguistic training data in multiple languages, all these models are pre-trained for English only, based on monolingual multimodal corpora such as Conceptual Captions [13], SBU Captions [14], Visual Genome [15] and MSCOCO [16], and cannot be applied to multimodal tasks with non-English inputs.
Multimodal Machine Translation
Multimodal machine translation is a task that involves multilingual and multimodal factors at the same time. [17] proposes a multitask-learning-based method to learn a multimodal translation model and to link visual semantics with the corresponding textual semantics. [18] proposes a multimodal simultaneous neural machine translation method, which leverages visual information as an additional input and verifies its importance for simultaneous translation. However, due to the low-resource issue, these models are usually trained using very small amounts of (image, source caption, target caption translation) triples.

Multitask Multilingual Multimodal Pre-training (M3P)

This section describes how to train M3P using a multilingual-monomodal corpus (e.g. sentences extracted from Wikipedia) and a monolingual-multimodal corpus (e.g. English image-caption pairs). The M3P model uses the model architecture of BERT [1] for understanding tasks and a BERT-based encoder-decoder architecture for generation tasks. We pre-train M3P via multitask learning, optimizing a set of understanding and generation tasks, as shown in Figure 1.

Figure 1: Pre-training tasks used in M3P. (Top) Four understanding tasks. (Bottom) Three generation tasks. M3P employs two separate sets of shared weights for the understanding and generation pre-training objectives. Blue denotes text-based inputs/outputs; yellow denotes image-based inputs/outputs. For understanding (top row), the text stream is either used as a standalone input or concatenated with the image stream. For generation (bottom row), the text-based and image-based inputs are fed into the encoder individually.
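To make the multitask setup concrete, the sketch below shows one way the two groups of pre-training objectives could be interleaved during training. The task names follow the paper, but the alternation schedule, the per-task losses and the toy modules are illustrative assumptions, not the released implementation.

```python
import random
import torch

# Toy stand-ins: in the real model each task has its own head on top of a shared
# Transformer; here each "loss" is just a parameterized scalar so the loop runs.
understanding_weights = torch.nn.Linear(16, 16)   # shared weights for understanding tasks
generation_weights = torch.nn.Linear(16, 16)      # shared weights for generation tasks

def make_loss(module):
    return lambda: module(torch.randn(4, 16)).pow(2).mean()

understanding = {name: make_loss(understanding_weights) for name in ("xMLM", "MMLM", "MRM", "VLM")}
generation = {name: make_loss(generation_weights) for name in ("xDAE", "IC", "DIC")}

optimizer = torch.optim.Adam(
    list(understanding_weights.parameters()) + list(generation_weights.parameters()), lr=1e-4)

for step in range(20):
    # Alternate between the two groups; within a group, sample one task per step (assumed schedule).
    group = understanding if step % 2 == 0 else generation
    task = random.choice(list(group))
    loss = group[task]()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```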
Image Stream
Given an input image, we obtain its image region sequence $v = \{v_1, v_2, ..., v_N\}$ using Faster-RCNN [19], where $v_n \in v$ denotes the $n$-th image region and $N$ denotes the length of $v$. The region embedding of $v_n$ is the visual feature output by Faster-RCNN. The spatial embedding of $v_n$ is a 5-D vector based on its normalized top-left and bottom-right coordinates and the fraction of the image area covered. We project these two embeddings into the text embedding space using two fully-connected (FC) layers. The final input representation of each image region $v_n$ is obtained by summing its projected region embedding and spatial embedding. We also keep the most probable object category of each image region predicted by Faster-RCNN, which will be used in the pre-training procedure.
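As a concrete illustration of the image-stream input, the snippet below builds the 5-D spatial vector from a region's bounding box and sums the two projected embeddings. The 2048-d region feature size, the 768-d text embedding space and the layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def spatial_embedding(box, img_w, img_h):
    """5-D vector: normalized top-left and bottom-right corners plus the covered area fraction."""
    x1, y1, x2, y2 = box
    area_fraction = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_fraction])

region_proj = nn.Linear(2048, 768)   # FC layer projecting the Faster-RCNN region feature
spatial_proj = nn.Linear(5, 768)     # FC layer projecting the 5-D spatial embedding

def region_input_representation(region_feature, box, img_w, img_h):
    # Final input representation = projected region embedding + projected spatial embedding.
    return region_proj(region_feature) + spatial_proj(spatial_embedding(box, img_w, img_h))

# Example: one region feature with its bounding box inside a 640x480 image.
vec = region_input_representation(torch.randn(2048), (32.0, 40.0, 200.0, 300.0), 640, 480)
```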
Text Stream
Given an input text, we obtain its BPE token sequence $w^{l_i} = \{w^{l_i}_1, w^{l_i}_2, ..., w^{l_i}_M\}$ using SentencePiece [20], where $w^{l_i}_m \in w^{l_i}$ denotes the $m$-th BPE token, $M$ denotes the length of $w^{l_i}$, and $l_i$ denotes a language in the language set $L$. The final input representation of each BPE token $w^{l_i}_m$ is obtained by summing its token embedding and position embedding. Moreover, a language embedding [2] is added to each input token to indicate different languages during generation. We use the same vocabulary as XLM-R [3], which includes 250K BPE tokens and covers 100 languages.

Multilingual Masked Language Modeling (xMLM)

Similar to Multilingual BERT [1], XLM [2] and Unicoder [4], this task performs masked language modeling on the multilingual corpus. At each iteration, a batch is composed of sentences sampled from different languages. The sampling probability of a language $l_i$ is defined as $\lambda_{l_i} = p_{l_i}^{\alpha} / \sum_{l_j} p_{l_j}^{\alpha}$, where $p_{l_i}$ is the percentage of $l_i$ in the entire multilingual corpus and the smoothing factor $\alpha$ is set to 0.3. For each batch, we randomly sample 15% of the words and (i) replace them with a special symbol [MASK], (ii) replace them with a random token, or (iii) keep them unchanged, with probability 80%, 10% and 10%, respectively. A bilingual corpus could be used to further improve the multilingual pre-training [2, 4], but this paper uses a multilingual corpus only, as it is nontrivial to collect a bilingual corpus for 100 languages.
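The sketch below illustrates the smoothed language-sampling distribution (α = 0.3) and the 80/10/10 masking rule described above; the corpus shares, vocabulary size and the -100 ignore-label convention are toy assumptions for illustration only.

```python
import numpy as np

def language_sampling_probs(corpus_fractions, alpha=0.3):
    """lambda_i = p_i^alpha / sum_j p_j^alpha; exponent smoothing up-samples low-resource languages."""
    p = np.asarray(corpus_fractions, dtype=np.float64) ** alpha
    return p / p.sum()

print(language_sampling_probs([0.80, 0.15, 0.05]))   # toy shares for three languages

def mask_tokens(token_ids, vocab_size, mask_id, rng=np.random.default_rng(0)):
    """Select 15% of positions; 80% become [MASK], 10% a random token, 10% stay unchanged."""
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)          # -100 marks positions ignored by the loss
    selected = rng.random(len(token_ids)) < 0.15
    labels[selected] = token_ids[selected]
    roll = rng.random(len(token_ids))
    token_ids[selected & (roll < 0.8)] = mask_id
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    token_ids[random_pos] = rng.integers(0, vocab_size, random_pos.sum())
    return token_ids, labels

masked_ids, labels = mask_tokens(list(range(20)), vocab_size=100, mask_id=99)
```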
Multimodal Masked Language Modeling (MMLM)

Similar to ViLBERT [5] and Unicoder-VL [6], this task aims to predict each masked token $w^{en}_m$ in the input caption $w^{en}$ based on its surrounding tokens $w^{en}_{\setminus m}$ and all image regions $v$. We follow the same masking strategy as in xMLM to mask tokens in the input caption. The loss function is defined as:

$$\mathcal{L}_{\mathrm{MMLM}}(\theta) = -\mathbb{E}_{(w^{en}, v) \sim D} \log p_{\theta}(w^{en}_m \mid w^{en}_{\setminus m}, v)$$

where $D$ denotes the set of image-caption pairs and $en$ indicates that the input caption is in English.
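A minimal sketch of the MMLM objective follows, assuming the multimodal Transformer has already produced vocabulary logits at every caption position (conditioned on the unmasked tokens and all image regions); only the masked positions contribute to the loss.

```python
import torch
import torch.nn.functional as F

def mmlm_loss(caption_logits, caption_ids, masked_positions):
    """caption_logits: [M, V] outputs at caption positions; cross-entropy only at masked slots."""
    targets = caption_ids.clone()
    not_masked = torch.ones_like(targets, dtype=torch.bool)
    not_masked[masked_positions] = False
    targets[not_masked] = -100                       # unmasked positions are ignored
    return F.cross_entropy(caption_logits, targets, ignore_index=-100)

# Toy example: a 6-token caption over a 100-token vocabulary; tokens 1 and 4 were masked.
loss = mmlm_loss(torch.randn(6, 100), torch.randint(0, 100, (6,)), [1, 4])
```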
Masked Region Modeling (MRM)

This task aims to reconstruct each masked image region $v_n$ based on the remaining regions $v_{\setminus n}$ and all the caption tokens $w^{en}$. We randomly mask image regions with a probability of 15%. The input representation of each masked image region is set to zeros or kept at its original values with probability 90% and 10%, respectively. The loss function is defined as:

$$\mathcal{L}_{\mathrm{MRM}}(\theta) = -\mathbb{E}_{(w^{en}, v) \sim D} \sum_{k} \left[ \mathrm{MSE}(h_{\theta}(v_k), f(v_k)) + \mathrm{CE}(g_{\theta}(v_k), C(v_k)) \right]$$

where $k$ enumerates the indices of all masked image regions. $\mathrm{MSE}(h_{\theta}(v_k), f(v_k))$ denotes the mean-square-error loss that regresses the Transformer output of each masked region $v_k$ to its visual feature $f(v_k)$; we apply an FC layer to convert the Transformer output of each masked region $v_k$ into a vector $h_{\theta}(v_k)$ of the same dimension as the visual feature $f(v_k)$. $\mathrm{CE}(g_{\theta}(v_k), C(v_k))$ denotes the cross-entropy loss that predicts the object category of each masked region $v_k$; we apply another FC layer to convert the Transformer output of each masked region $v_k$ into scores over $K$ object classes, which further go through a softmax function to obtain a normalized distribution $g_{\theta}(v_k)$. We take the object category with the highest confidence score output by Faster-RCNN as the ground-truth label of $v_k$ and convert it into a one-hot vector $C(v_k) \in \mathbb{R}^K$. Because the top-1 category predicted by Faster-RCNN is not always correct, we leave minimizing the KL divergence between the two distributions for future work.
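The MRM objective combines a regression term and a classification term per masked region. Below is a minimal sketch under assumed dimensions (2048-d visual features, K = 1600 object classes); the head names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, feat_dim, num_classes = 768, 2048, 1600      # illustrative sizes
regress_head = nn.Linear(hidden, feat_dim)            # h_theta: back to the visual feature space
classify_head = nn.Linear(hidden, num_classes)        # g_theta: scores over K object categories

def mrm_loss(masked_region_outputs, target_features, target_classes):
    """masked_region_outputs: [k, hidden] Transformer outputs at masked regions;
    target_features: [k, feat_dim] Faster-RCNN features; target_classes: [k] top-1 labels."""
    mse = F.mse_loss(regress_head(masked_region_outputs), target_features)
    ce = F.cross_entropy(classify_head(masked_region_outputs), target_classes)
    return mse + ce

loss = mrm_loss(torch.randn(3, hidden), torch.randn(3, feat_dim), torch.tensor([5, 12, 7]))
```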
Visual-Linguistic Matching (VLM)

This task aims to learn the instance-level alignment between texts and images. An FC layer $s_{\theta}(w^{en}, v)$ is applied on the Transformer output of [CLS] to predict whether the input image $v$ and the input text $w^{en}$ are semantically matched. Negative image-caption pairs are created by replacing the image or text in a matched sample with a randomly selected image or text from other samples. The loss function is defined as:

$$\mathcal{L}_{\mathrm{VLM}}(\theta) = -\mathbb{E}_{(w^{en}, v) \sim D} \left[ y \log s_{\theta}(w^{en}, v) + (1 - y) \log(1 - s_{\theta}(w^{en}, v)) \right]$$

where $y \in \{0, 1\}$ indicates whether the input image-text pair is matched or not.

Multilingual Denoising Auto-Encoding (xDAE)

This task aims to predict the original BPE token sequence $w^{l_i}$ based on its corrupted form $c(w^{l_i})$, where $c(\cdot)$ is a noising function that corrupts $w^{l_i}$ by performing the following three operations sequentially: (1) shuffle $w^{l_i}$ by adding a noise $\alpha \sim U(0, 3)$ to the input indices and then re-ordering $w^{l_i}$ based on the rank of the noised indices; (2) drop words with a probability of 30%; (3) sample a number of token spans from $w^{l_i}$ with span lengths drawn from a Poisson distribution ($\lambda = 3$), and then replace each token span with a single [MASK] token. Here, 0-length spans correspond to the insertion of [MASK] tokens. The loss function is defined as:

$$\mathcal{L}_{\mathrm{xDAE}}(\theta) = -\mathbb{E}_{w^{l_i} \sim D} \sum_{t=1}^{M} \log p_{\theta}(w^{l_i}_t \mid w^{l_i}_{<t}, c(w^{l_i}))$$
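A sketch of the three-step noising function c(·) used by xDAE is shown below, with the constants stated above (uniform index noise U(0, 3), 30% word dropping, Poisson(λ = 3) span lengths). The rate at which span replacements are triggered and the string-level token representation are illustrative assumptions.

```python
import numpy as np

def xdae_noise(tokens, mask_token="[MASK]", rng=np.random.default_rng(0)):
    # (1) Shuffle: add U(0, 3) noise to each index and re-order by the noised indices.
    noise = rng.uniform(0, 3, size=len(tokens))
    order = np.argsort(np.arange(len(tokens)) + noise)
    tokens = [tokens[i] for i in order]
    # (2) Drop each word with probability 0.3.
    tokens = [t for t in tokens if rng.random() >= 0.3]
    # (3) Replace token spans with a single [MASK]; span lengths ~ Poisson(lam = 3).
    noised, i = [], 0
    while i < len(tokens):
        if rng.random() < 0.2:                 # assumed rate at which spans are sampled
            span = rng.poisson(3)
            noised.append(mask_token)          # a 0-length span simply inserts [MASK]
            i += span
        else:
            noised.append(tokens[i])
            i += 1
    return noised

print(xdae_noise("a small dog runs across the wet grass".split()))
```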
Image Captioning (IC)

This task aims to generate the caption $w^{en}$ based on the image region sequence $v$ detected from the input image. The loss function is defined as:

$$\mathcal{L}_{\mathrm{IC}}(\theta) = -\mathbb{E}_{(w^{en}, v) \sim D} \sum_{t=1}^{M} \log p_{\theta}(w^{en}_t \mid w^{en}_{<t}, v)$$

Denoising Image Captioning (DIC)

Given the image region sequence $v$ detected from an input image, this task aims to generate the caption $w^{en}$ of the input image based on $c(v)$, where $c(\cdot)$ is a noising function that corrupts $v$ by sampling n-gram regions from $v$ and then replacing each n-gram region with a zero-initialized vector. The span lengths are drawn from a Poisson distribution ($\lambda = 3$). The loss function is defined as:

$$\mathcal{L}_{\mathrm{DIC}}(\theta) = -\mathbb{E}_{(w^{en}, v) \sim D} \sum_{t=1}^{M} \log p_{\theta}(w^{en}_t \mid w^{en}_{<t}, c(v))$$

MILD: A Multilingual Image-Language Dataset

MILD contains (text-query, image, context) triplets in 8 languages: English (en), German (de), French (fr), Portuguese (pt), Spanish (es), Italian (it), Japanese (ja) and Chinese (zh). Besides the queries, this dataset also includes contexts of the images, and we evaluate our model both with and without the context present. We construct the dataset in 5 steps.

Step-1: We collect billions of image-text pairs from the logs of a commercial image search engine. Each text is a user query in one of the eight languages (en, de, fr, pt, es, it, ja, zh), and each image was clicked for a user query.

Step-2: We perform image-based filtering by (i) discarding low-quality images whose width or height is smaller than 300 pixels; (ii) discarding sensitive images with pornographic or racy content; and (iii) applying a binary classifier to filter images whose image features cannot be reliably extracted.

Step-3: We perform text-based filtering by (i) discarding sensitive queries with pornographic or racy intent; (ii) using heuristic rules to remove queries with noisy words or numbers; and (iii) discarding short queries whose lengths are less than 5 words.

Step-4: We use an in-house image-text semantic model to predict a relevance score for each query-image pair. This semantic model is trained on millions of human-labeled instances using text features, image features and image-text similarity features. Based on the relevance scores, we keep at most 5 queries for each image, following MSCOCO and Flickr30K. We also include the original title of each image as its context information, which is extracted from the HTML of the web page where the image comes from.

Step-5: We sample a portion of the (query (Q), image (I), context (C)) triples generated in Step-4 to form MILD. Table 1 shows the statistics of MILD.

        en        de       fr       pt       es       it       ja       zh
Train   112,000   10,000   10,000   10,000   10,000   10,000   10,000   2,000
Dev     5,000     5,000    5,000    5,000    5,000    5,000    5,000    1,000
Test    5,000     5,000    5,000    5,000    5,000    5,000    5,000    1,000

Table 1: The data statistics in MILD. Each number denotes the number of unique images.
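To make Steps 2-4 concrete, here is a minimal sketch of the filtering and query-selection rules with the thresholds stated above (300-pixel minimum side, 5-word minimum query length, at most 5 queries per image); the record fields and the sensitive-content and feature-quality checks are hypothetical placeholders.

```python
def keep_image(width, height, is_sensitive, features_ok):
    # Step-2: drop small, sensitive, or unreliable-feature images.
    return width >= 300 and height >= 300 and not is_sensitive and features_ok

def keep_query(query, is_sensitive, is_noisy):
    # Step-3: drop sensitive or noisy queries, and queries shorter than 5 words.
    return not is_sensitive and not is_noisy and len(query.split()) >= 5

def top_queries(scored_queries, k=5):
    # Step-4 (simplified): keep at most the 5 highest-scoring queries per image.
    return sorted(scored_queries, key=lambda pair: pair[1], reverse=True)[:k]

print(top_queries([("red canvas sneakers for kids size three", 0.91),
                   ("shoes", 0.40),
                   ("red shoes with white laces on display", 0.88)]))
```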
MILD differs from existing image-text benchmarks in three aspects: (1) The average query length in MILD is 5.8 words, which is shorter than 10.6 in MSCOCO and 12.3 in Flickr30K; this makes the image-text retrieval task on MILD harder, as the text query is too brief to describe all the elements that occur in the ground-truth image. (2) A portion of the captions in MILD contain named entities such as person, location and organization names; for example, 39.2% of the English queries contain entities (PER, LOC, ORG, DATE, PROD, EVENT or ZIP), and the corresponding number for the English contexts is 54.6%. This leaves a large room for future models to improve on this dataset by introducing new mechanisms to handle these entities. (3) Each image has an additional context text, which is extracted from the web page from which the image comes. Based on a human evaluation of sampled image-query pairs, 80% of the pairs in MILD are matched pairs, in the sense that the query is a plausible caption of its paired image. Figure 2 gives some examples in MILD.

Experiments

We use raw sentences extracted from the Wikipedia dump as the multilingual corpus for multilingual-monomodal pre-training; it includes 101G sentences covering 100 languages. We use Conceptual Captions [13] as the multimodal corpus for monolingual-multimodal pre-training; it contains 3.3 million English image-caption pairs harvested from the Web.

For understanding tasks, we set the hyper-parameters as follows: 768 hidden units, 12 attention heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 encoder layers. In the pre-training stage, we initialize M3P with XLM-R [3] and continue pre-training with xMLM, MMLM, MRM and VLM. We use the Adam optimizer [21] with a linear warm-up [22] and set the learning rate to 1e-4. The total batch size is 1,024 after gradient accumulation. The pre-training stage takes about 4 days to converge on 8 V100 GPUs. In the fine-tuning stage, the batch size is set to 512 and we sample 3 negative cases in VLM. We use the Adam optimizer with β1 = 0.9, β2 = 0.98 and a 5e-5 learning rate.

For generation tasks, we use an encoder-decoder architecture with 768 hidden units, 8 attention heads, GELU activation, a dropout rate of 0.1, a maximum input length of 128, and 12 layers in both the encoder and the decoder. The Transformer parameters are shared between the encoder and the decoder, including the embedding and self-attention modules. In the pre-training stage, we train M3P with xDAE, IC and DIC; the batch size is 1,536 with gradient accumulation and the initial learning rate is 1e-4 with a linear warm-up. In the fine-tuning stage, we reduce the learning rate to 5e-5 with a total batch size of 512. We feed the same language ID into the encoder and decoder, except for multimodal machine translation. We set the beam size to 10 for caption inference.
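The sketch below shows one way to reproduce the optimization setup described above (Adam, a peak learning rate of 1e-4 with a linear warm-up, and a large effective batch via gradient accumulation); the warm-up length, total steps, decay shape and accumulation factor are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(768, 768)                    # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps, total_steps = 4000, 100000             # assumed schedule lengths
def lr_lambda(step):
    # Linear warm-up to the peak learning rate, then linear decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

accum = 8                                            # e.g. 8 micro-batches of 128 = 1,024 total
for step in range(32):
    loss = model(torch.randn(128, 768)).pow(2).mean() / accum
    loss.backward()                                  # gradients accumulate across micro-batches
    if (step + 1) % accum == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```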
Multilingual Image-Text Retrieval. The task of multilingual image-text retrieval is to find the most relevant images given input texts in different languages, or vice versa. We evaluate M3P on Multi30K [28, 29], MSCOCO [16, 30, 31] and MILD. Multi30K extends Flickr30K [32] to German (de), French (fr) and Czech (cs); it contains 31,783 images and provides 5 captions per image in English and German and 1 caption per image in French and Czech. We use the train, dev and test splits defined in [32]. MSCOCO contains 123,287 images and provides 5 captions per image in English, but fewer in Chinese and Japanese: STAIR Captions [33] extends MSCOCO [16] with 820K Japanese captions for COCO images, and [31] extends MSCOCO [16] with Chinese captions for 20K images. We use the same train, dev and test splits for English and Japanese as defined in [34]; for Chinese, we use the COCO-CN split [31]. We use mean Recall (mR) as the metric, which is the average of Recall@1, Recall@5 and Recall@10 on the image-to-text and text-to-image retrieval tasks.

Model                                   Multi30K                      MSCOCO
                                        en     de     fr     cs       en     ja     zh
Results without pre-training
EmbN [23]                               72.0   60.3   54.8   46.3     76.8   73.2   73.5
PAR. EmbN [24]                          69.0   62.6   60.6   54.1     78.3   76.0   74.8
S-LIWE [25]                             76.3   72.1   63.4   59.4     80.9   73.6   70.0
MULE [26]                               70.3   64.1   62.3   57.7     79.0   75.9   75.6
SMALR [27]                              74.5   69.8   65.9   64.8     81.5   77.5   76.7
Results with monolingual multimodal pre-training
Unicoder-VL (w/o fine-tune) [6]         72.0   -      -      -        63.7   -      -
Unicoder-VL (w/ fine-tune on en) [6]           -      -      -               -      -
M3P (w/o fine-tune)                     61.1   35.7   24.7   26.4     62.1   32.1   33.3
M3P (w/ fine-tune on en)                86.0   48.8   39.4   38.8     87.4   54.4   55.8
M3P (w/ fine-tune on each)              86.0   80.2   67.1   66.2     87.4   83.9   77.4
M3P (w/ fine-tune on all)               86.7

Table 2: Multilingual image-text retrieval results on Multi30K and MSCOCO. The metric is mean Recall (mR). Each bold number indicates the best mR score in that column. As MULE and SMALR use different dev/test splits of MSCOCO compared with all the other models, we highlight their numbers in blue. We report the mR results of Unicoder-VL on the en datasets, as it is pre-trained on the same image-caption corpus (i.e. Conceptual Captions) as M3P.

Table 2 shows the evaluation results on Multi30K and MSCOCO, where M3P achieves state-of-the-art results compared to several related works [26, 23, 24, 25, 27]. We study the impact of different fine-tuning strategies: w/o fine-tune, apply M3P to all test sets directly without fine-tuning; w/ fine-tune on en, fine-tune M3P on English and then apply the fine-tuned model to all test sets; w/ fine-tune on each, fine-tune M3P on each language $l_i$ and then apply the fine-tuned model to the test sets of $l_i$; and w/ fine-tune on all, fine-tune M3P on all languages using the merged labeled data and then apply the fine-tuned model to all test sets. Similar to the observations reported in Unicoder [4, 12], the last two fine-tuning strategies lead to the best results; the same sentence in different languages may capture complementary information that helps improve performance. We also compare with Unicoder-VL, which is pre-trained on the same image-caption corpus (i.e. Conceptual Captions) but for English only. Although M3P performs slightly worse than Unicoder-VL on English, it obtains comparable results in all the other languages, which verifies its strong transfer capability; a possible reason is the use of the xMLM task and a larger vocabulary covering 100 languages. In particular, SMALR [27] takes advantage of machine translation to augment Multi30K and MSCOCO. Considering that applying machine translation from English to all other supported languages is not general and relies on a large number of translators, we leave this as an option for future work.

Setting                          en     es     fr     de     it     pt     zh     ja     avg
Results based on Q-I pairs
M3P (w/ fine-tune on en)         19.0   6.1    5.7    5.3    4.5    5.0    13.5   3.3    7.8
M3P (w/ fine-tune on each)       19.0   7.7    7.7
M3P (w/ fine-tune on all)
Results based on Q-I-C triples
M3P (w/ fine-tune on en)         81.6   51.0   52.8   47.7   47.4   47.8   73.0   50.4   56.5
M3P (w/ fine-tune on each)       81.6   54.5   56.6   52.7   52.3   51.4   75.4   58.2   60.3
M3P (w/ fine-tune on all)

Table 3: Multilingual image-text retrieval results on MILD. The metric is mean Recall (mR).

Table 3 shows the evaluation results on MILD. The first batch of results is based on Q-I pairs without using the image contexts. Compared with the results on Multi30K and MSCOCO, the numbers on MILD are much lower, which shows that it is a harder dataset. The second batch of results is based on Q-I-C triples, where each image and its context always appear together as input. The results show that such context information helps a lot in the image-text retrieval task on MILD.
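For reference, the mean Recall (mR) metric reported in Tables 2 and 3 can be computed as below from the rank of the ground-truth item for each query; the ranking lists here are toy data.

```python
def recall_at_k(ranks, k):
    """ranks[i] is the 1-based rank of the correct item for query i."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_recall(i2t_ranks, t2i_ranks):
    # Average of Recall@1, Recall@5 and Recall@10 over image-to-text and text-to-image retrieval.
    scores = [recall_at_k(r, k) for r in (i2t_ranks, t2i_ranks) for k in (1, 5, 10)]
    return 100.0 * sum(scores) / len(scores)

# Toy example: ranks of the ground-truth item for five queries in each direction.
print(mean_recall(i2t_ranks=[1, 3, 12, 2, 6], t2i_ranks=[1, 1, 8, 20, 4]))
```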
Multilingual Image Captioning. The task of multilingual image captioning is to generate captions in specific languages given input images. We evaluate M3P on Multi30K and MSCOCO, using BLEU-4 (B@4) and CIDEr (C) as the metrics.

Model                              Multi30K                                        MSCOCO
                                   en          de          fr         cs           en           ja           zh
                                   B@4/C       B@4/C       B@4/C      B@4/C        B@4/C        B@4/C        B@4/C
VLP (w/ fine-tune on en) [8]       30.1/67.4   -/-         -/-        -/-          36.5/116.9   -/-          -/-
XGPT (w/ fine-tune on en) [35]                 -/-         -/-        -/-                       -/-          -/-
Ja-Generator [33]                  -/-         -/-         -/-        -/-          -/-          38.5/83.3    -/-
COCO-CN [31]                       -/-         -/-         -/-        -/-          -/-          -/-          36.7/98.4
M3P (w/ fine-tune on each)         26.1/57.2   16.1/43.8   7.5/36.1   4.0/28.5     33.7/111.5   40.2/105.1   39.7/109.2
M3P (w/ fine-tune on all)          26.5/59.4

Table 4: Multilingual image captioning results on Multi30K and MSCOCO. The metrics are B@4 and C.

Table 4 shows the evaluation results. Similar to Table 2, M3P still performs worse than the state-of-the-art pre-trained models (VLP and XGPT) on the English image captioning datasets, even though they employ the same image-caption corpus for pre-training, but it shows a strong cross-lingual transfer capability on the non-English datasets in the few-shot settings (i.e., w/ fine-tune on each and w/ fine-tune on all).

Multimodal Machine Translation. The task of multimodal machine translation is to generate sentences in target languages given source sentences together with related images as complementary information. We evaluate M3P on Multi30K and use BLEU-4 (B@4) as the metric. We experiment with four translation directions covering 3 languages: English (en), German (de) and French (fr); all language pairs include en on one of the two sides.

Model                  en->fr   fr->en   en->de   de->en
IMAGINATION [36]       -        -        30.2     -
LIUMCVC [37]           52.7     -        31.1     -
Text-Only NMT [38]     53.5     -        31.6     -
VAG-NMT [38]           53.8     -        31.6     -
M3P

Table 5: Multimodal machine translation results on Multi30K. The metric is B@4.

In Table 5, we compare the performance of M3P against state-of-the-art multimodal machine translation approaches and a text-only baseline. We observe that pre-training provides a significant boost in the BLEU score for each translation direction.

Multilingual Natural Language Inference. The task of multilingual natural language inference is to predict the entailment relation (Entailment, Contradiction or Neutral) between two sentences in a specific language. We evaluate M3P on XNLI [10] based on its original train, dev and test splits, and compare it with the base version (12 layers) of XLM-R [3]. We fine-tune both models on the English labeled data and then apply the fine-tuned models to the test sets in all 15 languages. The evaluation results are listed in Table 6.

Model                                 en     fr     es     de     el     bg     ru     tr     ar     vi     th     zh     hi     sw     ur     avg
XLM-R base [3] (w/ fine-tune on en)   84.6   78.2   79.2   77.0   75.9   77.5   75.5   72.9   72.1   74.8   71.6   73.7   69.8   64.7   65.1   74.2
M3P (w/ fine-tune on en)              82.3   76.3   77.0   74.1   73.2   76.2   74.1   70.3   69.2   73.9   69.6   72.9   68.6   59.4   64.7   72.1

Table 6: Multilingual natural language inference results on XNLI. Test accuracy on the 15 XNLI languages.

From Table 6 we can see that, although M3P is pre-trained for different types of tasks (understanding and generation) from different perspectives (multilingual and multimodal), it still obtains surprisingly good performance on XNLI, which shows the possibility of learning universal representations.
Multilingual Text Generation. We also evaluate M3P on the News Title Generation (NTG) task in XGLUE [12], and compare it with the extended version of Unicoder described in [12]. We fine-tune both models on the English labeled data and then apply the fine-tuned models to the test sets in all 5 languages. The evaluation results are listed in Table 7.

Model                                        en     es     fr     de     ru     avg
Unicoder xDAESC [12] (w/ fine-tune on en)    15.6   9.0    8.7    6.8    7.7    9.6
Unicoder xFNPSC [12] (w/ fine-tune on en)    15.8   11.9   9.9    7.5    8.4    10.7
M3P (w/ fine-tune on en)                     14.1   8.0    7.3    5.2    6.1    8.1

Table 7: Multilingual text generation results on the NTG task in XGLUE. The metric is B@4.

Similar to the trend on XNLI, Table 7 shows that M3P maintains good performance on this multilingual text generation task as well.

Conclusion

We have presented M3P, a new pre-trained model for multilingual-multimodal representation learning. The learned representations show a strong cross-lingual transfer capability and are proven effective on five downstream tasks. To facilitate research on multilingual-multimodal modeling, we also develop a large-scale dataset called MILD and will make it publicly available to the research community.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.

[2] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067, 2019.

[3] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

[4] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964, 2019.

[5] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019.

[6] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.

[7] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.

[8] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. In AAAI, 2020.

[9] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165, 2020.

[10] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053, 2018.

[11] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv, 2020.

[12] Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401, 2020.
[13] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.

[14] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151, 2011.

[15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[16] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[17] Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, and Zhou Yu. A visual attention grounding neural model for multimodal machine translation. arXiv, 2018.

[18] Aizhan Imankulova, Masahiro Kaneko, Tosho Hirasawa, and Mamoru Komachi. Towards multimodal simultaneous neural machine translation. arXiv, 2020.

[19] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. 2018.

[20] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.

[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[23] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):394–407, 2018.

[24] Spandana Gella, Rico Sennrich, Frank Keller, and Mirella Lapata. Image pivoting for learning multilingual multimodal representations. In Empirical Methods in Natural Language Processing (EMNLP), 2017.

[25] Jônatas Wehrmann, Douglas M. Souza, Mauricio A. Lopes, and Rodrigo C. Barros. Language-agnostic visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 5804–5813, 2019.

[26] Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan A. Plummer. MULE: Multimodal universal language embedding. In AAAI Conference on Artificial Intelligence, 2020.

[27] Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A. Plummer. Learning to scale multilingual representations for vision-language tasks. arXiv preprint arXiv:2004.04312, 2020.

[28] Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. Multi30K: Multilingual English-German image descriptions. arXiv preprint arXiv:1605.00459, 2016.

[29] Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv preprint arXiv:1710.07177, 2017.

[30] Takashi Miyazaki and Nobuyuki Shimizu. Cross-lingual image caption generation. In ACL, 2016.
[31] Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. COCO-CN for cross-lingual image tagging, captioning and retrieval. IEEE Transactions on Multimedia, 2019.

[32] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[33] Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. STAIR Captions: Constructing a large-scale Japanese image caption dataset. arXiv preprint arXiv:1705.00823, 2017.

[34] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[35] Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, and Ming Zhou. XGPT: Cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473, 2020.

[36] Desmond Elliott and Ákos Kádár. Imagination improves multimodal translation. arXiv preprint arXiv:1705.04350, 2017.

[37] Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. LIUM-CVC submissions for WMT17 multimodal translation task. arXiv preprint arXiv:1707.04481, 2017.

[38] Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, and Zhou Yu. A visual attention grounding neural model for multimodal machine translation. arXiv, 2018.