Cross-lingual Visual Pre-training for Multimodal Machine Translation
Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia
Imperial College London · Hacettepe University · Koç University · University of Sheffield · ADAPT - Dublin City University

Abstract
Pre-trained language models have been shown to substantially improve performance in many natural language tasks. Although the early focus of such models was single-language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
Introduction

Pre-trained language models (Peters et al., 2018; Devlin et al., 2019) have proven to be valuable tools for contextual representation extraction. Many studies have shown their effectiveness in discovering linguistic structures (Tenney et al., 2019), which is useful for a wide variety of NLP tasks (Talmor et al., 2019; Kondratyuk and Straka, 2019; Petroni et al., 2019). These positive results led to further exploration of (i) cross-lingual pre-training (Lample and Conneau, 2019; Conneau et al., 2020; Wang et al., 2020) through the use of multiple monolingual and parallel resources, and (ii) visual pre-training, where large-scale image captioning corpora are used to induce grounded vision & language representations (Lu et al., 2019; Tan and Bansal, 2019; Li et al., 2020a; Su et al., 2020; Li et al., 2020b). The latter is usually achieved by extending the masked language modelling (MLM) objective (Devlin et al., 2019) with auxiliary vision & language tasks such as masked region classification and image-sentence matching.

In this paper, we present the first attempt to bring together cross-lingual and visual pre-training. Our visual translation language modelling (VTLM) objective combines translation language modelling (TLM) (Lample and Conneau, 2019) with masked region classification (MRC) (Chen et al., 2020; Su et al., 2020) to learn grounded cross-lingual representations. Unlike most of the prior work, which uses classification- or retrieval-based downstream evaluation, we focus on the generative task of multimodal machine translation (MMT), where images accompany captions during translation (Sulubacak et al., 2020). Once pre-trained, we transfer the VTLM encoder to a Transformer-based (Vaswani et al., 2017) MMT model and fine-tune it for the MMT task. To our knowledge, this is also the first attempt at pre-training & fine-tuning for MMT, where the current state of the art mostly relies on training multimodal sequence-to-sequence systems from scratch (Calixto et al., 2016; Caglayan et al., 2016; Libovický and Helcl, 2017; Elliott and Kádár, 2017; Caglayan et al., 2017; Yin et al., 2020). Our findings highlight the effectiveness of cross-lingual visual pre-training: when fine-tuned on the English → German direction of the Multi30k dataset (Elliott et al., 2016), our MMT model surpasses our constrained MMT baseline in both BLEU and METEOR. Our code and dataset are available at https://hucvl.github.io/VTLM.

Visual Translation Language Modelling

We propose the Visual Translation Language Modelling (VTLM) objective to learn multimodal cross-lingual representations. In what follows, we first describe the TLM objective (Lample and Conneau, 2019) and then introduce the modifications required to extend it to VTLM.
Figure 1: The architecture of the proposed model: VTLM extends the TLM (Lample and Conneau, 2019) (left side of the dotted line) with regional image features. Masking is applied to both linguistic and visual tokens.
The TLM objective is based on Transformer networks and assumes the availability of parallel corpora during training. It defines the input $x$ as the concatenation of the $m$-length source language sentence $s^{(1)}_{1:m}$ and the $n$-length target language sentence $s^{(2)}_{1:n}$:

$$x = \left\langle s^{(1)}_1, \cdots, s^{(1)}_m, s^{(2)}_1, \cdots, s^{(2)}_n \right\rangle$$

For a given input, TLM follows Devlin et al. (2019) and selects a random set of input tokens $y = \{s^{(l)}_1, \ldots, s^{(l)}_k\}$ for masking. Let us denote the masked input sequence by $\tilde{x}$ and the ground-truth targets for the masked positions by $\hat{y}$. TLM employs the masked language modelling (MLM) objective to maximise the log-probability of the correct labels $\hat{y}$, conditioned on the masked input $\tilde{x}$:

$$\mathcal{L} = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \Pr(\hat{y} \mid \tilde{x}; \theta)$$

where $\theta$ denotes the model parameters. We keep the standard hyper-parameters for masking, i.e. 15% of the input tokens are randomly selected for masking; of these, 80% are replaced with the [MASK] token, 10% are replaced with random tokens from the vocabulary, and 10% are left intact.

VTLM extends the TLM by adding the visual modality alongside the translation pairs (Figure 1). Therefore, we assume the availability of sentence pair & image triplets and redefine the input as:

$$x = \left\langle s^{(1)}_1, \cdots, s^{(1)}_m, s^{(2)}_1, \cdots, s^{(2)}_n, v_1, \cdots, v_o \right\rangle$$

where $\{v_1, \cdots, v_o\}$ are features extracted from a Faster R-CNN model (Ren et al., 2015) pre-trained on the Open Images dataset (Kuznetsova et al., 2018), specifically the "faster_rcnn_inception_resnet_v2_atrous_oid_v4" model from TensorFlow. We extract convolutional feature maps from the $o = 36$ most confident regions and average-pool each of them to obtain a region-specific feature vector $v_i$. Each region $i$ is also associated with a detection label $\hat{v}_i$ provided by the extractor. Before encoding, the feature vectors and their bounding box coordinates are projected into the language embedding space.

The final model processes translation pairs and projected region features in a single-stream fashion (Su et al., 2020; Li et al., 2020a), and combines the TLM loss with the masked region classification (MRC) loss as follows:

$$\mathcal{L} = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \Pr(\{\hat{y}, \hat{v}\} \mid \tilde{x}; \theta)$$

Masking. The same random masking ratio is applied separately to both the language and visual streams, and $\hat{v}$ above now denotes the correct region labels for the masked feature positions. Different from previous work that zeroes out masked regions (Tan and Bansal, 2019; Su et al., 2020), VTLM replaces their projected feature vectors with the [MASK] token embedding. Although this choice is mostly practical, we hypothesise that using the same signal for both language and visual masking can be beneficial for grounding. Similar to textual masking, 10% of the random masking amounts to using regional features randomly sampled from all images in the batch, and the remaining 10% of regions are left intact.
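To make the joint masking concrete, below is a minimal PyTorch sketch of the two masking routines described above. It is our own illustration rather than the authors' released implementation; the special-token id, vocabulary size and tensor layouts are hypothetical.

```python
import torch

MASK_ID, VOCAB_SIZE = 4, 1000   # hypothetical [MASK] id and vocabulary size

def mask_text(tokens: torch.Tensor, p: float = 0.15):
    """BERT-style masking over a LongTensor of token ids: select ~15% of
    positions; of those, 80% become [MASK], 10% a random token, 10% intact."""
    tokens, targets = tokens.clone(), torch.full_like(tokens, -100)
    sel = torch.rand(tokens.shape) < p
    targets[sel] = tokens[sel]                    # predict only the masked slots
    roll = torch.rand(tokens.shape)
    tokens[sel & (roll < 0.8)] = MASK_ID          # 80%: [MASK]
    rnd = sel & (roll >= 0.8) & (roll < 0.9)      # 10%: random vocabulary token
    tokens[rnd] = torch.randint(5, VOCAB_SIZE, (int(rnd.sum()),))
    return tokens, targets                        # remaining 10%: unchanged

def mask_regions(feats: torch.Tensor, labels: torch.Tensor,
                 mask_emb: torch.Tensor, p: float = 0.15):
    """VTLM-style region masking on projected features (B, R, D): selected
    regions become the [MASK] token embedding (80%), a feature sampled from
    any image in the batch (10%), or stay intact (10%)."""
    B, R, D = feats.shape
    pool = feats.reshape(-1, D)                   # sampling pool: whole batch
    feats, targets = feats.clone(), torch.full_like(labels, -100)
    sel = torch.rand(B, R) < p
    targets[sel] = labels[sel]                    # MRC targets: object labels
    roll = torch.rand(B, R)
    feats[sel & (roll < 0.8)] = mask_emb          # 80%: [MASK] embedding
    swap = sel & (roll >= 0.8) & (roll < 0.9)     # 10%: random batch region
    feats[swap] = pool[torch.randint(0, B * R, (int(swap.sum()),))]
    return feats, targets
```

The masked token and region targets can then be scored jointly with a single cross-entropy over the concatenated sequence, matching the combined TLM + MRC objective above.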
Pre-training data. VTLM requires a three-way parallel multimodal corpus, which does not exist at large scale. To address this, we extend the Conceptual Captions (CC) dataset (Sharma et al., 2018) with German translations. CC is a large-scale collection of ∼3.3M alt-text captions in English. The translation of the English captions into German was performed automatically using an existing NMT model (Ng et al., 2019), namely the transformer.wmt19.en-de model provided in the Fairseq toolkit (Ott et al., 2019). Since some of the images are no longer accessible, the final corpus is somewhat smaller than the original CC.
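As an illustration of how such a translated corpus could be produced, the snippet below loads the same WMT19 En→De model through fairseq's torch.hub interface and translates a couple of captions. The checkpoint file name follows the fairseq README, and the decoding settings are our assumptions; the paper does not state them.

```python
import torch

# Load the WMT19 En->De model via fairseq's torch.hub entry point;
# 'model1.pt' is one member of the released ensemble.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt',
                       tokenizer='moses', bpe='fastbpe')
en2de.eval()

captions = ['A bus stop near a train station.',
            'A dog is running across a field.']
# Beam size 5 is illustrative only.
german = [en2de.translate(c, beam=5) for c in captions]
```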
Settings. We use a small version of the TLM (Lample and Conneau, 2019) and set the model dimension, feed-forward layer dimension, number of layers and number of attention heads to d = 512, f = 2048, l = 6 and h = 8, respectively. We randomly initialise the model parameters instead of using pre-trained LM checkpoints such as BERT or XLM; our implementation is based on the XLM codebase (https://github.com/facebookresearch/XLM). We use Adam (Kingma and Ba, 2014) as the optimiser and apply dropout (Srivastava et al., 2014) in all layers. Pre-training is run on a single RTX2080-Ti GPU, and the best checkpoints are selected with respect to validation set accuracy.
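For orientation, the encoder size above corresponds to the following stand-alone PyTorch configuration; this is only a size-for-size stand-in, since the actual model follows the XLM codebase, and the dropout rate here is simply the torch default.

```python
import torch.nn as nn

# d = 512, feed-forward = 2048, 6 layers, 8 heads, as in the settings above.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)
```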
Experiments

Our experimental protocol consists of initialising the encoder and the decoder of Transformer-based NMT and MMT models with weights from TLM/VTLM, and fine-tuning them with a smaller learning rate. The architectural difference between the NMT and MMT models is that the latter encodes regional visual features as part of the source sequence, similar to VTLM. We also compare against the same NMT and MMT models trained from scratch. For the fine-tuning experiments, we train three runs with different seeds. For evaluation, we use the models with the lowest validation set perplexity to decode translations with a beam size of 8.
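A minimal sketch of the transfer step, assuming the NMT/MMT parameter names mirror those in the pre-trained checkpoint; the name-matching scheme is our own simplification of what an XLM-based implementation would do.

```python
import torch

def transfer_pretrained(pretrained_path: str, model: torch.nn.Module):
    """Initialise an NMT/MMT model from a TLM/VTLM checkpoint before
    fine-tuning: copy every parameter whose name and shape match, and
    leave the rest (e.g. newly introduced layers) at their random init."""
    ckpt = torch.load(pretrained_path, map_location='cpu')
    own = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    model.load_state_dict(own)
    return sorted(matched)   # report which parameters were transferred
```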
Dataset. We use the standard MMT corpus Multi30k (Elliott et al., 2016) for both the fine-tuning and from-scratch runs. It contains 30k image descriptions from Flickr30k (Young et al., 2014) and their human translations into German for training, along with three test sets of 1K samples each: the original 2016 test set, which is the most in-domain, as well as the 2017 and COCO test sets, which were created using images and descriptions collected from sources other than Flickr.
Settings. For fine-tuning, we use the same hyper-parameters as in the pre-training phase, apart from decreasing the learning rate. For MT models that are trained from scratch, we increase the dropout rate and linearly warm up the learning rate during the first 4,000 iterations; inverse square-root annealing is applied after 4,000 iterations.
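The warm-up and annealing schedule can be written as a single function of the step count, as sketched below. Since the initial and peak learning rates are not recoverable from this copy of the paper, they are left as arguments rather than hard-coded.

```python
def lr_schedule(step: int, peak_lr: float, init_lr: float,
                warmup: int = 4000) -> float:
    """Linear warm-up from init_lr to peak_lr over the first `warmup` steps,
    then inverse square-root annealing (fairseq-style)."""
    if step < warmup:
        return init_lr + (peak_lr - init_lr) * step / warmup
    return peak_lr * (warmup / step) ** 0.5
```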
Results

Table 1 reports METEOR and BLEU scores across the three test sets of Multi30k. First, we observe that the MMT system trained from scratch is consistently worse than its NMT counterpart. However, the gap disappears when pre-trained TLM/VTLM checkpoints are fine-tuned for MT. This suggests that pre-training may be necessary for single-stream multimodal encoding, where the number of regions (36) outnumbers the average number of source tokens in Multi30k. Second, we see that the best performances are obtained when models are first pre-trained on the three-way parallel Conceptual Captions (CC) dataset. To validate this further, we train a baseline NMT on the concatenation of Multi30k and CC (NMT+CC) and an MMT that uses only Multi30k for both pre-training and fine-tuning. The results clearly show that these systems lag behind the ones pre-trained on CC.

We also experimented with an alternative pre-training strategy where we do not mask visual regions. Interestingly, this alternative MMT in Table 1 reveals that not masking visual regions during pre-training yields slightly better results overall. This is equivalent to letting the model predict the object labels from a multimodal input where words are stochastically masked but regional features are kept intact. Overall, MMT fine-tuning on VTLM sets a new state of the art across all Multi30k test sets. We leave the exploration of visual region masking for the MRC task as future work and proceed with this alternative variant in the following experiments.
| System | 2016 METEOR | 2016 BLEU | 2017 METEOR | 2017 BLEU | COCO METEOR | COCO BLEU |
|---|---|---|---|---|---|---|
| Best RNN-MMT (Caglayan, 2019) | 58.7 | 39.4 | 52.9 | 32.6 | – | – |
| Graph-based Transformer MMT (Yin et al., 2020) | 57.6 | 39.8 | 51.9 | 32.2 | 37.6 | 28.7 |
| Ensemble RNN-MMT (Delbrouck and Dupont, 2018) | 59.6 | 40.3 | – | – | – | – |
| Unconstrained Transformer MMT (Helcl et al., 2018) | 59.1 | 42.7 | – | – | – | – |
| Our baseline Transformers (from scratch): NMT | | | | | | |
| Our baseline Transformers (from scratch): NMT+CC | | | | | | |
| Our baseline Transformers (from scratch): MMT | | | | | | |
| VTLM (pre-train and fine-tune on Multi30k): MMT | | | | | | |
| TLM (pre-train on CC, fine-tune on Multi30k): NMT | | | | | | |
| TLM (pre-train on CC, fine-tune on Multi30k): MMT | | | | | | |
| VTLM (pre-train on CC, fine-tune on Multi30k): NMT | | | | | | |
| VTLM (pre-train on CC, fine-tune on Multi30k): MMT | | | | | | |
| VTLM alternative (0% visual masking during pre-training): MMT | | | | | | |

Table 1: Quantitative comparison of experiments. Where a mean and standard deviation are reported, the single numbers appearing above them denote the maximum across three different runs.
Encoder attention parameters. When fine-tuning the TLM for MT, the default XLM implementation randomly initialises the decoder's missing encoder attention parameters. In our experiments, we noticed that copying those parameters from the TLM self-attention layers instead substantially improves the results. We exclude Grönroos et al. (2018) from Table 1 as their improvements (45.5 BLEU) were not due to multi-modality but rather to other modifications such as heavy parallel data augmentation, domain fine-tuning, and ensembling.
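A sketch of this initialisation trick, assuming fairseq/XLM-style parameter names ('self_attn' / 'encoder_attn'); the exact names in the XLM codebase may differ.

```python
import torch

def copy_cross_attn_from_self_attn(decoder_state: dict) -> dict:
    """Instead of leaving the decoder's encoder-attention (cross-attention)
    layers randomly initialised, copy the weights of the corresponding
    pre-trained self-attention layers."""
    updated = dict(decoder_state)
    for name, tensor in decoder_state.items():
        if '.self_attn.' in name:
            twin = name.replace('.self_attn.', '.encoder_attn.')
            if twin in decoder_state:
                updated[twin] = tensor.clone()
    return updated
```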
Analysis

Here, we evaluate the extent to which the visual information is taken into account (i) when TLM/VTLM predicts masked tokens, and (ii) when the fine-tuned NMT and MMT models are forced to translate source sentences with missing visual entities. For the latter, we use Flickr30k Entities (Plummer et al., 2015) to mask head nouns in the 2016 test set sentences, similar to Caglayan et al. (2019).
Last-word masking. In this experiment, we measure the target word prediction accuracy when the last tokens of the input caption pairs are systematically masked during evaluation. We pre-process the sentences to ensure that they do not end with punctuation marks, which would otherwise make the task easy whenever the masked token is punctuation.
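The probe itself is simple to reproduce; below is a sketch with hypothetical example captions, covering the "BOTH" condition where the last word is masked on both sides of the pair.

```python
MASK = '[MASK]'

def mask_last_word(sentence: str) -> str:
    """Replace the final token of a caption with [MASK]; captions are assumed
    to be pre-processed so they do not end with punctuation."""
    tokens = sentence.strip().split()
    tokens[-1] = MASK
    return ' '.join(tokens)

en = 'a man is riding a horse'
de = 'ein mann reitet auf einem pferd'
pair = (mask_last_word(en), mask_last_word(de))
# ('a man is riding a [MASK]', 'ein mann reitet auf einem [MASK]')
```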
| | VALID: EN | VALID: DE | VALID: BOTH | TEST: EN | TEST: DE | TEST: BOTH |
|---|---|---|---|---|---|---|
| TLM | | | | | | |
| VTLM | | | | | | |
| +shuf | | | | | | |

Table 2: Masked last-word prediction accuracies: VTLM gains are with respect to TLM, whereas the incongruent (+shuf) drops are relative to VTLM.

| | MASK | REMOVE |
|---|---|---|
| TLM → NMT | | |
| TLM → MMT | | |
| VTLM → NMT | | |
| VTLM → MMT | | |

Table 3: Entity masking on the 2016 test set: results are BLEU averages of three fine-tuned MT systems.

Table 2 suggests that the visual information is much more helpful (i.e. up to 6% accuracy improvement) when the last tokens are masked in both the English and German captions. However, if one caption is available, it provides enough context for cross-lingual prediction. Finally, when we shuffle (+shuf) the test set features to introduce incongruence (Elliott, 2018), the VTLM model deteriorates substantially. This confirms that the accuracy improvements are not due to side-effects of experimentation noise, such as regularisation or random-seed-related effects.
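The incongruent (+shuf) condition can be simulated by permuting the image features across the test set, as sketched below; a stricter setup would enforce a derangement so that no sentence keeps its own image.

```python
import torch

def shuffle_features(feats: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Pair each sentence pair with a (very likely) wrong image by permuting
    the feature tensor (N, R, D) along the test-set dimension, following the
    incongruent-decoding idea of Elliott (2018)."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(feats.size(0), generator=g)
    return feats[perm]
```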
Entity masking in MT. We devise two ways of masking entities: we either replace them with the [MASK] token or remove them entirely so that the masking phenomenon is not known to the model. The results in Table 3 show that MMT models can recover the missing source context to some extent, but only when they are pre-trained using the proposed VTLM objective. In other words, the grounding ability can only be acquired when the visual modality is present for both pre-training and fine-tuning. The gap between MASK and REMOVE also seems to highlight the importance of reserving a source position even when it is corrupted/masked.
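A sketch of the two corruption modes, assuming head-noun spans are given as (start, end) token offsets; this span format is illustrative, not the Flickr30k Entities file format.

```python
def corrupt_entities(tokens, spans, mode='mask'):
    """Either replace each entity span with [MASK] ('mask') or delete it
    entirely ('remove'). `spans` is a list of non-overlapping (start, end)
    token offsets."""
    starts = {s: e for s, e in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            if mode == 'mask':
                out.append('[MASK]')
            i = starts[i]          # skip to the end of the entity span
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = 'a man on a horse'.split()
print(corrupt_entities(tokens, [(1, 2)], 'mask'))    # a [MASK] on a horse
print(corrupt_entities(tokens, [(1, 2)], 'remove'))  # a on a horse
```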
Cross-attention analysis. Here, we take the MMT decoder's cross-attention layers and measure the attention mass they attribute to the regional features in the input embeddings. Although the encoder's self-attention layers produce increasingly mixed contextual embeddings as we move towards the top layers, Brunner et al. (2020) show that the final layer states still encode the corresponding input embeddings to some extent. With this assumption at hand, Figure 2 shows the average attention mass attributed to the first 36 (visual) top-layer encoder states, by each cross-attention layer in the decoder. We find these results to be in agreement with the quantitative metrics (Table 1), with VTLM-MMT assigning substantially more attention to these positions than TLM-MMT and the MMT trained from scratch.

Figure 2: Cross-attention mass over the visual portion of input sequences, averaged across the 2016 test set.
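A sketch of the measurement, assuming the decoder exposes its cross-attention weights and that the 36 region encodings occupy the first positions of the source sequence (the layout is illustrative).

```python
import torch

def visual_attention_mass(attn: torch.Tensor, n_regions: int = 36) -> float:
    """Given cross-attention weights of shape (batch, heads, tgt_len, src_len)
    from one decoder layer, return the average probability mass placed on
    the visual (region) positions of the source sequence."""
    mass = attn[..., :n_regions].sum(dim=-1)   # mass per head and target step
    return mass.mean().item()
```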
Conclusions

We proposed a novel cross-lingual visual pre-training approach and tested its efficacy for multimodal machine translation. Our pre-training approach extends the TLM framework (Lample and Conneau, 2019) with regional features and performs masked language modelling and masked region classification on a three-way parallel corpus. We show that this leads to substantial improvements compared to multimodal machine translation with cross-lingual pre-training only or without any pre-training. As future work, we plan to explore more informed masking strategies for visual regions and to investigate the impact of the visual masking probability in the MRC pre-training task on downstream MMT performance.
Acknowledgments
This work was supported in part by a TUBA GEBIP fellowship awarded to Erkut Erdem, and by the MMVC project funded by TUBITAK and the British Council via the Newton Fund Institutional Links grant programme (grant IDs 219E054 and 352343575). Lucia Specia, Pranava Madhyastha and Ozan Caglayan also received support from MultiMT (H2020 ERC Starting Grant No. 678017).

References
Gino Brunner, Yang Liu, Damian Pascual Ortiz, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On identifiability in Transformers. In ICLR.

Ozan Caglayan. 2019. Multimodal Machine Translation. PhD thesis, Université du Maine.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432–439, Copenhagen, Denmark. Association for Computational Linguistics.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. 2016. Does multimodality help human and machine for translation and image captioning? In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 627–633, Berlin, Germany. Association for Computational Linguistics.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota. Association for Computational Linguistics.

Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 634–638, Berlin, Germany. Association for Computational Linguistics.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In Computer Vision – ECCV 2020, pages 104–120, Cham. Springer International Publishing.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jean-Benoit Delbrouck and Stéphane Dupont. 2018. UMONS submission for WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 643–647, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978, Brussels, Belgium. Association for Computational Linguistics.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, and Raúl Vázquez. 2018. The MeMAD submission to the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 603–611, Brussels, Belgium. Association for Computational Linguistics.

Jindřich Helcl, Jindřich Libovický, and Dušan Variš. 2018. CUNI system for the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 616–623, Brussels, Belgium. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pages 11336–11344. AAAI Press.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision – ECCV 2020, pages 121–137, Cham. Springer International Publishing.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, pages 1–51.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.

Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, and Jiebo Luo. 2020. A novel graph-based multi-modal fusion encoder for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3025–3035, Online. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.