Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh
DeepMind
{lmh, johnme, rgschneider, jalayrac, nematzadeh}@google.com
Abstract
Recently multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors which can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

1 Introduction

Significant progress in pretraining of natural language processing (NLP) models has been made through both architectural innovations (e.g., transformers; Vaswani et al., 2017) as well as a huge increase in the size of pretraining data and the model (e.g., Devlin et al., 2019; Brown et al., 2020). This success in language pretraining has inspired parallel multimodal vision–language efforts; in particular, multimodal image–language transformers, pretrained on large noisy image–text datasets, have achieved state-of-the-art results on a range of downstream tasks such as image retrieval, visual question answering, and visual reasoning (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020a,b).

However, even though many variants of multimodal image–language transformer models have been proposed recently, it is unclear how learned representations are impacted by the large amounts of pretraining data, the transformer architecture and self-attention, or their specific losses. We address this gap by first establishing a baseline that is trained on the same pretraining data as multimodal transformers but with a different architecture. We then perform an investigative analysis to better understand the extent to which these aspects contribute to models' performance.

Our evaluation mainly focuses on zero-shot tasks where evaluation data is taken from a dataset unseen during pretraining. Measuring zero-shot performance enables us to evaluate whether a pretrained model learns general representations. Previous work in NLP has considered probing classifiers to evaluate representations; however, this approach can be misleading as the performance of probing classifiers does not solely depend on the quality of representations (e.g., Hewitt and Liang, 2019; Voita and Titov, 2020). Similarly, evaluation after fine-tuning is a less direct measure of the strength of representations since performance on these tasks is highly dependent on the fine-tuning experimental set-up and the size of fine-tuning data (Yogatama et al., 2019).

We first study the importance of different properties of multimodal datasets such as their size and their noise level (i.e., how closely the language describes a given image's content). Recent work has introduced image–text datasets with different qualities, for example, noisy but very large ones (Sharma et al., 2018) as well as carefully-annotated but smaller ones (Pont-Tuset et al., 2019). Better understanding of what aspect of a dataset is more important can result in better task performance and also guide us in future dataset curation efforts.
We find that a dataset's size does not always predict multimodal transformers' performance; its noise level and language similarity to the evaluation task are both important contributing factors. We also show that multimodal transformers can achieve competitive results without relying on language-only or image-only pretraining for weight initialization or feature extraction.

We also dissect multimodal transformers' architecture, analyzing the effectiveness of different attention mechanisms, depth, and number of parameters. We show that multimodal attention, where both language and image transformers attend to each other, is crucial for these models' success. Multimodal attention achieves the best results when combined with multi-level (deep) interactions. Moreover, models with other types of attention (even with more depth or parameters) fail to achieve comparable results to shallower and smaller models with multimodal attention.

Additionally, inspired by the success of contrastive losses for self-supervised representation learning (e.g., van den Oord et al., 2018), we examine whether using a contrastive image–text matching loss instead of a classification one improves the quality of representations in our models. Surprisingly, we find that the choice of image–text matching loss does not matter much in multimodal transformers. On the other hand, models without multimodal attention (a multi-level "cross-talk" between modalities) benefit significantly from a contrastive loss.

Finally, we believe that advances in multimodal pretraining can have significant impacts on a wide range of downstream applications; however, it is important to form a clear understanding of how and why multimodal transformer models perform well to avoid overfitting to a set of downstream evaluation tasks. Our analysis of pretraining data, attention, and loss functions is an important step towards gaining a deeper understanding of these powerful models.

The success of transformer-based language models on a variety of language tasks (e.g., Devlin et al., 2019) has inspired similar multimodal efforts (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020a,b). The main distinction is that image-text multimodal transformers take image-text pairs as input, attend over both modalities, and are trained with additional losses. (We use the term multimodal transformers to refer to image–text transformer-based models; similar architectures are applied to other modalities such as videos (Sun et al., 2019) but are outside of the scope of this work.) Similar to the language models, multimodal transformers are often fine-tuned on downstream tasks, but multimodal ones, e.g., image retrieval (Young et al., 2014) or visual question answering (Goyal et al., 2017).

We give a brief overview of the BERT model (Devlin et al., 2019), which forms the backbone of multimodal transformers. The BERT architecture consists of a stack of transformer blocks (Vaswani et al., 2017) and has three main components. First, the input text is tokenized and three embedding functions are used to embed the token, its position in the sentence (i.e., positional encoding), and the sentence it belongs to. The final language embedding is a summation of these three vectors.
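To make the input representation concrete, the following sketch (our own illustration with assumed vocabulary and embedding sizes, not the paper's code) sums token, position, and segment embeddings as described above.

```python
# A minimal sketch (ours, with assumed sizes) of the BERT-style input embedding:
# the final embedding of a token is the sum of its token, position, and
# segment (sentence) embeddings.
import numpy as np

vocab_size, max_len, num_segments, d = 30000, 64, 2, 768
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, d))
position_table = rng.normal(size=(max_len, d))
segment_table = rng.normal(size=(num_segments, d))

def embed(token_ids, segment_ids):
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions] + segment_table[segment_ids]

# Toy usage: a 5-token input, all belonging to the first sentence/segment.
print(embed(np.array([101, 2023, 2003, 1037, 102]), np.array([0, 0, 0, 0, 0])).shape)
```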
Figure 1: The attention mechanisms we consider, illustrated by where each variant's queries (Q), keys (K), and values (V) come from: coattention, merged attention, asymmetric attention (language-query and image-query), and modality-specific attention.
Applying the transformer architecture to images involves defining "visual tokens" analogously to language tokens. Almost all image-text multimodal transformer models consider a bounding box from a pretrained object detection model to be a "visual token". Similar to the positional encodings in language models, for each visual token, the spatial position of each bounding box is also encoded. Although most multimodal transformers require training a supervised model (a detector) to extract bounding-box features, there are other possible ways to represent visual tokens; for example, Huang et al. (2020) bypass training a detector by using regions from a high-level feature map in an image classification network as visual tokens. We focus our studies on models which use bounding-box features as this reflects the majority of recent work, though we achieve comparable results when learning directly from images without a detector (or even a pretrained classifier) in Sec. 4.2.

Each transformer block consists of a multi-head attention module (Vaswani et al., 2017) that, for a given token embedding, produces a weighted representation of all other tokens in a sentence. This weighted representation is then combined with the input representation of the given token and is passed to the next layer. More specifically, for the token i at layer l, each attention head takes as input a key k_l^i, value v_l^i, and query q_l^i, which are computed by passing the representation from the previous layer h_{l-1}^i through a linear layer. The output of the attention module for token i is:

A(q_l^i, K_l, V_l) = \mathrm{softmax}\left( \frac{q_l^i K_l^\top}{\sqrt{d_k}} \right) V_l,    (1)

where d_k is the dimension of the key, and the matrices K_l and V_l contain all tokens' keys and values.

Given this definition, there are a few possible ways to implement multi-head attention over image and language modalities, as shown in Fig. 1. For a given query (from one modality), we can simply consider keys and values from all input tokens regardless of the modality type (e.g., Chen et al., 2020). We refer to this multimodal attention as merged attention since it simply merges inputs from the two modalities. Alternatively, given queries from one modality (e.g., image), keys and values can be taken only from the other modality (e.g., language). Following Lu et al. (2019), we refer to this multimodal attention as coattention. We also consider cases where this attention is asymmetric, i.e., queries are either from language or image, while keys and values are from image or language, respectively. We call these two attention types language-query attention and image-query attention. Another possibility is to consider single-modality transformers where queries, keys, and values all come from either the image or text modality; we refer to this attention as modality-specific attention, where each modality has its own multi-head attention. Single-modality transformers with modality-specific attention allow us to study the role of "cross-talk" between modalities in multimodal transformer models. We note that we use the term multimodal attention to refer to both merged attention and coattention, and discuss the importance of different attention types in Sec. 4.3.
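The sketch below is our own minimal illustration (single head, identity projections, toy shapes) of how the variants in Fig. 1 differ only in where a layer's queries, keys, and values come from; it is not the implementation used in our experiments.

```python
# A minimal sketch (not the authors' code) of the attention variants in Figure 1.
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention, Eq. (1), for one head."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def multimodal_layer(h_w, h_r, proj, variant):
    """h_w: word features (T, d); h_r: region features (N, d).
    `proj` maps features to queries/keys/values (identity here for brevity)."""
    q_w, k_w, v_w = proj(h_w), proj(h_w), proj(h_w)
    q_r, k_r, v_r = proj(h_r), proj(h_r), proj(h_r)
    if variant == "merged":            # keys/values pooled over both modalities
        k = np.concatenate([k_w, k_r]); v = np.concatenate([v_w, v_r])
        return attention(q_w, k, v), attention(q_r, k, v)
    if variant == "coattention":       # each modality attends to the other
        return attention(q_w, k_r, v_r), attention(q_r, k_w, v_w)
    if variant == "modality_specific": # no cross-talk between modalities
        return attention(q_w, k_w, v_w), attention(q_r, k_r, v_r)
    raise ValueError(variant)

# Toy usage: 4 word tokens and 3 image regions with 8-dim features.
h_w, h_r = np.random.randn(4, 8), np.random.randn(3, 8)
for variant in ("merged", "coattention", "modality_specific"):
    out_w, out_r = multimodal_layer(h_w, h_r, lambda x: x, variant)
    print(variant, out_w.shape, out_r.shape)
```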
Let r = {r_1, ..., r_N} be the N input image regions and w = {w_1, ..., w_T} be the T word tokens representing an image–text pair. A subset of input image regions and word tokens are masked (e.g., set to zero) before being passed through the transformer layers. After applying the mask, we refer to the unmasked image regions as r_m and to the unmasked word tokens as w_m. We use N_m and T_m to denote the set of image region and word token indices that are masked, respectively. Similar to the BERT model, the language loss is a masked-language modelling (MLM) loss:

-\sum_{t \in T_m} \log P^w_\theta(w_t \mid w_m, r_m),    (2)

where P^w_\theta corresponds to the output probability distribution over words in the vocabulary from the transformer model parameterized by \theta.

Most models also include an analogous masked region modelling (MRM) loss for images. One popular region modelling loss, for each bounding box, minimizes the KL-divergence between the predicted distribution over object classes and the distribution over classes obtained from a pretrained detector D(l | r_n) (e.g., Chen et al., 2020; Lu et al., 2019):

\sum_{n \in N_m} \mathrm{KL}\big( D(l \mid r_n) \,\|\, P^r_\theta(r_n \mid r_m, w_m) \big),    (3)

where P^r_\theta corresponds to the predicted probability distribution over object classes from the transformer model parameterized by \theta.

Finally, multimodal transformer models include an image–text matching (ITM) loss which predicts whether an image and text pair match; this is generally posed as a binary classification problem:

-y \log(\sigma(s_\theta(r_m, w_m))) - (1 - y) \log(1 - \sigma(s_\theta(r_m, w_m))),    (4)

where y is equal to 1 for positive pairs and 0 otherwise, s_\theta corresponds to the confidence score of the model that a pair (r, w) are matched, and \sigma is the sigmoid function. Recently, contrastive image–text matching losses have been successful in self-supervised representation learning (e.g., van den Oord et al., 2018); thus, we also explore whether a contrastive formulation of ITM can improve the performance of multimodal transformers and discuss the challenges of using these losses for multimodal transformer models. Our contrastive loss is formulated as:

-\log \frac{e^{s_\theta(r_m, w_m)}}{e^{s_\theta(r_m, w_m)} + \sum_{(\tilde{r}, \tilde{w}) \sim \mathcal{N}} e^{s_\theta(\tilde{r}_m, \tilde{w}_m)}},    (5)

where \mathcal{N} is a set of negative image-text pairs. Sec. 4.4 outlines our findings on loss ablations.
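As a concrete illustration of the two image–text matching objectives, the following sketch (ours, assuming the matching score s_θ has already been computed for each pair) contrasts the classification formulation in Eq. (4) with the contrastive formulation in Eq. (5).

```python
# A minimal sketch (ours, with assumed scalar scores) of the two ITM objectives.
import numpy as np

def itm_classification_loss(score, y):
    """Binary cross-entropy over a single pair, Eq. (4); y is 1 for a match, 0 otherwise."""
    p = 1.0 / (1.0 + np.exp(-score))                    # sigmoid
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def itm_contrastive_loss(score_pos, scores_neg):
    """Softmax over the positive pair and a set of negative pairs, Eq. (5)."""
    logits = np.concatenate([[score_pos], scores_neg])
    logits -= logits.max()                              # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Toy usage: one matched pair scored 2.1 and three sampled negatives.
print(itm_classification_loss(2.1, y=1))
print(itm_contrastive_loss(2.1, np.array([-0.3, 0.5, -1.2])))
```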
Here we outline the details of our experimental setup: the base multimodal transformer model used in most of our experiments, our baseline model, and the pretraining datasets.

Our base multimodal transformer model (MMT) most closely resembles the ViLBERT model (Lu et al., 2019). For text inputs, we first tokenize sentences using SentencePiece (Kudo and Richardson, 2018) and truncate sentences to a fixed length, with one length for pretraining datasets and another for datasets used to fine-tune and evaluate retrieval models. We then include a separator token.
Table 1: The pretraining datasets: the type and number of images and captions.
Our baseline scores an image I and sentence S with a model parameterized by θ which outputs a score s_θ indicating whether I and S match. We train our baseline model using the contrastive loss defined in Equation (5) with 1024 negative examples. The detector weights are fixed during training.

Conceptual Captions (CC) consists of over 3 million image-text pairs harvested from the web, where the caption corresponding to an image is its alt-text description (Sharma et al., 2018). Image–text pairs are filtered and preprocessed such that text is more image relevant than raw alt-text; however, the dataset is still "noisy" and includes pairs where the text is not relevant to the image's content. We were able to download 81% of the training set of CC; unless otherwise stated, we train our models on this subset of CC.

The SBU dataset (Ordonez et al., 2011) consists of 1 million image-text pairs sourced from Flickr with text taken from users' captions. As a result, similar to CC, not all text is image relevant. We also use datasets which were collected by asking annotators to describe images, resulting in more image relevant language, including the MSCOCO dataset (Chen et al., 2015) and Visual Genome (VG) (Krishna et al., 2017), which includes descriptions for bounding boxes in images. When using VG, we consider each bounding box description to be a caption for the entire image.

We also experiment with the Localized Narratives dataset (Pont-Tuset et al., 2019). This dataset includes rich annotations collected by asking users to describe an image while pointing to each part of the image being described (using their mouse). The resulting "narratives" often consist of multiple sentences. We break the narratives into individual sentences and treat each sentence as a caption paired with the image. We use the localized narratives collected for the Open Images (Kuznetsova et al., 2018) and MSCOCO datasets, and refer to them as OI-narratives and MSCOCO-narratives. This allows us to compare models which are trained with the same images (MSCOCO) with different language (MSCOCO captions vs. localized narratives). Table 1 provides an overview of our pretraining datasets.

Table 2: Number of images in evaluation tasks and whether datasets were used in a zero-shot (ZS) or fine-tuned (FT) setting.

Dataset      Train    Eval    Setting
Flickr30k                     ZS, FT
MSCOCO       n/a      5K      ZS
VQA          440K     210K    FT

Finally, we consider combining datasets using two sampling approaches: instance sampling, where we mix all datasets together and sample from this mix for each batch, and dataset sampling, where we sample evenly from datasets so that each batch contains the same number of examples from each dataset. For datasets with multiple captions, we first sample an image, then sample a caption for the given image. We combine all six datasets described here as well as the four datasets combined in Chen et al. (2020) (MSCOCO, VG, SBU, and Conceptual Captions), which we refer to as UNITER data.
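The following sketch (our own simplification, not the training pipeline) illustrates the difference between the two sampling strategies: with instance sampling a batch is dominated by the largest dataset, whereas dataset sampling gives every dataset the same number of examples per batch.

```python
# A minimal sketch (ours) of instance sampling vs. dataset sampling.
import random

def instance_sampling(datasets, batch_size):
    pooled = [ex for data in datasets.values() for ex in data]
    return random.sample(pooled, batch_size)

def dataset_sampling(datasets, batch_size):
    per_dataset = batch_size // len(datasets)
    batch = []
    for data in datasets.values():
        batch.extend(random.choices(data, k=per_dataset))
    return batch

# Toy usage with a large and a small dataset of (image id, caption) pairs.
datasets = {
    "cc": [("cc_img_%d" % i, "alt text") for i in range(3000)],
    "mscoco": [("coco_img_%d" % i, "caption") for i in range(100)],
}
print(len([x for x in instance_sampling(datasets, 32) if x[0].startswith("coco")]))
print(len([x for x in dataset_sampling(datasets, 32) if x[0].startswith("coco")]))
```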
We focus on zero-shot evaluation as it enables us to examine the representations without confounding our findings with the side-effects of fine-tuning (Yogatama et al., 2019) or probing classifiers (e.g., Zhang and Bowman, 2018; Hewitt and Liang, 2019). Following Lu et al. (2019) and Chen et al. (2020), we use the term zero-shot to refer to experiments where we test our models on a dataset different from our pretraining data without fine-tuning. For example, we use the MSCOCO dataset to test the models that are pretrained on Conceptual Captions. This is considered a zero-shot task since the properties of the dataset used for testing (for example, its language) differ from those in the pretraining dataset. We use zero-shot image retrieval tasks since image retrieval directly measures what our pretraining data and objectives encourage our models to learn: whether an image and a sentence are aligned.

We evaluate on the Flickr30k dataset (Young et al., 2014) (referred to as zero-shot Flickr) and use the splits defined in Karpathy and Fei-Fei (2015). We evaluate checkpoints after 1 million steps as well as when the loss on the CC validation set is lowest. When varying the pretraining data, our models sometimes overfit quickly on smaller datasets; as a result, we evaluate checkpoints every 100K steps. We select the best checkpoint according to zero-shot performance on Flickr30k val and use it for all other downstream tasks. We also report retrieval numbers on MSCOCO (Chen et al., 2015) (which we call zero-shot MSCOCO) using the splits of Karpathy and Fei-Fei (2015). Reported retrieval numbers are on the test split of datasets. Images in Flickr30k and MSCOCO are annotated with captions.

In addition to the zero-shot image retrieval tasks, we use the fine-tuned Flickr30k image-retrieval task to examine whether our observations transfer when fine-tuning the MMT model. We fine-tune our models for 10,000 steps and use MLM, MRM, and ITM losses. All results for image retrieval are reported using Recall@K (R@K), which measures whether the ground-truth image is among the top K images retrieved by our model.

When comparing pretraining datasets, we hypothesize that which pretraining dataset is best depends on the downstream task, so we additionally consider VQA (Antol et al., 2015; Goyal et al., 2017). To fine-tune for VQA, we replace the image–text matching loss with a 2-layer MLP and train with a binary cross-entropy loss against soft answer scores (Teney et al., 2018). We use similar hyper-parameters as when pretraining and report results on the validation set. We report the average score across 3 random initializations of the MLP.

We use Flickr IDs to filter out images appearing in the Flickr30k and MSCOCO validation/test sets from our pretraining sets. Conceptual Captions is not collected from Flickr, so we could not filter out images using this method. Table 2 provides an overview of our evaluation datasets.
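The Recall@K metric described above can be summarized with the following sketch (ours, with a placeholder scoring function standing in for the model).

```python
# A minimal sketch (ours) of Recall@K for image retrieval: for each caption,
# score all candidate images and check whether the ground-truth image is in the top K.
import numpy as np

def recall_at_k(score_fn, captions, images, gt_index, k=10):
    hits = 0
    for caption, gt in zip(captions, gt_index):
        scores = np.array([score_fn(image, caption) for image in images])
        topk = np.argsort(-scores)[:k]
        hits += int(gt in topk)
    return hits / len(captions)

# Toy usage with a random scoring function over 5 images and 3 captions.
rng = np.random.default_rng(0)
score_fn = lambda image, caption: rng.random()
print(recall_at_k(score_fn, ["a", "b", "c"], list(range(5)), gt_index=[0, 2, 4], k=3))
```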
We first compare MMT to a baseline and then investigate how pretraining data, attention, and loss functions impact model performance.

Table 3: Comparison of our proposed baseline to our multimodal transformer model (MMT). Columns report R@1 and R@10 for zero-shot (ZS) and fine-tuned (FT) Flickr30k and for zero-shot MSCOCO.

Model            Flickr30k ZS     Flickr30k FT     MSCOCO ZS
                 R1      R10      R1      R10      R1      R10
Baseline         25.4    64.9     40.9    81.8     13.0    44.5
 − contrastive   21.7    61.0     39.0    80.6     10.2    40.9
 + BERT PT       24.8    65.1     39.9    79.9     12.7    43.1
MMT
ViLBERT          31.9    72.8     58.2
We compare our multimodal transformer (MMT) against a strong baseline inspired by recent success in visual retrieval (e.g., Miech et al., 2018). To disentangle the effect of pretraining data and architecture, we investigate whether our baseline (described in Sec. 3.2), without multimodal attention or MLM and MRM losses but pretrained on the same data (i.e., Conceptual Captions) as multimodal transformers, produces competitive results. In Table 3, we compare MMT to our proposed baseline, verifying that MMT learns better representations not only because it is pretrained on a large dataset, but because of architectural choices. Our MMT results are on par with existing models trained with the same data: comparing to ViLBERT, the most similar model to ours, our R@1 on zero-shot Flickr is on par with ViLBERT's 31.9. As expected, retrieval numbers on zero-shot MSCOCO are lower than zero-shot Flickr because MSCOCO has more images in its evaluation set (see Table 2) and is therefore harder. On the fine-tuned image retrieval task, we achieve comparable performance to ViLBERT (whose R@1 is 58.2), even though we do not sample hard negatives when training. We emphasize that our goal is not to outperform existing work, but to build a strong multimodal transformer model to analyze the role of data, attention, and losses.

We verify that a contrastive loss (Eq. (5)) leads to stronger results than a classification one. As shown in Table 3, replacing the contrastive loss with a classification loss consistently decreases performance. Initializing our baseline with BERT weights marginally decreases performance, e.g., R@1 on zero-shot Flickr decreases by 0.6.

Figure 2: Effect of pretraining data. The datasets on the X axis are ordered based on their zero-shot Flickr scores. IS: Instance Sampling, DS: Dataset Sampling. (a) Zero-shot (ZS) and fine-tuned (FT) image retrieval (IR) recall on Flickr30k and MSCOCO. (b) Visual question answering (VQA v2) score.

We investigate how pretraining datasets, supervised image features, and weights from a pretrained language model impact our results.
Pretraining Datasets.
Fig. 2 reports our results when we pretrain the MMT on the individual and combined datasets introduced in Sec. 3.3. We observe that in all our tasks, larger datasets usually lead to better performance, but not always.
For example, SBU consistently performs worse than MSCOCO, despite being substantially larger.

Additionally, when combining datasets, how datasets are sampled matters. In our experiments, dataset sampling (DS) is more effective than instance sampling (IS). In dataset sampling, smaller datasets (like MSCOCO) will be sampled more frequently than in instance sampling. Since MSCOCO pretraining leads to good performance, more exposure to MSCOCO samples is beneficial. We consider combining all datasets as well as the datasets combined in UNITER (Chen et al., 2020). Fig. 3a shows that combining all datasets performs better than UNITER data on the zero-shot Flickr task, but not on zero-shot MSCOCO, showing that more data is not always better. On zero-shot MSCOCO the impact of the sampling mechanism is even more evident: given UNITER data, dataset sampling performs notably better than instance sampling.

Next, we compare datasets that have a similar number of images to investigate the role of the type of language used in each dataset. As an extreme example, MSCOCO and MSCOCO-narratives contain the same images, but the former does substantially better on our downstream tasks. To better understand this observation, we quantify the difference between the language of pretraining and evaluation datasets: we train a language model (a 6-layer Transformer) on a given pretraining dataset, and use that model to compute the perplexity of the evaluation dataset. For our three datasets with the same number of images (MSCOCO, MSCOCO-narratives, and VG), the perplexity of the evaluation dataset (Flickr or MSCOCO) explains their performance: the perplexities are the lowest on MSCOCO, then VG, and lastly on MSCOCO-narratives. This shows that the similarity between the language of pretraining and evaluation datasets is important.
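A sketch of this analysis (our own simplification; the actual language model is a 6-layer Transformer, whereas here `log_prob` is just an assumed interface):

```python
# A minimal sketch (ours) of using perplexity to measure how close the language
# of a pretraining dataset is to an evaluation dataset's captions. `log_prob`
# stands in for a language model trained on the pretraining dataset.
import math

def caption_perplexity(log_prob, captions):
    """Average per-token perplexity of evaluation captions under the language model."""
    total_log_prob, total_tokens = 0.0, 0
    for caption in captions:
        tokens = caption.lower().split()
        for i, token in enumerate(tokens):
            total_log_prob += log_prob(token, tokens[:i])
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)

# Toy usage with a uniform model over a 10k-word vocabulary (perplexity = 10000).
uniform_lm = lambda token, prefix: math.log(1.0 / 10000)
print(caption_perplexity(uniform_lm, ["a dog runs on the beach"]))
```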
However, not all performance differences are explained by the number of images or perplexity: pretraining on SBU results in poorer performance than OI-narratives on our downstream tasks, despite SBU having twice the number of images and lower perplexity on both evaluation datasets. We conjecture that SBU's poor performance is due to noise: SBU text is scraped from captions and may not match the images as well as the manually annotated text in OI-narratives. To investigate this, we calculate an overlap metric for an image–text pair as the ratio of text words overlapping with predicted bounding box labels. For each dataset, we calculate the average overlap across images, providing an approximation of how much the language describes the images in the dataset. The overlap is much lower for SBU compared to OI-narratives (0.14 vs. 0.25), showing that SBU is indeed noisier, which can decrease its utility for pretraining multimodal representations. (The overlap metric for the other datasets is: VG 0.82, MSCOCO 0.42, MSCOCO-narratives 0.27, and CC 0.11.)
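The overlap metric can be computed as in the sketch below (our own reading of the definition above; tokenization details are assumptions).

```python
# A minimal sketch (ours) of the overlap metric: the fraction of caption words
# that also appear among the object labels predicted for the image.
def overlap(caption, predicted_box_labels):
    words = caption.lower().split()
    labels = {w for label in predicted_box_labels for w in label.lower().split()}
    return sum(w in labels for w in words) / max(len(words), 1)

def dataset_overlap(pairs):
    """Average overlap over (caption, predicted labels) pairs from a dataset."""
    return sum(overlap(c, labels) for c, labels in pairs) / len(pairs)

# Toy usage: a caption that mentions two of the three detected objects (0.4).
print(overlap("a man riding a horse", ["man", "horse", "fence"]))
```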
Moreover, we observe that the goodness of a pretraining dataset for one task does not always transfer to a different task. For example, CC is a better pretraining dataset than VG when fine-tuning for image retrieval, but they perform similarly when fine-tuning for VQA, a substantially different task. In fact, we note that VQA performance varies less across pretraining datasets (e.g., CC, VG, and MSCOCO), likely because the VQA training split is large. We also observe differences between zero-shot and fine-tuned image retrieval: though MSCOCO performs better than OI-narratives on zero-shot Flickr, OI-narratives performs better after fine-tuning.

Finally, to visually illustrate the difference between the learned representations, we compare qualitative examples of models trained with our best two pretraining datasets: MSCOCO and CC (see Fig. 4). Though the model trained with MSCOCO retrieves examples with some semantic relevance, our model trained with CC is able to retrieve images with more correct details like "enjoying a view" and "black fleece jacket".

Figure 4: Comparing models trained with the MSCOCO and CC datasets. We provide the top-1 ranked retrieved image given an input query sentence on the Flickr val dataset. Correctly retrieved images are framed in green and the incorrect ones in red. Query sentences include "This is an image of two men during a match of karate and a display of fighting skills.", "A group of kids competing in a relay.", "Blond-haired man wearing a black fleece jacket sculpting.", and "A backpacker enjoying a view of nature."

Language-only Pretraining.
Many multimodal transformers initialize language weights from a pretrained BERT model. Similar to LXMERT, we find this hurts performance on our retrieval task: R@1 decreases on both zero-shot Flickr and zero-shot MSCOCO.

Image-only Pretraining.
The object detector used to extract image features is another source of modality-specific pretraining. We replace detection features with grid features taken from the last residual block of a ResNet-50 trained from scratch. Similarly to Huang et al. (2020), this model is trained without the MRM loss since features aggregate information in the whole image, and as a result, masking specific regions is not straightforward. (We fit images into a square by resizing and padding to preserve the aspect ratio; since the total stride of ResNet-50 is 32, the last feature map forms a grid, which we flatten into visual tokens and give as input along with the averaged features.) This model performs slightly better than our base MMT on zero-shot Flickr and comparably on zero-shot MSCOCO.
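A sketch of this alternative visual-token pipeline (ours; the input resolution and the use of torchvision are assumptions, while the stride-32 property of ResNet-50 is a known fact):

```python
# A minimal sketch (ours, not the paper's pipeline) of replacing detector-based
# visual tokens with grid features: the final ResNet feature map is flattened so
# that each spatial cell becomes one "visual token".
import torch
import torchvision

resnet = torchvision.models.resnet50()                          # no pretrained weights loaded here
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc

image = torch.randn(1, 3, 448, 448)        # assumed input size; ResNet-50 has total stride 32
feature_map = backbone(image)               # (1, 2048, 14, 14)
tokens = feature_map.flatten(2).transpose(1, 2)        # (1, 196, 2048): 196 visual tokens
pooled = feature_map.mean(dim=(2, 3))                   # averaged feature used as an extra input
print(tokens.shape, pooled.shape)
```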
Table 4: MMT trained with coattention (Co), merged attention (Merge), language-query attention (L-12 and L-24), image-query attention (I-12 and I-24) (the number indicates the number of attention heads), and modality-specific attention (Mod. Spec.). Results are R@1.
Though Huang et al. (2020) showed a detector can be replaced with an image classifier, we show that comparable results can be achieved without any image-only pretraining.

We conclude that careful consideration of pretraining datasets and their sampling methods is important for a model's performance: the level of noise and the type of language in a dataset can be more significant than its size. Finally, image-only and language-only pretraining are not crucial in training strong multimodal representations.

We explore the impact of the number of attention heads and coattention layers in our base multimodal transformer model before investigating the effect of different attention mechanisms.
Number of Heads and Layers.
We test the importance of the number of heads in multi-head attention when fixing the total number of parameters by comparing models trained with one head, 3 heads, and 12 heads with query/key sizes of 768, 256, and 64, respectively. Increasing the number of heads to 12 leads to an improvement (Figure 5b). Next, we vary the number of heads (6, 12, and 18) but fix the query/key size to 64. We observe that increasing the number of heads up to 12 still leads to an improvement, but a further increase results in poorer performance (see Figure 5c). Consistent with Lu et al. (2019), increasing the number of layers (Fig. 5a) helps up to a point, and then adding more layers degrades performance.
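The parameter-matched setup can be seen with a small back-of-the-envelope sketch (ours, assuming separate query, key, value, and output projections): the per-layer attention parameter count is unchanged as long as the number of heads and the per-head query/key size vary inversely.

```python
# A minimal sketch (ours) of the parameter-matched head comparison: with a fixed
# model width, attention parameters stay constant across (1, 768), (3, 256), (12, 64).
hidden_size = 768
for num_heads, head_dim in [(1, 768), (3, 256), (12, 64)]:
    # query, key, value, and output projections each map hidden_size <-> num_heads * head_dim
    params = 4 * hidden_size * num_heads * head_dim
    print(num_heads, head_dim, params)
```

By contrast, in the second experiment (6, 12, or 18 heads at a fixed head size of 64), the parameter count grows with the number of heads.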
Type of Attention Mechanism.
We perform an in-depth analysis of the different types of attention explained in Sec. 2.2 (see Table 4). We compare coattention with merged attention; these mechanisms both "combine" the image and language modalities, but coattention does so by taking keys/values and queries from opposite modalities, while merged attention shares keys and values across the modalities. When controlled for the number of parameters, coattention performs marginally better than merged attention. Both perform considerably better than asymmetric attention, in which attention queries are over one modality. The number of heads in an asymmetric attention is half of the equivalent coattention, so we experiment with asymmetric attention mechanisms with 12 heads (L-12, I-12) as well as 24 heads (L-24, I-24). Increasing the number of attention heads for the asymmetric attention improves results, but the gap between our best-performing model with asymmetric attention (L-24) and coattention is still quite large.

We also consider transformers with modality-specific attention, where there is no cross-talk between the modalities through attention, but the model has the same number of parameters as our MMT with coattention and is trained with the same losses (Table 4, Mod. Spec. column). This model performs substantially worse than MMT.

To better demonstrate the strength of multimodal attention compared to asymmetric and modality-specific attention, we compare our models in Table 4 to shallower and smaller models with coattention on the zero-shot Flickr task. Strikingly, our best-performing model without multimodal attention, with 24 attention heads and 12 layers (R@1 of 33.6; L-24 in Table 4), performs worse than the coattention model with only one head (Fig. 5b) or one multimodal layer (Fig. 5a).

Figure 6 shows example retrieval results comparing the asymmetric and modality-specific attention to our coattention mechanism.

Figure 6: Comparing top-1 ranked images retrieved with models trained with the different attention mechanisms on the Flickr dataset. Correctly retrieved images are framed in green and the incorrect ones in red. Query sentences include "A group of men work around a set of railroad tracks with heavy equipment.", "People are gathered on stage.", "A little girl plays with a miniature electric circuit consisting of three light bulbs and a battery.", "A person dressed as a court jester during a theatrical performance.", and "Someone in a lime green shirt is holding onto a tree."
When the coattention mechanism retrieves the incorrect image, the image frequently includes important content from the sentence (e.g., in Figure 6, lower left, the image shows "people gathered", but they are not on stage). Though other attention mechanisms retrieve images with some similarities to the text, the coattention mechanism retrieves fine details like "lime green shirt" and "miniature electric circuit".

A modality-specific transformer model is computationally more efficient than models with multimodal attention because image and language features can be computed once and reused across image–text pairs; this means that single-modality transformers are faster for retrieval and thus would be more appealing in large-scale applications if their accuracy were higher. We therefore investigate whether we can improve the single-modality transformer's poor performance by combining five modality-specific attention layers followed by one coattention layer to introduce multimodal interaction. This model is as deep as our MMT, but performs worse than our MMT with one coattention layer on both zero-shot Flickr and zero-shot MSCOCO.

We conclude that multimodal attention mechanisms, either coattention or merged attention, are a key component to multimodal transformers' success. Moreover, a shallow or small model with multimodal attention outperforms deeper models with an inferior attention mechanism yet more parameters. Finally, we show that a model's depth alone is not important; both multimodal attention and depth are needed for best performance.
Table 5: Zero-shot retrieval results (R@1) on models trained with different losses.

Losses             Flickr-ZS   COCO-ZS
MRM + ITM          20.2        9.7
MLM + ITM          41.1        22.4
MRM + MLM + ITM    41.9        21.3
We explore the degree to which the MLM, MRM, and ITM losses contribute to our MMT results. We then explore whether a contrastive formulation of the ITM loss – used commonly in self-supervised representation learning and important for our baseline – improves MMT's performance.
Comparing MLM, MRM, and ITM.
Table 5 shows the performance of our models with different combinations of the masked modelling losses and the image-text loss. With careful hyper-parameter tuning (in particular, decreasing the learning rate and using cosine decay instead of polynomial decay), we can remove the MRM loss during pretraining and achieve comparable performance on our image retrieval tasks. We found negligible difference when training our base MMT with the different hyper-parameters. We note that our multimodal transformer trained on pixels (Sec. 4.2) is also trained without a region modelling loss, yet performs similarly to our base MMT. Additionally, our finding is in line with the results of Li et al. (2020b), who achieve strong results without a region modelling loss.

Contrastive ITM Loss.
Contrastive losses (e.g., Eq. (5)) require sampling many negative examples to achieve good performance and thus can be computationally expensive (e.g., Tian et al., 2019; Miech et al., 2020). In models without multimodal attention (e.g., our baseline model), the computational cost is reduced by caching and reusing negative examples; in such models, since image and text input are processed independently, once image and text features are calculated, they can be considered as negatives for all other training examples in the batch. Due to their multimodal attention, multimodal transformers process image and text examples as pairs and thus cannot share image or text features across training examples. This limits the number of negatives available for these models to the maximum batch size that fits in memory. As a result, to study the role of a contrastive loss with a reasonable number of negatives, we consider our MMT with one multimodal layer. We also examine whether a model with only modality-specific attention (here, we use 6 image and 12 language layers) benefits from a contrastive loss, since it is easier to increase the negatives in a model without multimodal attention. In both models, we replace the image–text matching classification loss, Eq. (4), with a contrastive one, Eq. (5).

Table 6 compares the performance of a single-modality transformer trained with a classification loss to a model trained with a contrastive loss and 32 or 1024 negatives. We observe a notable improvement with the contrastive loss and adding more negatives. We next compare the performance of our one-layer MMT trained with a classification loss and a contrastive loss with 32 negatives (the maximum we could fit into memory). When training with the contrastive loss, we see no performance difference on zero-shot MSCOCO and a small performance degradation on zero-shot Flickr. This is surprising given the large body of research demonstrating the benefit of contrastive losses. We conclude that the multimodal attention and MLM loss can help the model learn better representations without relying on stronger image–text losses.
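The computational asymmetry discussed above can be sketched as follows (our own illustration, not the training code): a dual-encoder baseline gets in-batch negatives essentially for free from one similarity matrix, while a multimodal transformer needs a separate joint forward pass for every negative pair.

```python
# A minimal sketch (ours) of why negatives are cheap for a model without multimodal
# attention: image and text embeddings are computed independently, so every other
# example in the batch can serve as a negative via one matrix product.
import numpy as np

def contrastive_with_inbatch_negatives(img_emb, txt_emb):
    """img_emb, txt_emb: (B, d) embeddings from independent encoders."""
    scores = txt_emb @ img_emb.T                        # (B, B): diagonal = positives
    scores -= scores.max(axis=1, keepdims=True)
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

B, d = 8, 16
rng = np.random.default_rng(0)
print(contrastive_with_inbatch_negatives(rng.normal(size=(B, d)), rng.normal(size=(B, d))))

# A multimodal transformer processes each candidate pair jointly, so the number of
# negatives is limited by how many pairs fit in memory:
# scores = [joint_score(pair) for pair in positive_and_negative_pairs]
```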
Multimodal transformers are the first family of multimodal models to be pretrained on large data and applied to a range of different language and vision tasks (Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020b,a). The recent image-text transformers share the same backbone but have slight differences in data preprocessing and other architectural choices. Notably, the UNITER model (Chen et al., 2020) achieves state-of-the-art results on most existing image–language benchmarks by using a larger dataset and a number of different loss functions. Huang et al. (2020) removes the need for using image features (taken from a pretrained object detector) by training models on raw images (pixels). To combine image and text modalities, LXMERT (Tan and Bansal, 2019) and ViLBERT (Lu et al., 2019) propose coattention mechanisms, similar to the coattention originally proposed for VQA (Lu et al., 2016). In ViLBERT, feed-forward layers are applied after the coattention and self-attention layers, whereas in LXMERT, a feed-forward layer is only applied after the self-attention layer.

Table 6: R@1 with a classification ITM loss (cls) and contrastive ITM loss (con) for a MMT with one multimodal layer (MMT-1) and a model which only has modality-specific attention (MSA).

Model    Loss   Negatives   Flickr-ZS   COCO-ZS
MSA      Cls.   1           15.0        6.9
MSA      Con.   32          17.9        8.3
MSA      Con.   1024        19.7        9.5
MMT-1    Cls.   1           37.3        19.1
MMT-1    Con.   32          35.7        19.1

A few of our findings are similar to observations in prior work: (i)
LXMERT and ViLBERT show that more layers improve results, (ii) ViLBERT and UNITER show that more data boosts performance, and (iii) LXMERT shows that transferring BERT weights is not beneficial. In contrast to UNITER, we show that with the right hyper-parameters, the MRM loss is not needed.

Finally, while joint-space approaches to multimodal training are applied to multilingual data (Gella et al., 2017; Sigurdsson et al., 2020), all existing multimodal transformers are applied to English; an interesting future direction is to extend these models to other languages.
Analyzing multimodal transformers.
Recent analysis work (Singh et al., 2020; Cao et al., 2020) has shed light on different aspects of multimodal transformer models. Singh et al. (2020) study which pretraining data is best when fine-tuning two different multimodal transformer variants, ViLBERT (Lu et al., 2019) and VisualBERT (Li et al., 2019), on four fine-tuned tasks, whereas we mainly focus on a zero-shot retrieval task across a variety of pretraining datasets, architectural choices, and loss functions. Our results are complementary to this work: Singh et al. (2020) observe that dataset size is not the only factor for good performance and that pretraining datasets are better when they match the domain of a downstream task. We take a first step towards quantifying what it means for a pretraining dataset to be similar to a downstream task by analyzing the language used in the pretraining datasets and tasks (Section 4.2).

Cao et al. (2020) consider various probing methods on two models (UNITER (Chen et al., 2020) and LXMERT (Tan and Bansal, 2019)) to study what information is learned in pretrained models. Cao et al. (2020) show that while representations become more similar in the last layers of models with merged attention, in coattention models they are most similar at the first multimodal layer. They also observe that attention heads in merged attention models mostly focus on the language modality, that only a few heads are specialized for cross-modality processing, and that attention heads are able to capture some image-text alignment. Our comparison of merged attention and coattention is performed in a more controlled setting than the work of Cao et al. (2020) and Singh et al. (2020): they compare two models trained by different researchers that include many small differences other than the attention mechanism; in contrast, we compare the attention mechanisms in the same modeling framework.
We rigorously examined different aspects of training multimodal transformers (datasets, attention, and losses) that contribute to the quality of their learned representations. We focused on zero-shot image retrieval tasks to evaluate learned representations. Zero-shot tasks are advantageous because they directly measure what a model has learned and do not introduce confounds such as the size of a fine-tuning dataset and its experimental setup. At the same time, datasets do not always capture what they are designed to measure; e.g., Akula et al. (2020) show that models can do well on a referring expression task while ignoring the linguistic structure. Thus, we argue that designing and curating specialized zero-shot evaluation tasks and datasets is an important future direction which will allow us to better understand our models' limitations.

We find that the quality of language and the degree to which the language describes its corresponding image (noisiness) play an important role in our results. Moreover, language-only and image-only pretraining do not notably contribute to the performance of multimodal transformers. These findings suggest that curating less noisy image–text datasets is more important than relying on single-modality datasets. Previous work has successfully removed some of the noise in automatically-harvested datasets through preprocessing (e.g., Sharma et al., 2018), but such approaches are still limited in their robustness to noise, and the far from negligible degree of noise in large-scale real-world datasets (e.g., Ordonez et al., 2011; Miech et al., 2019) still poses a challenge. An alternative approach is to aim to remove this noise by designing models that better tap into statistical regularities of image–text pairs (e.g., Duygulu et al., 2002) and thus are more robust to noise.

We show that multimodal attention, where each modality is informed by both modalities, is crucial to these models' performance. Smaller models with multimodal attention outperform deeper models with no or other multi-head attention mechanisms. This suggests that we can potentially train smaller models (than the existing multimodal transformers) for a given task, especially when the pretraining data is chosen carefully. Moreover, with multimodal attention, we can achieve the best zero-shot retrieval results using a classification loss which uses only one negative example per image–text pair (compare to a contrastive loss with 16384 negatives used in Tian et al., 2019) and which also removes the need for mining more hard negatives (Faghri et al., 2017).

Additionally, we observe that comparable results can be achieved without the image (masked region modelling) loss in multimodal transformers. This suggests that our current models are not tapping into the useful signal in the image modality, presumably because of the image loss formulation. An interesting future direction is designing better generative pretraining losses for images; previous work shows that the choice of loss significantly impacts the quality of language representations (Voita and Titov, 2020).

Finally, we believe that examining why and how multimodal transformers perform so well can guide future work in more effectively measuring progress in learning rich visual-linguistic features.

Acknowledgements
We would like to thank Angeliki Lazaridou, Andrew Zisserman, Phil Blunsom, Laura Rimell, and Stephen Clark for helpful feedback and conversations throughout the development of this work. We would like to give special thanks to Aishwarya Agrawal for detailed comments and discussion on our initial paper draft. Finally, we would like to thank Sebastian Borgeaud and Cyprien de Masson d'Autume for providing a language-only BERT codebase.
References
Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, and Siva Reddy. 2020. Words aren't enough, their order matters: On the robustness of grounding visual referring expressions. arXiv preprint arXiv:2005.01655.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. arXiv preprint arXiv:2005.07310.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision (ECCV).

Mithun Chowdhury, Panda Rameswar, Evangelos Papalexakis, and Amit Roy-Chowdhury. 2018. Webly supervised joint embedding for cross-modal image-text retrieval. In ACM International Conference on Multimedia.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics (NAACL).

P. Duygulu, K. Barnard, J.F.G. Freitas, and D.A. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In European Conference on Computer Vision (ECCV), pages 97–112.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.

Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.

Spandana Gella, Rico Sennrich, Frank Keller, and Mirella Lapata. 2017. Image pivoting for learning multilingual multimodal representations. arXiv preprint arXiv:1707.07601.

Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. 2018. The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020. End-to-end learning of visual representations from uncurated instructional videos. In Computer Vision and Pattern Recognition.

Antoine Miech, Ivan Laptev, and Josef Sivic. 2018. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision, pages 2630–2640.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151.

Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. 2019. Connecting vision and language with localized narratives. arXiv preprint arXiv:1912.03098.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.

Gunnar A Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, and Andrew Zisserman. 2020. Visual grounding in video for unsupervised word translation. In Computer Vision and Pattern Recognition.

Amanpreet Singh, Vedanuj Goswami, and Devi Parikh. 2020. Are we pretraining it right? Digging deeper into visio-linguistic pretraining. arXiv preprint arXiv:2004.08744.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7464–7473.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Empirical Methods in Natural Language Processing.

Damien Teney, Peter Anderson, Xiaodong He, and Anton Van Den Hengel. 2018. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4223–4232.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. arXiv preprint arXiv:2003.12298.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In CVPR.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In