VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang♥†, Xiujun Li♥♠†, Xiaowei Hu♥, Jianwei Yang♥, Lei Zhang♥, Lijuan Wang♥, Yejin Choi♠, Jianfeng Gao♥

♥ Microsoft Corporation   ♠ University of Washington   † Equal contribution

January 5, 2021
Abstract
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.
1 Introduction

Vision language pre-training (VLP) has proved effective for a wide range of vision-language (VL) tasks [26, 36, 4, 34, 20, 19, 45, 21]. VLP typically consists of two stages: (1) an object detection model is pre-trained to encode an image and the visual objects in the image to feature vectors, and (2) a cross-modal fusion model is pre-trained to blend text and visual features. While existing VLP research focuses mainly on improving the cross-modal fusion model, this paper focuses on improving the object-centric visual representations and presents a comprehensive empirical study to demonstrate that visual features matter in VL models.

Among the aforementioned work, a widely-used object detection (OD) model [2] is trained on the Visual Genome dataset [16]. The OD model provides an object-centric representation of images, and has been used in many VL models as a black box. In this work, we pre-train a large-scale object-attribute detection model based on the ResNeXt-152 C4 architecture (short as X152-C4). Compared to the OD model of [2], the new model is better-designed for VL tasks, and is bigger and trained on much larger amounts of data, combining multiple public object detection datasets, including COCO [25], OpenImages (OI) [17], Objects365 [31] and Visual Genome (VG) [16]. As a result, our OD model achieves much better results on a wide range of VL tasks, as shown in Table 1. Compared to other typical OD models, such as X152-FPN trained on OpenImages, our new model can encode a more diverse collection of visual objects and concepts (e.g., producing visual representations for a much richer set of object and attribute categories), as illustrated by an example in Figure 1.

Table 1: Uniform improvements on seven VL tasks (VQA, GQA, Image Captioning, NoCaps, Image Retrieval, Text Retrieval, NLVR2) by replacing visual features from Anderson et al. [2] with ours. The NoCaps baseline is from VIVO [9], and our results are obtained by directly replacing the visual features. The baselines for the rest of the tasks are from OSCAR [21], and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are of BERT-Base size. As analyzed in Section 5.2, the new visual features contribute 95% of the improvement.

Figure 1: Predictions from an X152-FPN model trained on OpenImages (Left) and our X152-C4 model trained on four public object detection datasets (Right). Our model contains much richer semantics, such as richer visual concepts and attribute information, and the detected bounding boxes cover nearly all semantically meaningful regions. Compared with those from the common object classes in typical OD models (Left), the rich and diverse region features from our model (Right) are crucial for vision-language tasks. For concepts detected by both models, e.g., "boy", attributes from our model offer richer information, e.g., "young barefoot shirtless standing surfing smiling little playing looking blond boy". There are object concepts that are detected by our model but not by the OpenImages model, including fin, wave, foot, shadow, sky, hair, mountain, water, (bare, tan, light, beige) back, (blue, colorful, floral, multi colored, patterned) trunk, sand, beach, ocean, (yellow, gold) bracelet, logo, hill, head, and (black, wet) swim trunks. Compared to the R101-C4 model of [2], our model produces more accurate object-attribute detection results and better visual features for VL applications; see Appendix A for the full pictures and predictions from [2].

To validate the effectiveness of the new OD model, we pre-train a Transformer-based cross-modal fusion model OSCAR+ [21] on a public dataset consisting of 8.85 million text-image pairs, where the visual representations of these images are produced by the new OD model and are fixed during OSCAR+ pre-training.
We then fine-tune the pre-trained OSCAR+ for a wide range of downstream tasks, including VL understanding tasks such as VQA [8], GQA [13], NLVR2 [35], and COCO text-image retrieval [25], and VL generation tasks such as COCO image captioning [25] and NoCaps [1]. Our results show that the object-centric representations produced by the new OD model significantly improve the performance across all the VL tasks, often by a large margin over strong baselines using the classical OD model [2], creating new state-of-the-art results on all these tasks, including GQA, on which none of the published pre-trained models had surpassed the deliberately designed neural state machine (NSM) [12]. We will release the new OD model to the research community.

The main contributions of this work can be summarized as follows: (i) We present a comprehensive empirical study to demonstrate that visual features matter in VL models. (ii) We have developed a new object detection model that can produce better visual features of images than the classical OD model [2] and substantially uplifts the state-of-the-art results on all major VL tasks across multiple public benchmarks. (iii) We provide a detailed ablation study of our pre-trained object detection model to investigate the relative contribution to the performance improvement due to different design choices regarding diversity of object categories, visual attribute training, training data scale, model size, and model architecture.
Deep learning-based VL models typically consist of two modules: an image understanding module Vision and a cross-modal understanding module VL:

(q, v) = Vision(Img),   y = VL(w, q, v),     (1)

where Img and w are the inputs of the vision and language modalities, respectively. The output of the Vision module consists of q and v. q is the semantic representation of the image, such as tags or detected objects, and v is the distributional representation of the image in a high-dimensional latent space, represented using e.g. the box or region features produced by a VG-pre-trained Faster R-CNN model [2]. (We use the terms region and box interchangeably.) Most VL models use only the visual features v, while the recently proposed OSCAR [21] model shows that q can serve as anchors for learning better vision-language joint representations and thus can improve the performance on various VL tasks. w and y of the VL module in Equation (1) vary among different VL tasks. In VQA, w is a question and y is an answer to be predicted. In text-image retrieval, w is a sentence and y is the matching score of a sentence-image pair. In image captioning, w is not given and y is a caption to be generated.

Inspired by the great success of pre-trained language models on various natural language processing tasks, vision-language pre-training (VLP) has achieved remarkable success in improving the cross-modal understanding module VL by (1) unifying vision and language modeling in VL with Transformers and (2) pre-training the unified VL with large-scale text-image corpora. However, most recent works on VLP treat the image understanding module Vision as a black box and leave the visual feature improvement untouched since the development of the classical OD model [2] three years ago, despite the fact that there has been much research progress on improving object detection by (1) developing much more diverse, richer, and larger training datasets (e.g., OpenImages and Objects365), (2) gaining new insights in object detection algorithms such as feature pyramid networks [23], one-stage dense prediction [24], and anchor-free detectors [37], and (3) leveraging more powerful GPUs for training bigger models.

In this work, we focus on improving Vision for better visual representations. We developed a new OD model by enriching the visual object and attribute categories, enlarging the model size, and training on a much larger corpus that combines multiple public object detection datasets, as described in the rest of this section; we describe OSCAR+ for VL pre-training in Section 3.
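To make the decomposition in Equation (1) concrete, the following is a minimal PyTorch-style sketch of the two-module interface. The class names, tensor shapes, and the dummy detector output are illustrative assumptions, not the released VinVL code.

```python
# Minimal sketch of the two-module decomposition in Eq. (1):
#   (q, v) = Vision(Img),   y = VL(w, q, v)
# Names and shapes are illustrative assumptions, not the actual VinVL implementation.
import torch
import torch.nn as nn

class VisionModule(nn.Module):
    """Stand-in for the pre-trained object-attribute detector."""
    def __init__(self, feat_dim=2048, num_regions=50):
        super().__init__()
        self.feat_dim, self.num_regions = feat_dim, num_regions

    def forward(self, img):
        # A real detector would return detected tags (q) and region features (v).
        q = ["surfboard", "boy", "wave"]                  # detected object names (text)
        v = torch.randn(self.num_regions, self.feat_dim)  # region features
        return q, v

class VLModule(nn.Module):
    """Transformer fusion over word, tag and region embeddings."""
    def __init__(self, hidden=768, feat_dim=2048, vocab=30522):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, hidden)
        self.img_proj = nn.Linear(feat_dim, hidden)   # plays the role of matrix W in Section 3
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, w_ids, q_ids, v):
        text = self.word_emb(torch.cat([w_ids, q_ids], dim=1))
        img = self.img_proj(v).unsqueeze(0)
        h = self.encoder(torch.cat([text, img], dim=1))
        return h[:, 0]   # [CLS]-like representation consumed by task-specific heads

vision, fusion = VisionModule(), VLModule()
q, v = vision(img=None)
w_ids = torch.randint(0, 30522, (1, 8))    # tokenized question/caption w
q_ids = torch.randint(0, 30522, (1, 4))    # tokenized tags q
y = fusion(w_ids, q_ids, v)
print(y.shape)  # torch.Size([1, 768])
```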
To improve the OD model for VL tasks, we utilize four public object detection datasets. As most datasets do not have attribute annotations, we adopt a pre-training and fine-tuning strategy to build our OD model. We first pre-train an OD model on a large-scale corpus consisting of four public datasets, and then fine-tune the model with an additional attribute branch on Visual Genome, making it capable of detecting both objects and attributes.
Data.
Table 2 summarizes the statistics of the four public datasets used in our object detection pre-training, including COCO, OpenImagesV5 (OI), Objects365V1, and Visual Genome (VG). These datasets have complementary characteristics, and are extremely unbalanced in terms of data size, object vocabulary, and the number of annotations in each class. For example, the VG dataset has a rich and diverse set of annotations for both objects and their attributes with an open vocabulary, but its annotations are noisy and suffer from the missing-annotation problem. The COCO dataset, on the other hand, is very well annotated, but the coverage of visual objects and attributes is much lower than that in VG, although we use both its 80 object classes and 91 stuff classes to include as diverse visual concepts as possible. We take the following steps to build a unified corpus by combining the four datasets.

1. First of all, to enhance visual concepts of tail classes, we perform class-aware sampling for OpenImages and Objects365 to get at least 2000 instances per class, resulting in 2.2M and 0.8M images, respectively (a sketch of this sampling step is given below).

2. To balance the contribution of each dataset, we merge the four datasets, taking multiple copies of the smaller datasets (e.g., 8 copies of COCO) and the class-aware sampled (CA-2k) versions of OpenImages and Objects365; the merged corpus contains 5.43M images in total (Table 2).

Table 2: Statistics of the Vision pre-training datasets. In sampling, ×k means k copies in one epoch and "CA-2k" means class-aware sampling with at least 2000 instances per class.
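As an illustration of the class-aware sampling step above ("CA-2k"), the sketch below keeps drawing images containing under-represented classes until every class reaches a target instance count. The annotation data structure and function name are hypothetical placeholders, not the actual dataset tooling.

```python
# Sketch of class-aware sampling ("CA-2k"): keep sampling images that contain
# under-represented classes until every class has >= min_per_class instances.
# `annotations` maps image_id -> set of class labels in that image (assumed format).
import random
from collections import Counter

def class_aware_sample(annotations, min_per_class=2000, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    sampled = []
    # Index images by class so rare classes can be targeted directly.
    by_class = {}
    for img_id, classes in annotations.items():
        for c in classes:
            by_class.setdefault(c, []).append(img_id)

    for c, imgs in by_class.items():
        while counts[c] < min_per_class:
            img_id = rng.choice(imgs)          # sample with replacement (duplicates allowed)
            sampled.append(img_id)
            for cls in annotations[img_id]:    # one image may cover several classes
                counts[cls] += 1
    return sampled

# Toy usage with a hypothetical two-class annotation dict.
toy = {i: {"person"} if i % 2 else {"person", "surfboard"} for i in range(100)}
subset = class_aware_sample(toy, min_per_class=10)
print(len(subset))
```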
Model Architecture (FPN vs C4). Although [23] shows that the FPN model outperforms the C4 model for object detection, recent studies [14] demonstrate that FPN does not provide more effective region features for VL tasks than C4, which is also confirmed by our experimental results. (In fact, using the same training process, the X152-C4 model even produces better object detection results than the X152-FPN model; see Appendix E for details.) We thus conduct a set of carefully designed experiments, as detailed in Appendix E, and find two main reasons for this. The first is that all layers in the C4 model used for region feature extraction are pre-trained on the ImageNet dataset, while the multi-layer perceptron (MLP) head of the FPN model is not. It turns out that the VG dataset is still too small to train good enough visual features for VL tasks, and using ImageNet-pre-trained weights is beneficial. The second is due to the different network architectures (CNN vs. MLP): the convolutional head used in C4 has a better inductive bias for encoding visual information than the MLP head of FPN. Therefore, in this study we use the C4 architecture for VLP.

Model Pre-Training. Following the common practice in object detection training, we freeze the first convolution layer, the first residual block, and all the batch-norm layers. We also use several data augmentation methods, including horizontal flipping and multi-scale training. To train a detection model with the X152-C4 architecture, we initialize the model backbone from an ImageNet-5K checkpoint [40] and train for 1.8M iterations with a batch size of 16 images.
Following [2], we add an attribute branch to the pre-trained OD model, and then fine-tune the OD model on VG to inject attribute information (524 classes). Since the object representations are pre-trained in the object detection pre-training stage, we can focus the VG fine-tuning on learning attributes by picking a much larger attribute loss weight than that used in [2, 14]. Thus, our fine-tuned model significantly outperforms previous models [2, 14] in detecting objects and attributes on VG.

With a richer set of visual objects and attributes, the classical class-aware non-maximal suppression (NMS) post-processing takes a significantly larger amount of time to remove overlapping bounding boxes, making the feature extraction process extremely slow. To improve the efficiency, we replace the class-aware NMS with class-agnostic NMS that conducts the NMS operation only once (counting the NMS in the RPN module, there are in total 2 NMS operations in our efficient region feature extractor). We also replace the time-consuming conv layers with dilation=2 used in [2] with conv layers without dilation. These two replacements make the region feature extraction process much faster than that in [2] without any accuracy drop on VL downstream tasks. We report the end-to-end inference time of VL models with different vision models on a Titan-X GPU and a CPU with a single thread in Table 20 in Appendix F.

In summary, the pre-trained OD model serves as the image understanding module, as in Equation (1), to produce visual representations (q, v) for downstream VL tasks. Here, q is the set of detected object names (in text) and v is the set of region features. Each region feature is denoted as (v̂, z), where v̂ is a P-dimensional representation from the input of the last linear classification layer of the detection head (i.e., P = 2048) and z is an R-dimensional position encoding of the region (i.e., R = 6; it includes the coordinates of the bounding box and its height and width).
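The construction of the position-augmented region feature described above can be sketched as follows. The normalization of the box coordinates is an assumption on our part; the text only states that z contains the box coordinates plus height and width.

```python
# Sketch: build the position-augmented region feature fed to the VL fusion model.
# v_hat: P-dim feature from the input of the last linear classification layer (P = 2048).
# z:     R-dim position encoding (R = 6): box corners plus height and width
#        (normalized by image size here; the exact normalization is an assumption).
import torch

def region_feature(v_hat, box, img_w, img_h):
    x1, y1, x2, y2 = box
    z = torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                      (x2 - x1) / img_w, (y2 - y1) / img_h])
    return torch.cat([v_hat, z])   # (P + R)-dim, later projected by W to BERT's hidden size

v_hat = torch.randn(2048)
feat = region_feature(v_hat, box=(10.0, 20.0, 200.0, 180.0), img_w=640, img_h=480)
print(feat.shape)  # torch.Size([2054])
```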
OSCAR+ Pre-training

The success of VLP lies in the use of a unifying model architecture for a wide range of VL tasks and the large-scale pre-training of the unified model using objectives that correlate with the performance metrics of these downstream VL tasks. In this study we pre-train an improved version of OSCAR [21], known as OSCAR+, to learn joint image-text representations using image tags as anchors for image-text alignment.
We build our pre-training corpus based on three types of existing vision and VL datasets: (1) image captioning datasets with human-annotated captions as w and machine-generated image tags as q, including COCO [25], Conceptual Captions (CC) [32], SBU captions [28] and Flickr30k [42]; (2) visual QA datasets with questions as w and human-annotated answers as q, including GQA [13], VQA [8] and VG-QAs; (3) image tagging datasets with machine-generated captions as w and human-annotated tags as q, including a subset of OpenImages (1.67M images). (The captions are generated with the captioning model released by OSCAR [21]; [6] uses a deep-learning-based text-image matching model to select the best caption candidate for a given image.) In total, the corpus contains 5.65 million unique images and 8.85 million text-tag-image triples. The detailed statistics are presented in Table 16 in the Appendix. The size of the pre-training corpus could have been significantly increased by combining large-scale image tagging datasets, such as the full set of OpenImages (9M images) and YFCC (92M images). We leave it to future work to leverage much larger corpora for model pre-training.

Table 3: Effects of different pre-training contrastive losses on downstream tasks (VQA on vqa-dev and COCO-IR), with an R50-C4 model as the Vision module and a 4-layer Transformer as the VL module in (1). The compared losses pollute q' from all q's (OSCAR), q' from QA only, w' from all w's, or both (OSCAR+). The COCO-IR metric is image-to-text retrieval R@1 on the COCO 1K test set. Blue indicates the best result for a task and black indicates the runner-up. We use the same model to extract visual features.
There are two terms in the OSCAR+ pre-training loss, as in Equation (2):

L_Pre-training = L_MTL + L_CL3.     (2)

L_MTL is the Masked Token Loss defined on the text modality (w and q), following closely [21] (see Appendix B.2 for details). L_CL3 is a novel 3-way contrastive loss. Different from the binary contrastive loss used in OSCAR [21], the proposed L_CL3 is designed to effectively optimize the training objectives used for VQA [41] and text-image matching [6]. As shown in Equation (3), L_CL3 takes into account two types of training samples x: the {caption, image-tags, image-features} triplets of the image captioning and image tagging data, and the {question, answer, image-features} triplets of the VQA data:

x ≜ (w; q, v) = (caption; tags & image), or (w, q; v) = (question & answer; image).     (3)

To compute contrastive losses, negative examples need to be constructed. We construct two types of negative (unmatched) triplets for the two types of training samples, respectively: one is the polluted "captions" (w', q, v) and the other the polluted "answers" (w, q', v). To classify whether a caption-tags-image triplet contains a polluted caption is a text-image matching task; to classify whether a question-answer-image triplet contains a polluted answer is an answer selection task for VQA. Since the encoding of [CLS] can be viewed as a representation of the triplet (w, q, v), we apply a fully-connected (FC) layer on top of it as a 3-way classifier f(.) to predict whether the triplet is matched (c = 0), contains a polluted w (c = 1), or contains a polluted q (c = 2). The 3-way contrastive loss is defined as

L_CL3 = −E_{(w,q,v;c)∼D̃} log p(c | f(w, q, v)),     (4)

where the dataset (w, q, v; c) ∈ D̃ contains 50% matched triples, 25% w-polluted triples, and 25% q-polluted triples. For efficient implementation, the polluted w' is uniformly sampled from all w's (captions and questions) and q' is uniformly sampled from all q's (tags and answers) in the corpus. As demonstrated in Table 3, when only the answer-polluted triplets are used, i.e., (w, q', v) with q' sampled from the q's of the QA corpus, the contrastive loss closely simulates the objective of the VQA task but not that of the text-image retrieval task. As a result, the pre-trained model can be effectively adapted to VQA, but not so to text-image retrieval. By contrast, the proposed 3-way contrastive loss transfers well to both tasks.
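A minimal sketch of the 3-way contrastive loss in Equation (4) is given below: half of the triplets are kept matched, a quarter get a polluted w, a quarter a polluted q, and a 3-way classifier on the [CLS] encoding is trained with cross-entropy. The fusion encoder is abstracted away as a stand-in function, and all names are illustrative.

```python
# Sketch of the 3-way contrastive loss L_CL3 (Eq. 4).
# encode(w, q, v) is assumed to return the [CLS] encoding of a triplet;
# here it is a toy stand-in so the example runs end to end.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 768
classifier = nn.Linear(hidden, 3)      # c=0 matched, c=1 polluted w, c=2 polluted q

def encode(w, q, v):                   # placeholder for the OSCAR+ transformer
    return torch.randn(hidden, requires_grad=True)

def three_way_loss(batch, all_w, all_q):
    logits, labels = [], []
    for w, q, v in batch:
        r = random.random()
        if r < 0.5:                    # 50% matched triples
            c = 0
        elif r < 0.75:                 # 25% polluted "captions"/questions
            w, c = random.choice(all_w), 1
        else:                          # 25% polluted tags/answers
            q, c = random.choice(all_q), 2
        logits.append(classifier(encode(w, q, v)))
        labels.append(c)
    return F.cross_entropy(torch.stack(logits), torch.tensor(labels))

# Toy usage: v is a dummy region-feature tensor.
batch = [("a boy surfing", "boy surfboard wave", torch.randn(50, 2048)) for _ in range(8)]
loss = three_way_loss(batch, all_w=["a dog running"], all_q=["dog grass"])
loss.backward()
```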
We pre-train two model variants, denoted as OSCAR+_B and OSCAR+_L, which are initialized with the parameters θ_BERT of BERT base (L = 12, H = 768, A = 12) and large (L = 24, H = 1024, A = 16), respectively, where L is the number of layers, H the hidden size, and A the number of self-attention heads. To ensure that the image region features have the same input embedding size as BERT, we transform the position-augmented region features using a linear projection via a matrix W. The trainable parameters are θ = {θ_BERT, W}. Both variants are pre-trained for at least a million steps, each with its own learning rate and batch size, and with fixed maximum sequence lengths for the language tokens [w, q] and the region features v.

We adapt the pre-trained models to seven downstream VL tasks, including five understanding tasks and two generation tasks. Each task poses different challenges for adaptation. This section briefly introduces the tasks and our fine-tuning strategy. We refer the readers to Appendix C for details.
VQA & GQA
These two are the most widely used understanding tasks for evaluating VL models in the research community. The tasks require the model to answer natural language questions based on an image. In this study, we perform experiments on the widely-used VQA v2.0 dataset [8] and the GQA dataset [13]. Following the setting of [2], for each question, the model picks an answer from a shared answer set. When adapting a VLP model to the VQA task, we construct the input by concatenating a given question, object tags and object region features, and then feed the [CLS] output from OSCAR+ to a task-specific linear classifier with a softmax layer for answer prediction.
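A minimal sketch of the VQA head described above: the [CLS] output of the fusion model goes through a task-specific linear classifier over the shared answer set. The answer-set size and the soft-target loss follow common VQA practice and are assumptions here, since the exact values are not stated above.

```python
# Sketch of VQA fine-tuning: a linear classifier over the shared answer set on top
# of the [CLS] output; trained with a soft-target multi-label loss (binary
# cross-entropy), a common VQA recipe assumed here.
import torch
import torch.nn as nn

num_answers = 3129                     # typical VQA v2 answer-set size (assumption)
vqa_head = nn.Linear(768, num_answers)

cls_output = torch.randn(16, 768)      # [CLS] encodings for a batch of (question, tags, regions)
soft_targets = torch.rand(16, num_answers)   # relevancy-based soft scores in [0, 1]

logits = vqa_head(cls_output)
loss = nn.functional.binary_cross_entropy_with_logits(logits, soft_targets)
answer_ids = logits.argmax(dim=-1)     # predicted answer index at inference time
```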
Image Captioning & NoCaps
The captioning task is to generate a natural language caption for an image. This is the most widely used VL generation task in the research community; the Image Captioning Leaderboard (https://competitions.codalab.org/competitions/3221) hosts more than 260 models as of December 10, 2020. To enable caption generation, we fine-tune OSCAR+ using the seq2seq objective. Each training sample is converted to a triplet consisting of a caption, a set of image region features, and a set of object tags. We randomly mask out a fraction of the caption tokens, and use the encoding of the remaining context (the triplet) to predict the masked tokens. Similar to VLP [21, 45], the self-attention mask is constrained such that a caption token can only attend to the tokens before its position, to simulate a uni-directional generation process. All caption tokens have full attention to image regions and object tags, but not the other way around. During inference, we first encode the image regions, the object tags, and a special token [CLS] as input. Then the model starts to generate a caption by feeding in a [MASK] token and sampling a token from the vocabulary based on the predicted token probabilities. Next, the [MASK] token in the previous input sequence is replaced with the sampled token and a new [MASK] is appended for the next word prediction. The generation process terminates when the model outputs the [STOP] token or the generated sentence exceeds a pre-defined maximum length. We perform image captioning experiments on the COCO image captioning dataset [25]. Novel Object Captioning at Scale (NoCaps) [1] extends the image captioning task to test a model's capability of describing novel objects from the Open Images dataset [17] which are unseen in the training corpus. Following the restriction guideline of NoCaps, we use the predicted Visual Genome and Open Images labels to form the input tag sequences, and directly train OSCAR+ on COCO without the initialization from pre-training. VIVO [9] proposed a VLP technique using only image tagging data, and achieved SOTA results on NoCaps by fine-tuning on COCO captions. We reproduced VIVO with only one change, i.e., replacing its original vision model with our new vision model, and improved the VIVO performance significantly (denoted as VinVL+VIVO), as reported in Table 9.
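The inference procedure described above (append a [MASK], predict it, replace it with the sampled token, repeat) can be sketched as a simple loop. The model is abstracted as a function returning next-token probabilities, and the special-token ids are placeholders rather than the actual vocabulary ids.

```python
# Sketch of the mask-then-fill caption generation loop used at inference time.
# `predict_masked` stands in for the fine-tuned OSCAR+ model: given the current
# token ids (with a [MASK] at the end) plus tags and region features, it returns
# a probability distribution over the vocabulary for that [MASK] position.
import torch

MASK_ID, CLS_ID, STOP_ID, VOCAB = 103, 101, 102, 30522   # placeholder ids

def predict_masked(token_ids, tag_ids, region_feats):
    return torch.softmax(torch.randn(VOCAB), dim=-1)     # dummy distribution

def generate_caption(tag_ids, region_feats, max_len=20):
    tokens = [CLS_ID]
    for _ in range(max_len):
        probs = predict_masked(tokens + [MASK_ID], tag_ids, region_feats)
        next_tok = int(torch.multinomial(probs, 1))       # sample the masked position
        if next_tok == STOP_ID:
            break
        tokens.append(next_tok)                           # [MASK] replaced by the sample
    return tokens[1:]                                      # drop [CLS]

caption_ids = generate_caption(tag_ids=[2003, 2005], region_feats=torch.randn(50, 2048))
print(len(caption_ids))
```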
Image(-to-Text) Retrieval & Text(-to-Image) Retrieval
Both tasks require the model to calculate a similarity score between an image and a sentence. Thus, the tasks are widely used to directly measure the quality of the cross-modal VL representation. Following [21], we formulate the task as a binary classification problem: given a matched image-text pair, we randomly select a different image or a different sentence to form an unmatched pair. The representation of [CLS] is used as the input to a classifier to predict a score indicating how likely the given pair is matched. In testing, the predicted score is used to rank the image-text pairs of a query. Following [19], we report the top-K retrieval results on both the 1K and 5K COCO test sets.
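A sketch of the binary retrieval formulation described above: the [CLS] encoding of an image-text pair is scored by a classifier and the score is used to rank candidates for a query. The encoder is a stand-in and the data layout is assumed.

```python
# Sketch of text-image retrieval as binary classification: score each
# (sentence, image) pair via a classifier on the [CLS] encoding and rank by score.
import torch
import torch.nn as nn

score_head = nn.Linear(768, 1)

def encode_pair(sentence, tags, region_feats):     # stand-in for the OSCAR+ encoder
    return torch.randn(768)

def rank_images(query_sentence, candidates):
    """candidates: list of (image_id, tags, region_feats); returns ids sorted by match score."""
    scores = {img_id: score_head(encode_pair(query_sentence, tags, feats)).item()
              for img_id, tags, feats in candidates}
    return sorted(scores, key=scores.get, reverse=True)

cands = [(i, "boy surfboard", torch.randn(50, 2048)) for i in range(5)]
print(rank_images("a boy riding a wave", cands)[:1])   # top-1 retrieved image id
```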
NLVR2
The dataset is developed for joint reasoning about natural language and images [35]. The task is to determine whether a text description is true about a pair of images. For fine-tuning, we first construct two input sequences, each containing the concatenation of the given text description and one of the images, and then the two [CLS] outputs from OSCAR+ are concatenated to form the input to a binary classifier for prediction.
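The NLVR2 adaptation above can be sketched as follows: each image is paired with the description, the two [CLS] outputs are concatenated, and a binary classifier predicts whether the statement is true. The encoder is again a stand-in.

```python
# Sketch of the NLVR2 head: concatenate the [CLS] outputs of (text, image1)
# and (text, image2) and feed them to a binary classifier.
import torch
import torch.nn as nn

nlvr2_head = nn.Linear(2 * 768, 2)      # true / false

def encode(text, tags, region_feats):   # stand-in for the OSCAR+ encoder
    return torch.randn(768)

def nlvr2_logits(text, img1, img2):
    cls1 = encode(text, *img1)          # img = (tags, region_feats)
    cls2 = encode(text, *img2)
    return nlvr2_head(torch.cat([cls1, cls2], dim=-1))

img = ("boy surfboard", torch.randn(50, 2048))
print(nlvr2_logits("both images contain a surfboard", img, img).shape)  # torch.Size([2])
```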
Experiments & Analysis

To account for model parameter efficiency, we group the SoTA models into three categories: (i) SoTA_S indicates the best performance achieved by small models prior to the Transformer-based VLP models; (ii) SoTA_B indicates the best performance produced by VLP models of a similar size to BERT base; (iii) SoTA_L indicates the best performance yielded by VLP models that have a similar size to BERT large.

Table 4 gives an overview of the results of OSCAR+ (short for OSCAR+ with our new OD model in this subsection) on seven VL tasks, compared to previous SoTAs. OSCAR+ outperforms previous SoTA models on all tasks, often by a significantly large margin. The result demonstrates the effectiveness of the region features produced by the new OD model. All the (single-model) SoTAs are from published results. For all the tables in this paper, blue indicates the best result for a task, and a gray background indicates results produced by OSCAR+; the only exception is B@4 on image captioning.

Table 4: An overall comparison with SoTAs on the seven tasks (VQA, GQA, Image Captioning, NoCaps, Image Retrieval, Text Retrieval, NLVR2). ∆ indicates the improvement over SoTA. SoTA with subscript S, B, L indicates performance achieved by small models, and by models with a model size similar to BERT base and large, respectively. SoTAs: VQA is from ERNIE-ViL [43], GQA is from NSM [12], NoCaps is from VIVO [9], NLVR2 is from VILLA [7], and the rest of the tasks are from OSCAR [21].
Table 5: Evaluation results on VQA (test-dev and test-std), comparing ViLBERT, VL-BERT, VisualBERT, LXMERT, 12-in-1, UNITER, OSCAR, VILLA, ERNIE-ViL, InterBERT, and OSCAR+ (Base and Large variants). * denotes the No.1 ensemble model of InterBERT Large on the VQA leaderboard.
Table 6: Evaluation results on GQA (test-dev and test-std), comparing LXMERT, MMN [3], 12-in-1, OSCAR_B, NSM [12], and OSCAR+_B.

In Tables 5 to 11, we report the detailed results for each downstream task, respectively.
(i) The VQA results are shown in Table 5, where our single OSCAR+_B model outperforms the best ensemble model (InterBERT large [22]) on the VQA leaderboard (https://eval.ai/web/challenges/challenge-page/514/leaderboard/1386) as of Dec. 12, 2020. (ii) The GQA results are shown in Table 6, where OSCAR+ is the first VLP model that outperforms the neural state machine (NSM) [12], which contains some sophisticated reasoning components deliberately designed for the task. (iii) The Image Captioning results on the public "Karpathy" 5k test split are shown in Table 7. Table 8 shows a concise version of the COCO image captioning online leaderboard. The online testing setting reports the results on 40K images, with 5 reference captions (c5) and 40 reference captions (c40) per image. At the time of submitting this paper, our single model achieves No.1 on the entire leaderboard, outperforming all 263 models, including many ensemble (and anonymous) models. (iv) The Novel Object Captioning (NoCaps) results are shown in Table 9. Without any VLP, i.e., by directly training a BERT-based captioning model on COCO, the model with our new visual features (denoted as VinVL) already surpasses the human performance in CIDEr (NoCaps leaderboard: https://eval.ai/web/challenges/challenge-page/355/leaderboard/1011). By adding VIVO [9] pre-training, our VinVL improves the original VIVO result by 6 CIDEr points and creates a new SoTA. (v) Overall, on all these tasks (VQA in Table 5, Image Captioning in Table 7, NoCaps in Table 9, Image-Text Retrieval in Table 10, NLVR2 in Table 11), we show that OSCAR+_B can match or outperform previous SoTA large models, and OSCAR+_L substantially uplifts the SoTA.

Table 7: Image captioning evaluation results (single model) on the COCO "Karpathy" test split, with cross-entropy optimization and CIDEr optimization, comparing BUTD [2], VLP [45], AoANet [10], OSCAR_B/L [21], and OSCAR+_B/L. (Note: B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE.)
Table 8: Leaderboard of the state-of-the-art image captioning models on the COCO online testing server, reporting BLEU@1–4, METEOR, ROUGE-L and CIDEr-D with 5 (c5) and 40 (c40) reference captions, comparing BUTD [2], AoANet [10], X-Transformer [29], and OSCAR+.
Table 9: NoCaps evaluation results (in-domain, near-domain, out-of-domain and overall CIDEr and SPICE on the validation and test sets), comparing UpDown+, OSCAR_B*, OSCAR_L*, Human [1], VIVO* [9], VinVL*, and VinVL+VIVO. All the models are trained on COCO without additional image-caption pairs, following the restriction of NoCaps. (UpDown+ is UpDown+ELMo+CBS, the models with * are +SCST+CBS, and VinVL+VIVO is with SCST only.)
Table 10: Text and image retrieval evaluation (R@1, R@5, R@10) on the COCO 1K and 5K test sets, comparing Unicoder-VL [19], UNITER [4], OSCAR, and OSCAR+ (B for Base, L for Large).
SCAR
VILLA O
SCAR +base base base base large base large base large base largeDev . .
40 74 . − .
14 78 .
40 78 .
07 79 .
12 78 .
39 79 . .
05 82 . Test-P . .
00 74 .
50 78 .
87 77 .
87 79 .
50 78 .
36 80 .
37 79 .
47 81 . .
08 83 . Table 11: Evaluation results on NLVR2.10ision vl no VLP O
SCAR B [21] O SCAR + B (ours) R101-C4 [2] 68.52 ± ± VinVL (ours) 71.34 ± – 74.90 ± Table 12: Effects of vision (V) and vision-language (VL) pre-training on VQA.concise version of the COCO image captioning online leaderboard . The online testing setting reports theresults on 40K images, with 5 reference captions (c5) and 40 reference captions (c40) per image. At the timeof submitting this paper, our single model achieves No.1 on the entire leaderboard, outperforming all 263models, including many ensemble (and anonymous) models. ( iv ) The Novel Object Captioning (
NoCaps )results are shown in Table 9. Without any VLP, i.e. by directly training a BERT-based captioning modelon COCO, the model with our new visual features (denoted as VinVL) already surpasses the human perfor-mance in CIDEr . By adding VIVO [9] pre-training, our VinVL improves the original VIVO result by 6CIDEr points and creates a new SoTA. ( v ) Overall, on all these tasks (VQA in Table 5, Image Captioning inTable 7, NoCaps in Table 9, Image-Text Retrieval in Table 10, NLVR2 in Table 11), we show that O
SCAR + B can match or outperform previous SoTA large models, and O SCAR + L substantially uplifts the SoTA. We select the VQA task for the ablation study because its evaluation metric is well-defined and the task hasbeen used as a testbed for all VLP models. To assist our analysis, we create a local validation set, vqa-dev,out of the standard validation set to select the best model during training for evaluation. vqa-dev containsrandomly sampled 2K images and their corresponding questions, amounting to 10.4K image-QA pairs intotal. Except for Table 4 and 5, all our VQA results are reported on this vqa-dev set. Unless otherwisespecified, the reported STD is half of the difference of two runs of the VQA training with different randomseeds.In VQA, the VL model y = VL ( w , q , v ) has w as the question and y as the answer. We focus on study-ing the effect of visual features v produced by different Vision models Vision ( Img ) to better understandtheir relative contribution in the VQA performance. To eliminate the impact of using different tags q , weuse the same tags in the VQA models of O SCAR [21]. All the ablation experiments are conducted usingmodels of the BERT-base size.
How much do the V and VL matter to the SoTA?
Table 12 shows the VQA results with different vision models, i.e., the R101-C4 model from [2] and our X152-C4 model pre-trained with 4 datasets (VinVL), and with different VLP methods, i.e., no VLP, OSCAR [21] and our OSCAR+. Taking the OSCAR_B model with R101-C4 features as the baseline, the OSCAR+_B model with our X152-C4 features improves the absolute accuracy from 72.38 to 74.90, in which the OSCAR+ pre-training contributes 5% of the gain and the vision pre-training (the improved visual features) contributes 95%. This demonstrates that vision representations matter significantly in VLP and downstream tasks. Taking the "no VLP" model with R101-C4 features as the baseline, Table 12 shows that the gains of VinVL (71.34 − 68.52 = 2.82) and VLP (72.38 − 68.52 = 3.86) are additive (74.90 − 68.52 ≈ 2.82 + 3.86). This is intuitive because vision pre-training and VLP improve the Vision model Vision(Img) and the VL model VL(w, q, v) separately. This also indicates that our pre-trained vision model can be utilized in any VL model by directly replacing its vision model, such as R101-C4 [2], with ours.

Table 13: Ablation of model size and data size on training vision models.

Table 14: Effect of vision pre-training on object detection tasks (COCO mAP, VG object mAP, and VG attribute mAP with gt boxes) for R50-FPN, R50-C4 and X152-C4, pre-trained on ImageNet or on our four datasets (4Sets). * Since our four pre-training datasets contain Objects365, it is not surprising that we obtain better results than the 42.3 mAP in [31], which is obtained by pre-training on Objects365.

How much do data and model sizes matter to the new vision model?
The improvement of VQA from R101-C4 [2] to VinVL (ours) in Table 12 is a compound effect of increasing the model size (from R101-C4 to X152-C4) and the data size (from VG to our merged four OD datasets). Table 13 shows the ablation of the two factors without VLP. Although VG's large object and attribute vocabulary allows the model to learn rich semantic concepts, VG does not contain large amounts of annotations for effective training of deep models. Vision models trained using the merged four OD datasets perform much better than VG-only-trained models, and the improvement is larger with the increase of the model size.

How much does OD model architecture matter?
The choice of model architecture affects the VQA performance. Table 13 shows that R50-FPN under-performs R50-C4 when they are trained only on VG, but the performance gap diminishes when both are trained on the merged dataset (4Sets). A detailed comparison between the FPN and C4 architectures is presented in Appendix E.
How much does OD pre-training matter for object detection tasks?
Table 14 presents the object detection results on COCO and the object-attribute detection results on VG (1594 object classes, 524 attribute classes). The results show that OD pre-training benefits the object detection tasks. Note that the mAP on VG is much lower than that on typical OD datasets (such as COCO) for two reasons: (1) VG contains a large number of object classes with limited and extremely unbalanced annotations, and (2) there are many missing annotations in the VG evaluation data. Although the mAP numbers are low, the detection results using X152-C4 are reasonably good; see Appendix A for more visualizations. We also see that FPN models perform consistently worse in attribute detection than C4 models, nor do FPN models show any advantage in object detection on VG. This contributes to the inferior performance of FPN, compared to C4, on downstream VL tasks, as discussed in Section 2.1. (The R101-C4 model in Table 13 is exactly the VG-pre-trained model from [2]; we do not train this model on our merged OD dataset because this model architecture is old-fashioned and slow to train. As a reference, the R101-C4 model from [2] on VG with 1600 objects and 400 attributes has an mAP of 8.7/7.8 evaluated with our code, whereas it was reported as 10.2/7.8 due to differences in the OD evaluation pipeline.)

Table 15: Effect of the object-attribute vocabulary on VQA. We use all grid features (maximal 273) for the ImageNet classification model (first column), and maximal 50 region features for the OD models (other columns).
How much does the diversity of visual concepts, i.e., object and attribute vocabularies, matter?
We directly train vision models on different datasets, including (1) standard ImageNet with 1K classes (ImageNet), (2) Visual Genome with the 317 object classes (VG-obj) that are shared with the COCO 80 classes and the OpenImagesV5 500 classes, (3) VG with all 1594 object classes (VG w/o attr), (4) VG with 1594 object classes and 524 attribute classes (VG), and (5) the merged OD dataset (4Sets) for pre-training and VG for fine-tuning. For all the OD models (the last four columns in Table 15), we initialize the OD training with an ImageNet-pre-trained classification model, and use maximal 50 region features per image as input to the VL fusion module. For the ImageNet-pre-trained classification model (the second column in Table 15), we use all the grid features (maximal 273) for each image; our use of grid features follows PixelBERT [11], see Appendix F for details. The results show that:

• In general, vocabularies with richer objects lead to better VQA results: VG-obj < ImageNet < VG w/o attr. The VG-obj vocabulary contains 79 of the 80 COCO classes (only missing potted plant) and 313 of the 500 OpenImagesV5 classes, and is a good approximation of the common object classes of typical OD tasks. However, our results show that this vocabulary is not rich enough for VL tasks because it misses many important visual concepts (e.g., sky, water, mountain, etc.) which are crucial for VL tasks, as also illustrated by the comparison of detected regions in Figure 1. (Using the same training procedure on VG, we also trained an R50-C4 model on the OpenImagesV5 dataset (500 classes); using the region features produced by this model, the VQA performance is 63.55.)

• Attribute information is crucial to VL tasks: models trained with attributes (VG and 4Sets → VG) are significantly better than those without attributes.

• Even for the small vision model R50-C4, vision pre-training improves the visual features for VQA, i.e., 4Sets → VG is the best performer.
Conclusion

In this paper we have presented a new recipe to pre-train an OD model for VL tasks. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger text-image corpora, and thus can generate visual features for a richer collection of visual objects and concepts that are crucial for VL tasks. We validate the new model via a comprehensive empirical study where we feed the visual features to a VL fusion model which is pre-trained on a large-scale paired text-image corpus and then fine-tuned on seven VL tasks. Our results show that the new OD model can substantially uplift the SoTA results on all seven VL tasks across multiple public benchmarks.

Acknowledgement
We thank Xi Yin for her contributions to this project while she was at Microsoft. We thank Xiyang Dai for his conjecture that the C4 architecture is better than FPN because C4 makes better use of the ImageNet initialization weights.
References

[1] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, 2019.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[3] Wenhu Chen, Zhe Gan, Linjie Li, Yu Cheng, William Wang, and Jingjing Liu. Meta module network for compositional visual reasoning. arXiv preprint arXiv:1910.03230, 2019.
[4] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[5] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.
[6] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.
[7] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195, 2020.
[8] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[9] Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. VIVO: Surpassing human performance in novel object captioning with visual vocabulary pre-training. arXiv preprint arXiv:2009.13682, 2020.
[10] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In ICCV, 2019.
[11] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
[12] Drew Hudson and Christopher D Manning. Learning by abstraction: The neural state machine. In NeurIPS, 2019.
[13] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. arXiv preprint arXiv:1902.09506, 2019.
[14] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and Xinlei Chen. In defense of grid features for visual question answering. In CVPR, pages 10267–10276, 2020.
[15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[16] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[17] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[18] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, 2018.
[19] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
[20] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[21] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pages 121–137. Springer, 2020.
[22] Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, and Hongxia Yang. InterBERT: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198, 2020.
[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[27] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.
[28] Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, 2011.
[29] Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. X-Linear attention networks for image captioning. In CVPR, pages 10971–10980, 2020.
[30] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
[31] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, pages 8430–8439, 2019.
[32] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[33] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In IJCAI, 2019.
[34] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[35] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018.
[36] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
[37] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: A simple and strong anchor-free object detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[38] Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. Position focused attention network for image-text matching. arXiv preprint arXiv:1907.09748, 2019.
[39] Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In ICCV, 2019.
[40] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[41] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, pages 21–29, 2016.
[42] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[43] Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934, 2020.
[44] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, and Yi-Dong Shen. Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535, 2017.
[45] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. In AAAI, 2020.
A Qualitative study of three pre-trained vision models
We apply three (pre-trained) object detection models to the image in Figure 1 and list their detection results for a more detailed comparison.

Detections from X152-FPN trained on Open Images V5. See Figure 2:
Surfboard; Surfboard; Surfboard; Surfboard; Man; Human leg; Human leg; Swimwear; Swimwear; Shorts; Shorts; Boy; Human arm.

Detections from R101-C4 trained on VG by Anderson et al. [2]. There are obviously wrong detections, marked in red. See Figure 3 (top): black shorts; young, shirtless, standing, barefoot, surfing, little, playing boy; shirtless, standing, barefoot, walking, wet, surfing, young man; tan, bare, shirtless back; blue, clear, cloudy, hazy, light blue sky; young, shirtless, standing, surfing, barefoot, little boy; brown, short, wet, blond hair; brown, short, wet, blond hair; small, crashing wave; white, wet surfboard; white, crashing, big, rolling wave; wet, tan surfboard; green, blue fin; blue, calm, choppy, wavy, ocean, splashing, foamy, water, rough, sandy, wet ocean; wet, calm, sandy, splashing, wavy water; white, wet surfboard; bare, wet foot; blue, colorful, multi colored, floral shorts; calm, choppy, water, rough, foamy, wavy water; distant, rocky, hazy mountains; standing, shirtless, young, barefoot, wet, surfing, walking, smiling boy; calm ocean; distant, rocky mountain; white, bare, wet surfboard; wet, sandy, calm, tan beach; gray, big rock; blue, calm background; wet, brown, tan, sandy sand; wet shadow; blue, colorful, floral, multi colored swim trunks; yellow, plastic hand.

Detections from our X152-C4 model pre-trained on four datasets and fine-tuned on VG. There are some repetitive detections, but no obviously wrong detections. See Figure 3 (bottom): blue, green fin; young, barefoot, shirtless, standing, surfing, smiling, little, playing, looking, blond boy; young, barefoot, standing, shirtless, smiling, surfing, blond, playing, looking, little, walking, riding boy; shirtless, barefoot, standing, young, smiling, surfing, walking, wet, playing man; bare, wet foot; black, white surfboard; small, large, white, crashing, big, water, rolling, splashing, rough, foamy wave; bare, wet foot; dark, black, wet, cast shadow; blue, clear, hazy, cloudy, cloudless sky; black, gray, white, raised surfboard; black, wet, short short; brown, short, blond, wet, curly, wavy hair; distant, brown, large, rocky, hazy, big mountain; brown, short, dark, blond, wet hair; blue, white, calm, wavy, choppy, ocean, splashing, water, rough, clear, shallow water; bare, tan, light, beige back; black, blue, wet surfboard; small, dark, water, crashing, rolling, splashing, big wave; wet, white, sandy, tan surfboard; blue, colorful, floral, multi colored, patterned trunk; wet, brown, sandy, tan sand; white, blue, calm, foamy, choppy, splashing, wavy, ocean, rough, water, clear, shallow water; wet, brown, sandy, calm, tan, shallow, smooth, muddy, rough beach; black, white, young board; shirtless, young, standing, barefoot, smiling, surfing, looking, walking, playing boy; blue, calm, choppy, wavy, ocean, clear, rough, splashing, water, foamy, shallow, rippled ocean; yellow, gold bracelet; white, silver, black logo; wet, bare, bent, tan, crossed, hairy, short, skinny, back, muscular, extended, outstretched leg; black, gray, white board; brown, distant, large, rocky, big hill; brown, short, blond, wet, curly head; red, black logo; bare, raised, extended, holding, open, up, bent, outstretched hand; black, wet swim trunks; bare, wet, bent, tan, crossed, skinny, short, back, muscular leg; wet, brown, muddy, sandy, tan, shallow reflection.
B OSCAR+ Pre-training
B.1 Pre-training Corpus
Table 16 shows the statistics of the images and text of the pre-training corpora. In our ablation study, we use corpora of three different sizes: 'Small', 'Medium', and 'Large'. Different from OSCAR [21], we make use of the image tagging dataset OpenImages, by generating captions using OSCAR's image captioning model to form triplets of "(generated caption, image tags, image features)" for the OSCAR+ pre-training. Thanks to this self-training technique, our pre-training corpus can be scaled up to a much larger size by making use of large-scale image tagging datasets, e.g., OpenImages (9M) and YFCC (92M).

Table 16: Statistics of the pre-training corpus.
Small: 0.22M images, 2.5M QAs, 0.7M captions.
Medium: 1.89M images, 2.5M QAs, 0.7M captions, 1.67M pseudo-captions.
Large: 5.65M images, 2.5M QAs, 4.68M captions, 1.67M pseudo-captions.
Sources (image/text counts): VQA (train) 83k/545k; GQA (bal-train) 79k/1026k; VG-QA (train) 87k/931k; COCO (train) 112k/559k; Flickr30k (train) 29k/145k; OpenImages (od train) 1.67M/1.67M; CC (train) 3.1M/3.1M; SBU (all) 875k/875k. For the QA datasets, (w, q, v) = (question, answer, image features); for the caption and tagging datasets, (w, q, v) = ((generated) caption, (generated) image tags, image features).
B.2 OSCAR+ Pre-training Objectives
Masked Token Loss: A Loss that Mimics Image Captioning.
The word tokens of the image captions (questions) w and the word tokens of the object tags (answers) q share the same linguistic semantic space, and the Masked Token Loss (MTL) is applied to tokens of both w and q. We define the discrete token sequence as h ≜ [w, q], and apply the Masked Token Loss (MTL) for pre-training. At each iteration, we randomly mask each input token in h with a pre-set probability, and replace the masked token h_i with a special token [MASK]. The goal of training is to predict these masked tokens based on their surrounding tokens h_{\i} and all image features v by minimizing the negative log-likelihood:

L_MTL = −E_{(v,h)∼D} log p(h_i | h_{\i}, v).     (5)

This is the same MTL as in OSCAR [21] and similar to the masked language model used by BERT. The masked word or tag needs to be recovered from its surrounding context, with additional image information to help ground the learned word embeddings in the vision context.
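A minimal sketch of the masked token loss in Equation (5): tokens of h = [w, q] are masked at random and predicted from the remaining tokens and the image features. The masking probability, prediction head, and encoder stand-in are assumptions for illustration.

```python
# Sketch of the Masked Token Loss (Eq. 5): randomly mask tokens of h = [w, q]
# and predict them from the surrounding tokens and the image features v.
import torch
import torch.nn as nn

MASK_ID, VOCAB, HIDDEN = 103, 30522, 768
mlm_head = nn.Linear(HIDDEN, VOCAB)

def encode(token_ids, v):                     # stand-in for the OSCAR+ transformer
    return torch.randn(token_ids.shape[0], token_ids.shape[1], HIDDEN)

def masked_token_loss(h_ids, v, mask_prob=0.15):   # 15% is an assumed probability
    mask = torch.rand(h_ids.shape) < mask_prob
    inputs = h_ids.masked_fill(mask, MASK_ID)
    logits = mlm_head(encode(inputs, v))
    # Cross-entropy only over the masked positions (ignore_index skips the rest).
    targets = h_ids.masked_fill(~mask, -100)
    return nn.functional.cross_entropy(logits.view(-1, VOCAB), targets.view(-1),
                                       ignore_index=-100)

h = torch.randint(0, VOCAB, (4, 35))          # tokenized [caption, tags] for a batch of 4
loss = masked_token_loss(h, v=torch.randn(4, 50, 2048))
```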
We present our 3-way contrastive loss in Section 3.2 in the main paper.
B.3 Ablation of the two new techniques
Effect of self-training: Leveraging Image Tagging data.
In Figure 4, we show the effect of self-training by making use of tagging data in OSCAR+, by fine-tuning OSCAR+ pre-training checkpoints on VQA. Compared with "OSCAR+, Small; VinVL" (green), "OSCAR+, Medium; VinVL" (yellow) adds the 1.7M OpenImages tagging data into pre-training and its performance improves significantly, demonstrating the effect of self-training by making use of tagging data. As baselines, we also provide the performance of OSCAR and OSCAR+ with image features from [2], which clearly demonstrates that the new image features pre-trained by VinVL matter significantly in VL pre-training and VL downstream tasks.
Effect of the new 3-way contrastive loss.
As illustrated in Table 3, with the new 3-way contrastive loss, the VQA performance stays the same compared with the OSCAR pre-training, while the text-image retrieval performance improves significantly compared with the OSCAR pre-training.

Figure 4: Effect of the OSCAR+ pre-training corpus size and effect of self-training by making use of tagging data in OSCAR+. Each curve, with legend "VLP, Corpus; VisionFeature", denotes a VLP experiment where the VLP method is either OSCAR or OSCAR+, the VLP pre-training corpus is Small/Medium/Large (defined in Table 16), and VisionFeature is either our new vision features (VinVL for short) or those from [2] ([2] for short). The x-axis denotes the pre-training iterations of the OSCAR+ checkpoints. The y-axis is the vqa-dev accuracy of a VQA model initialized from the corresponding pre-training checkpoint and fine-tuned with a fixed scheme. Compared with "OSCAR+, Small; VinVL" (green), "OSCAR+, Medium; VinVL" (yellow) adds the 1.7M OpenImages tagging data into the pre-training and its performance improves significantly, demonstrating the effect of self-training by making use of tagging data. "OSCAR+, Large; VinVL" (blue) further scales up the pre-training corpus by adding the Google Conceptual Captions and SBU datasets with generated tags and its performance improves further, demonstrating the effect of the OSCAR+ pre-training corpus size. As baselines, we also provide the performance of OSCAR and OSCAR+ with image features from [2], which clearly demonstrates that our new image features (VinVL) matter significantly in VL pre-training and VL downstream tasks.
Overall improvement from OSCAR to OSCAR+. We point out that the improvement from OSCAR to OSCAR+ with image features from [2] is minor, because (1) we only add 1.7M OpenImages tagging data to enlarge the pre-training corpus, which is a small portion compared with OSCAR's original pre-training corpus (i.e., Large \ OI, 3.98M images and 7.18M image-caption pairs), and (2) the new 3-way contrastive loss yields more significant improvements on text-image retrieval tasks than on the VQA task, as illustrated in Table 3. We would expect much more significant improvements when we scale up the OSCAR+ pre-training corpus by adding large-scale image tagging datasets, e.g., OpenImages (9M) and YFCC (92M).
C Downstream Tasks Fine-tuning
We follow the downstream task fine-tuning recipes in OSCAR [21].
C.1 VQA
Given an image and a question, the task is to select the correct answer from a multi-choice list; it requires the model to answer natural language questions based on an image. We conduct experiments on the widely-used VQA v2.0 dataset [8], which is built on the MSCOCO [25] images. Following [2], for each question, the model picks the corresponding answer from a shared candidate set.

When fine-tuning on the VQA task, the input sequence is the concatenation of the given question, object tags, and object region features, and the [CLS] output from OSCAR+ is fed to a task-specific linear classifier for answer prediction. Following the literature [2], we treat VQA as a multi-label classification problem: we assign a soft target score to each answer based on its relevancy to the human answer responses, and then fine-tune the model by minimizing the cross-entropy loss computed between the predicted scores and the soft target scores (see the sketch at the end of this subsection). During inference, we simply use Softmax for answer prediction.

For VQA training, we randomly sample a set of 2k images from the MS COCO validation set as our validation set; the rest of the images in the training and validation sets are used for VQA fine-tuning. The OSCAR+B and OSCAR+L models are fine-tuned with model-specific numbers of epochs, learning rates, and batch sizes.
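A minimal sketch of the VQA head described above follows. The model interface and the `vqa_classifier` name are illustrative; binary cross-entropy with logits is one common realization of the soft-target cross-entropy, and the released code may implement the loss differently.

```python
import torch
import torch.nn.functional as F

def vqa_finetune_step(model, input_ids, tag_ids, region_feats, soft_targets):
    """The [CLS] output of the fused (question, tags, region features) sequence
    feeds a linear classifier trained against soft target scores."""
    cls_repr = model(input_ids=input_ids, tag_ids=tag_ids, visual_feats=region_feats)[:, 0]
    logits = model.vqa_classifier(cls_repr)                 # (batch, num_answers)
    return F.binary_cross_entropy_with_logits(logits, soft_targets)

@torch.no_grad()
def vqa_predict(model, input_ids, tag_ids, region_feats):
    """Inference simply takes the Softmax over answer scores."""
    cls_repr = model(input_ids=input_ids, tag_ids=tag_ids, visual_feats=region_feats)[:, 0]
    return torch.softmax(model.vqa_classifier(cls_repr), dim=-1).argmax(dim=-1)
```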
C.2 GQA

Similar to VQA, GQA tests the reasoning capability of the model to answer a question. We conduct experiments on the public GQA dataset [13]. For each question, the model chooses an answer from a shared candidate set. Our fine-tuning procedure follows OSCAR [21, 3]: we first fine-tune the model on the unbalanced "all-split" and then continue fine-tuning on the "balanced-split".
C.3 Image Captioning

An image captioning model generates a natural language description for a given image. To enable sentence generation, we fine-tune OSCAR+ using the seq2seq objective. The input samples are processed into triples consisting of image region features, captions, and object tags, in the same way as during pre-training. We randomly mask out a fraction of the caption tokens and use the corresponding output representations to perform classification to predict the token ids. Similar to previous works [21, 45], the self-attention mask is constrained such that a caption token can only attend to the tokens before its position, to simulate a uni-directional generation process. Note that all caption tokens have full attention to image regions and object tags, but not the other way around; a sketch of this attention mask and the generation loop is given at the end of this subsection.

During inference, we first encode the image regions, object tags, and a special token [CLS] as input. The model then starts the generation by feeding in a [MASK] token and selecting a token from the vocabulary based on the likelihood output. Next, the [MASK] token in the previous input sequence is replaced with the selected token and a new [MASK] is appended for the next word prediction. The generation process terminates when the model outputs the [SEP] token. We use beam search (i.e., beam size = 5) [2] in our experiments and report our results on the COCO image captioning dataset.

Though the training objective (i.e., seq2seq) for image captioning is different from that used in pre-training (i.e., the bidirectional attention-based masked token loss), we directly fine-tune OSCAR+ for image captioning on COCO without additional pre-training on Conceptual Captions [32]. This is to validate the generalization ability of the OSCAR+ models for generation tasks. We use the same Karpathy split [15]. For the OSCAR+B model, we first fine-tune with cross-entropy loss and then with CIDEr optimization [30]; the numbers of epochs, batch sizes, and learning rates are stage-specific. We compare with several existing methods, including BUTD [2], VLP [45], AoANet [10], OSCAR [21].
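The sketch below illustrates the constrained self-attention mask and a greedy version of the iterative [MASK]-append decoding described above. It assumes the caption tokens come first in the sequence, followed by the tag/region context, and that the model returns vocabulary logits aligned with the input positions; the paper itself uses beam search with beam size 5.

```python
import torch

def seq2seq_attention_mask(num_caption, num_context):
    """Caption tokens attend causally to earlier caption tokens and fully to the
    tag/region context; context tokens do not attend to caption tokens."""
    n = num_caption + num_context
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_caption, :num_caption] = torch.tril(
        torch.ones(num_caption, num_caption, dtype=torch.bool))  # uni-directional caption block
    mask[:num_caption, num_caption:] = True                      # captions see all context tokens
    mask[num_caption:, num_caption:] = True                      # context attends only to itself
    return mask                                                   # True = attention allowed

@torch.no_grad()
def greedy_generate(model, tag_ids, region_feats, mask_id, sep_id, max_len=20):
    """Greedy stand-in for the [MASK]-append decoding loop."""
    generated = []
    for _ in range(max_len):
        inputs = torch.tensor(generated + [mask_id]).unsqueeze(0)
        logits = model(input_ids=inputs, tag_ids=tag_ids, visual_feats=region_feats)
        next_token = logits[0, len(generated)].argmax().item()   # prediction at the [MASK] slot
        if next_token == sep_id:
            break
        generated.append(next_token)
    return generated
```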
C.4 NoCaps

Novel Object Captioning (NoCaps) [1] extends the image captioning task to test a model's capability of describing novel objects from the Open Images dataset [17] that are not seen in the training corpus. Following the restriction guideline of NoCaps, we train OSCAR+ on COCO without initialization from pre-training, so no additional image-text pairs are used for training except COCO.

Since NoCaps images are collected from Open Images, we train an object detector on the Open Images training set and apply it to generate the tags. We conduct experiments directly from the BERT model, without pre-training, as required by the task guidelines. For the OSCAR+B model, we first train with cross-entropy loss and then perform CIDEr optimization. During inference, we use constrained beam search for decoding. We compare OSCAR+ with OSCAR [21] on this task.
C.5 Image-Text Retrieval
There are two sub-tasks: image retrieval and text retrieval, depending on which modality is used as the retrieval target. Both tasks calculate a similarity score between an image and a sentence, which relies heavily on the cross-modal representations.

Following OSCAR [21], we formulate retrieval as a binary classification problem: given an aligned image-text pair, we randomly select a different image or a different sentence to form an unaligned pair. The final representation of [CLS] is used as the input to the classifier to predict whether the given pair is aligned or not (see the sketch below). In the testing stage, the probability score is used to rank the given image-text pairs of a query.

Following [19], we report the top-K retrieval results on the COCO test sets. We adopt the widely used Karpathy split [15] on the COCO caption dataset [25] to conduct our experiments; each image is associated with human-generated captions. For the OSCAR+B and OSCAR+L models, we fine-tune with model-specific batch sizes and numbers of epochs; the initial learning rate decreases linearly. We use the validation set for parameter tuning. We compare with several existing methods, including DVSA [15], VSE++ [5], DPC [44], CAMP [39], SCAN [18], SCG [33], PFAN [38], Unicoder-VL [19], 12-in-1 [27], UNITER [4].
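A minimal sketch of the binary (aligned/unaligned) retrieval formulation and the ranking step follows. The model interface and the `retrieval_classifier` name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_step(model, caption_ids, tag_ids, region_feats, aligned_labels):
    """Training: classify whether a (caption, image) pair is aligned; unaligned pairs
    are built by swapping in a random image or a random sentence."""
    cls_repr = model(input_ids=caption_ids, tag_ids=tag_ids, visual_feats=region_feats)[:, 0]
    logits = model.retrieval_classifier(cls_repr).squeeze(-1)     # one logit per pair
    return F.binary_cross_entropy_with_logits(logits, aligned_labels.float())

@torch.no_grad()
def rank_candidates(model, query_caption, tag_ids_list, region_feats_list):
    """Testing: the alignment probability ranks candidate images for a text query."""
    scores = []
    for tag_ids, region_feats in zip(tag_ids_list, region_feats_list):
        cls_repr = model(input_ids=query_caption, tag_ids=tag_ids, visual_feats=region_feats)[:, 0]
        scores.append(torch.sigmoid(model.retrieval_classifier(cls_repr)).item())
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```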
Figure 5: Overall comparison of vocabulary effect on VQA. X-axis: how the R50-C4 model is trained. Y-axis: how the feature is extracted (grid or region features, different kinds of boxes used to extract region features). All region features have at most 50 regions. The top row "Mean" is the average over all rows, showing the overall quality of different vision models. The far-right column "Mean" is the average over all columns, showing the overall quality of different feature extraction methods.

C.6 NLVR2

Given a pair of images and a natural language statement, the goal of NLVR2 [35] is to determine whether the statement is true about the image pair. For NLVR2 fine-tuning, we first construct two input sequences, each containing the concatenation of the given sentence (the natural language description) and one image; the two [CLS] outputs from OSCAR+ are then concatenated as the joint input for a binary classifier, implemented by an MLP (see the sketch below). The OSCAR+B and OSCAR+L models are fine-tuned with a small set of learning rates and model-specific batch sizes and numbers of epochs.
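The following is a minimal sketch of the NLVR2 classifier described above; the hidden size and MLP depth are illustrative, not the exact released configuration.

```python
import torch
import torch.nn as nn

class NLVR2Head(nn.Module):
    """The [CLS] outputs of the two (sentence, image) sequences are concatenated
    and fed to an MLP for a true/false decision."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),   # statement is true / false
        )

    def forward(self, cls_left, cls_right):
        return self.mlp(torch.cat([cls_left, cls_right], dim=-1))

# Usage sketch: encode (sentence, image_left) and (sentence, image_right) separately
# with OSCAR+, take each [CLS] representation, then classify the pair:
#   logits = NLVR2Head()(cls_left, cls_right)
```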
D More on the Effect of Object-Attribute Vocabulary Size: Disentangling the Effects of Region Proposals and Model Weights

In Section 5.2, we demonstrate that the more diverse the visual concepts (object and attribute vocabularies) are, the better the visual region features for VL tasks. The better performance may come from the more diverse proposed regions from which the region features are extracted (see the comparison in Figure 1; "region" for short), or from the better model weights that can produce better high-dimensional region representations even for the same region ("model" for short). In this section, we disentangle the effects of region proposals and model weights by performing synthetic experiments in which we use region proposals from one vision model and model weights from another vision model. Our results show that both the region proposals and the model weights matter for VL tasks.

Figure 6: Left: comparison of object vocab and attribute vocab, averaged over all types of bounding boxes. Right: comparison of feature extraction methods, averaged over all types of pre-trained vision models. X-axis: the number of iterations at which the checkpoint is evaluated. Y-axis: the VQA accuracy on our vqa-dev.
D.1 Disentangling the effects of region proposals and model weights on R50-C4
As in Section 5.2, we train vision models v = Vision(Img) on different datasets, i.e., OpenImages with 500 object classes (OI:O500), standard ImageNet with 1K classes (ImageNet:O1000), Visual Genome with 317 object classes (VG-obj), Visual Genome with 1594 object classes (VG:O1594), VG with 1594 object classes and 524 attribute classes (VG:O1594A524), and pre-training on the merged 4 datasets followed by fine-tuning on VG:O1594A524 (4Sets → VG:O1594A524). For each model, we also try different ways to extract features: (1) region features from different models' proposed regions (same notations as the models), where each image has at most 50 region features, and (2) grid features, where we use all grid features (Grid-273) or randomly sample 50 grid features (Grid-50) per image. We present the results of these model-region cross-combination experiments in Figure 5. We also present the mean accuracy over all box types to obtain a robust ranking of different checkpoints, and the mean accuracy over all checkpoints to obtain a robust ranking of different box types. We have the following observations:

• The richer the object vocabulary is, the better for VQA: OI:O500 ≈ VG-obj:O317 < ImageNet:O1000 < VG:O1594.
• Attribute information is crucial to VL tasks: all features trained with attributes (columns with VG:O1594A524) are significantly better than those without attributes.
• Even for the small vision backbone R50, vision pre-training makes vision features better: the column "4Sets → VG:O1594A524" is better than all other columns. Notice that the vision pre-training improves both the region features and the grid features.
• It is crucial to extract features from semantically diverse regions: regions from OI and VG-obj are significantly worse than all other regions, and are even worse than grid features.
• Grid features perform worse than region features with regions proposed by VG models. Comparing the row "Grid-273" with the rows with VG regions, it seems possible to close this gap at the cost of more hardware memory and computation in the cross-modal VL models; training the "Grid-273" models is three times slower than training models with region features.

In Figure 6, instead of showing only one final number, we provide the mean evaluation curves along the training trajectories to demonstrate the ranking, as even more robust evidence. These results further confirm the conclusions drawn in Section 5.2.
D.2 Disentangling the effects of region proposals and model weights on the SoTA model

In Table 17, we alternate the combination of region proposals and model weights and evaluate them on VQA. As we can see, the improvement from using boxes from the R101-C4 model [2] to extract features with our X152-C4 model is much bigger than that from using boxes from our X152-C4 model to extract features with the R101-C4 model [2], indicating that pre-trained model weights are more important than regions. Inspired by this analysis, we propose class-agnostic NMS for region selection in the box head of the OD model (sketched below), which does not sacrifice any VQA performance but greatly improves the model's inference speed. This analysis also suggests that large-scale OD pre-training should improve performance for grid-feature based VL models, as supported by more results in Appendix F.

region \ model    R101-C4 [2]    VinVL
R101-C4 [2]       68.52 ±
VinVL             69.05 ±

Table 17: Ablation of region and model on VQA.
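The sketch below contrasts conventional per-class NMS with the class-agnostic variant mentioned above, using torchvision's NMS operators. Thresholds and the region cap are illustrative; the released detector's selection rules may differ.

```python
from torchvision.ops import nms, batched_nms

def select_regions_class_aware(boxes, scores, classes, iou_thresh=0.5, max_regions=50):
    """Conventional per-class NMS: a box only suppresses boxes of the same class."""
    keep = batched_nms(boxes, scores, classes, iou_thresh)
    return keep[:max_regions]

def select_regions_class_agnostic(boxes, scores, iou_thresh=0.5, max_regions=50):
    """Class-agnostic NMS: one NMS pass over all boxes regardless of predicted class,
    which avoids the per-class suppression and speeds up region selection."""
    keep = nms(boxes, scores, iou_thresh)
    return keep[:max_regions]
```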
E More on FPN and Comparison of C4 and FPN
E.1 Two reasons why FPN performs worse than C4 on VL tasks.
Our experimental results confirm the conclusion of [14] that the FPN model does not provide better region features for VL tasks than the C4 model (columns "R50C4" vs. "R50FPN" in Table 18). Our analysis reveals two reasons. First, all layers involved in feature extraction in the C4 model have been pre-trained on ImageNet, while the MLP head of FPN has not. It turns out that the VG dataset is still too small to train good visual features for VL tasks, so using ImageNet-pre-trained weights is beneficial. This can be verified by two experiments: (1) when the R50-C4 model is trained on VG with its box head randomly initialized ("VG-trained, R50C4 w/ box head randomly initialized"), the C4 model's performance is the same as FPN's; and (2) C4 and FPN achieve the same performance after vision pre-training on the 4 datasets (68.3 vs. 68.2). The second reason is the network architecture (CNN vs. MLP) of the box head in the OD model; a sketch contrasting the two heads is given after Table 18. The convolutional head in C4 has a better inductive bias for encoding visual information than the MLP head in FPN. This can be verified by the fact that, when vision features from randomly initialized models are used (row "Initial" in Table 18), R50-C4 performs much better than R50-FPN, indicating that the initial C4 features encode much more useful visual information than the initial FPN features. The "random" C4 features nearly match the features from the ImageNet pre-trained model (row "Initial", column "R50C4"), while the "random" FPN features are close to the performance without visual features as input (row "Initial", column "no image feature w").

              no image feature w   R50-C4 w/ box head randomly initialized   R50-FPN   R50-C4   4Sets → R50-FPN   4Sets → R50-C4
VG-trained    –                    67.6 ±
Initial       55.5 ±

Table 18: C4 vs. FPN architecture on VQA. The boxes used to extract features v and the tags q used in the VL model are the same as those used in OSCAR [21]. Row "Initial" means using the initialization model without VG training for feature extraction.
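The following PyTorch sketch contrasts the two box-head styles discussed above. Layer sizes and pooling details are illustrative examples of the standard Faster R-CNN designs, not the exact detector configurations used in this work.

```python
import torch.nn as nn

class C4BoxHead(nn.Module):
    """C4-style head: the res5 (conv5) stage of the ResNet, which inherits
    ImageNet-pre-trained weights, is applied on top of each pooled region."""
    def __init__(self, res5_block):
        super().__init__()
        self.res5 = res5_block                 # convolutional, ImageNet-initialized
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, region_feat_map):        # e.g. (N, C, 14, 14) pooled per region
        return self.pool(self.res5(region_feat_map)).flatten(1)

class FPNBoxHead(nn.Module):
    """FPN-style head: a 2-layer MLP that is randomly initialized rather than
    ImageNet-pre-trained."""
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Flatten(),
                                 nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, region_feat_map):        # e.g. (N, 256, 7, 7) pooled per region
        return self.mlp(region_feat_map)
```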
E.2 Effect of pooling methods in FPN on VQA performance

Different from C4 models, which extract region features from a single scale (the end of the C4 block), FPN models extract region features from multiple scales, chosen adaptively based on the area of the region. Therefore, there is some inhomogeneity in FPN's region features since they may come from different scales. In Figure 7, we show that this is not the cause of FPN's worse performance than C4 on the VQA task. More specifically, we experiment with 4 pooling methods for the FPN architecture: (1) adapt: the original FPN pooling method that extracts features adaptively from different scales; (2) max: extract features from all scales and then max-pool; (3) avg: extract features from all scales and then average-pool; (4) concat: extract features from all scales and then concatenate them (see the sketch below). We also train multiple FPN models on VG with these pooling methods, with or without pre-training on the Objects365 dataset. We experiment with all possible combinations (8 × 4 in total) of 8 vision models and 4 pooling methods on the VQA task. When there is a parameter dimension mismatch, e.g., non-concat FPN models used with the concat pooling method in VQA and vice versa, we initialize those parameters randomly with PyTorch's default initialization. The results in Figure 7 show that (1) there is no obvious difference between the pooling methods, with the default "adapt" and the "concat" methods performing slightly better than "max" and "avg"; and (2) unsurprisingly, the performance is significantly worse when there is a parameter dimension mismatch between the vision model and the VL-task feature extraction method, i.e., non-concat FPN models used with the concat pooling method in VQA and vice versa. These results show that the pooling method (whether in vision model training or in VL-task feature extraction) is not the root cause of FPN's worse performance than C4 on the VQA task.
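The sketch below illustrates the max/avg/concat variants of multi-scale FPN region pooling using torchvision's `roi_align`; the original "adapt" method instead selects a single level per box based on the box area. Strides, output size, and the box format are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

def multiscale_region_features(fpn_maps, boxes, strides, method="concat", output_size=7):
    """fpn_maps: list of FPN feature maps; strides: their strides;
    boxes: per-image box list in the format roi_align expects."""
    per_level = [
        roi_align(feat, boxes, output_size=output_size, spatial_scale=1.0 / s, aligned=True)
        for feat, s in zip(fpn_maps, strides)
    ]                                          # each: (num_boxes, C, 7, 7)
    stacked = torch.stack(per_level)           # (num_levels, num_boxes, C, 7, 7)
    if method == "max":
        return stacked.max(dim=0).values
    if method == "avg":
        return stacked.mean(dim=0)
    if method == "concat":
        return torch.cat(per_level, dim=1)     # channel dimension grows with num_levels
    raise ValueError(f"unknown pooling method: {method}")
```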
E.3 Large-scale object-detection pre-training of C4 and FPN models

In this paper, we have trained R50-C4, R50-FPN, R152-C4, and R152-FPN models on the merged object detection datasets described in Table 2. In Figure 8, we report the mAP of checkpoints from these 4 experiments on 4 validation sets: COCO with stuff (top left), Objects365 (top right), OpenImages (bottom left), and Visual Genome (1594 object classes, bottom right). For R50 models, the R50-FPN model is slightly better than C4 on COCO and Objects365 but slightly worse than C4 on Visual Genome. For R152 models, the R152-FPN model is consistently worse than the R152-C4 model on all 4 datasets. Therefore, we finally use the R152-C4 model for downstream vision-language tasks.

Figure 7: Pooling methods in FPN feature extraction are not the root cause of FPN's worse performance than C4. X-axis: the pooling method when extracting features for VL tasks. Y-axis: the pooling method (vision model) when pre-training the visual feature extraction model. All experiments use regions from the Bottom-up Top-down model [2]. Each combination is run twice with two random seeds, i.e., seed=42 on the left and seed=88 on the right. The results from the two random seeds are consistent.

Figure 8: Checkpoints' mAP on 4 validation sets: COCO with stuff (top left), Objects365 (top right), OpenImages (bottom left), and Visual Genome (1594 object classes, bottom right). For R50 models, the R50-FPN model is slightly better than C4 on COCO and Objects365 but slightly worse than C4 on Visual Genome. For R152 models, the R152-FPN model is consistently worse than the R152-C4 model on all 4 datasets.

                       ImageNet-5k [40]   4Sets   VG with Attr   4Sets → VG
grid feature (273)     68.3 ±                                    *
region feature (50)    67.7 ±
* The other run failed and thus there is no std for this experiment.

Table 19: Ablation study of X152 models on VQA. Vision models in the last three columns are trained with initialization from the ImageNet-5k checkpoint in the first column. All the region features are extracted with boxes proposed by our best X152-C4 model (pre-trained on 4Sets and fine-tuned on VG). By comparing the first column and the last column, we see that our proposed vision pre-training (first on the 4 sets and then on VG with attributes) improves performance for both the grid-feature based model and the region-feature based model. Since the X152 backbone is much larger than the R50 backbone in Figure 5, the larger model can make better use of the large pre-training datasets and thus has more significant improvements.
F Grid feature
In Table 19, we train grid-feature based and region-feature based X152 models for VQA, with the vision models pre-trained on different vision datasets, i.e., "ImageNet-5k" from [40], our merged OD dataset of 4 datasets described in Table 2 (4Sets), our VG dataset with 1594 object classes and 524 attribute classes (VG with Attr), and first 4Sets and then VG (4Sets → VG). Vision models in the last three cases are trained with initialization from the same ImageNet-5k checkpoint from [40]. All the region features are extracted with boxes proposed by our best X152-C4 model (pre-trained on 4Sets and fine-tuned on VG). By comparing "ImageNet-5k" and "4Sets → VG", we see that our proposed vision pre-training improves performance for both the grid-feature based model and the region-feature based model. Since the X152 backbone is much larger than the R50 backbone in Figure 5, the larger model makes better use of the large pre-training datasets and thus has more significant improvements. It is interesting to see that for grid-feature based models, the "ImageNet-5k" model performs better than the "4Sets" and "VG with Attr" models, while this is not the case for region-feature based models. This may indicate that how the vision model is trained has a different impact on downstream VL tasks depending on whether grid features or region features are used (a simplified sketch of the two feature extraction modes is given below).
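The following simplified sketch contrasts grid-feature and region-feature extraction from the same backbone feature map. The stride, the token subsampling, and the final pooling over each region are illustrative; in the C4 detector, region features additionally pass through the res5 head before pooling.

```python
import torch
from torchvision.ops import roi_align

def grid_features(backbone_map, num_tokens=273):
    """Grid features: flatten the backbone feature map into a sequence of location tokens
    (the Grid-50 setting in the paper subsamples 50 tokens at random; truncation here is
    only for illustration)."""
    b, c, h, w = backbone_map.shape
    tokens = backbone_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
    return tokens[:, :num_tokens]

def region_features(backbone_map, boxes, stride=16, output_size=7):
    """Region features: ROI-pool the same feature map over detector-proposed boxes
    (at most 50 regions per image in these experiments)."""
    pooled = roi_align(backbone_map, boxes, output_size=output_size,
                       spatial_scale=1.0 / stride, aligned=True)  # (num_boxes, C, 7, 7)
    return pooled.mean(dim=(2, 3))                                 # one vector per region
```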
G End-to-end inference efficiency
We report the end-to-end inference time of different VQA models on a Titan-X GPU and a Xeon E5 CPU in Table 20. For CPU evaluation, we force the inference to use only one CPU thread. The input image size is fixed, and we run inference with batch size 1 (one image-question pair per batch). We can see that (1) vision models dominate the inference time, especially for large models; (2) models based on grid features are faster than those based on region features; and (3) with our proposed fast inference trick, region-feature models are greatly sped up and their inference time can be brought to within 3 times that of grid-feature models on GPU. We find that on CPU with a single thread, our class-agnostic trick does not lead to time savings, because nearly all inference time is taken by the backbone and the C4 head, and the time spent in NMS operations is nearly negligible on CPU. A rough timing-harness sketch follows Table 20.
Model               R50-C4 Vision   R50-C4 VL   R101-C4 [2] Vision   R101-C4 [2] VL   X152-C4 Vision   X152-C4 VL
Grid-50             0.059 ±
Grid-273            0.056 ±
Object              0.373 ±
Object-eff          0.165 ±
Grid-50 (cpu)       1.943 ±
Grid-273 (cpu)      2.032 ±
Object (cpu)        11.808 ±
Object-eff (cpu)    11.729 ±

Table 20: Time cost of end-to-end inference on VQA. All cross-modal models are BERT-Base. On the SOTA number obtained with X152-C4 region features, the performance keeps the same with our fast inference trick (Object-eff).
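A minimal, hedged sketch of how per-component inference time could be measured in the spirit of Table 20 (batch size 1, single CPU thread for CPU numbers, GPU synchronization before and after timing). The component functions named in the usage comment are illustrative placeholders.

```python
import time
import torch

def time_component(fn, inputs, n_runs=100, device="cuda"):
    """Average wall-clock time of one call to `fn(*inputs)` over n_runs."""
    if device == "cpu":
        torch.set_num_threads(1)            # match the single-thread CPU setting
    for _ in range(5):                      # warm-up runs
        fn(*inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(*inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# Usage sketch: time the vision model and the VL model separately, e.g.
#   t_vision = time_component(detector.extract_region_features, (image,))
#   t_vl     = time_component(vqa_model, (question_ids, tags, region_feats))
```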