Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho Jie Lei Hao Tan Mohit Bansal
UNC Chapel Hill
{jmincho, jielei, haotan, mbansal}@cs.unc.edu

Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on answering questions that have rare answers. In addition, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, which achieves similar performance to separately optimized single-task models.
1. Introduction
Mirroring the success of the pretraining-finetuning paradigm with transformer language models (Devlin et al., 2019), recent vision-and-language transformers (Tan & Bansal (2019); Lu et al. (2019); Chen et al. (2020); Li et al. (2020b), inter alia) have also been adopted in a wide range of vision-and-language tasks. These models are first pretrained on a large image-text corpus (e.g., COCO Caption (Chen et al., 2015)), then finetuned on downstream tasks (e.g., visual question answering (Goyal et al., 2019) and referring expression comprehension (Mao et al., 2016)), and have outperformed many previous non-pretraining-finetuning methods. Our code will be publicly available at: https://github.com/j-min/VL-T5

Figure 1. Our unified framework for learning vision-and-language tasks. While existing methods require designing task-specific architectures for different tasks, our framework unifies them together as generating text labels conditioned on multimodal inputs. (The figure shows a single multimodal LM handling visual QA, visual grounding, and image-text matching from text inputs such as "vqa: what is the man jumping over?", "image text match: A cat is lying on a bed", "visual grounding: yellow fire hydrant", and "span prediction: A ...".)

For each pretraining or downstream task, existing vision-and-language transformers typically require designing task-specific, separately-parameterized architectures on top of the transformer encoder (e.g., a multi-label sigmoid classifier for visual question answering, and a softmax classifier for referring expression comprehension). However, the reasoning skills required by these tasks overlap significantly. Consider the example in Fig. 1: answering the question "What is the man jumping over?" and grounding an image region corresponding to the referring phrase "yellow fire hydrant" both require models to recognize the object "fire hydrant".

In addition, the labels for these tasks can be easily expressed in text. For instance, we can assign a region id (a visual sentinel token; see Sec. 3) to each image region and express the target of referring expression comprehension as that id in text.
2. Related Works
Vision-and-Language pretraining
Large-scale language pretraining with transformers (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Clark et al., 2020; Yang et al., 2019; Raffel et al., 2019) has achieved remarkable success on a spectrum of natural language understanding tasks (Rajpurkar et al., 2016; Zellers et al., 2018; Wang et al., 2018; Williams et al., 2017). Following this success, in the vision-and-language domain, image+text pretraining models (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2020; Huang et al., 2020; Li et al., 2020b; Cho et al., 2020; Radford et al., 2021) and video+text pretraining models (Sun et al., 2019b;a; Li et al., 2020a; Zhu & Yang, 2020; Miech et al., 2020) have also been shown to perform better than previous approaches without such pretraining (Yu et al., 2018a; Anderson et al., 2018; Kim et al., 2018; Yu et al., 2018b), on a wide range of discriminative tasks (Goyal et al., 2019; Hudson & Manning, 2019; Lei et al., 2018; Mao et al., 2016; Xu et al., 2016; Zhou et al., 2018) and generative tasks (Chen et al., 2015; Xu et al., 2016; Zhou et al., 2018). In this work, we focus on image+text tasks.

Existing image+text models encode an image as a set of bounding box region features (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2020) or grid features (Huang et al., 2020; Cho et al., 2020), analogous to text embeddings. Some works study using a better image encoder (Zhang et al., 2021) or using object tags as additional text input (Li et al., 2020b). These improvements on stronger visual and text input representations are orthogonal to ours, and we expect that our models can benefit from using these stronger input representations.

Common pretraining objectives include (i) multimodal masked language modeling (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2020; Huang et al., 2020; Li et al., 2020b): predict masked words conditioned on the input image and neighboring text context; and (ii) image-text matching (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2020; Huang et al., 2020): predict whether an input sentence matches the input image. Besides these two objectives, we also use visual question answering, visual grounding, and grounded captioning as additional tasks for pretraining.

For each pretraining and downstream task, existing approaches typically train a separately-parameterized task-specific architecture along with the transformer backbone. Though these pretrained models use shared encoders across multiple tasks, the output layers for the downstream tasks, e.g., visual question answering (Goyal et al., 2019; Hudson & Manning, 2019), are learned separately for each task.
Figure 2. An illustration of our VL-T5 and VL-BART architectures for the visual grounding task (example text input: "visual grounding: fire hydrant"). Instead of task-specific architectures, our models use text prefixes to adapt to different tasks. (a) shows the overall architecture (a bidirectional multimodal encoder with an autoregressive text decoder); the green block refers to visual embeddings. (b) shows the components of the visual embedding: RoI features, box coordinates, image ids, and region ids. Note that we reuse the text embeddings of visual sentinel tokens.
Unified frameworks
One line of work focuses on solving natural language processing tasks in a unified format, as question answering (Mccann et al., 2018), span prediction (Keskar et al., 2019), or text generation (Raffel et al., 2019; Brown et al., 2020; Khashabi et al., 2020). These unified frameworks provide efficient knowledge sharing among different tasks and make it easy to leverage pretrained language models. In relation to these works, we propose to unify previously separately modeled vision-and-language tasks in a single unified format, via text generation conditioned on multimodal inputs from the image and the textual context.
3. Model
We propose a new learning method that unifies vision-and-language problems as multimodal conditional text generation. We introduce VL-T5 and VL-BART based on two pretrained sequence-to-sequence transformer language models: T5_Base (Raffel et al., 2019) and BART_Base (Lewis et al., 2020). Specifically, we extend their text encoders to multimodal encoders by incorporating image region embeddings as additional input. The overall architecture of our framework is shown in Fig. 2. Since the architecture differences between VL-T5 and VL-BART are minor, we will use VL-T5 as an example to illustrate our framework in detail in the rest of this section.
We represent an input image v with n object regions from an object detector. Following previous works, we use the Faster R-CNN (Ren et al., 2015) trained on Visual Genome (Krishna et al., 2016) for object and attribute classification, provided by Anderson et al. (2018). Following Tan & Bansal (2019), we use n = 36 object regions per image. As shown in Fig. 2 (b), each image region is encoded as a sum of four types of features: (i) RoI (Region of Interest) object features; (ii) RoI bounding box coordinates; (iii) image ids (one id per input image, used to tell regions from different images apart); and (iv) region ids ∈ {1, . . . , n}. RoI features and bounding box coordinates are encoded with a linear layer, while image ids and region ids are encoded with learned embeddings (Devlin et al., 2019). Image ids take effect only when multiple images are given to the model (e.g., in NLVR2 (Suhr et al., 2019), models take two input images). The final visual embeddings are denoted as e^v = {e^v_1, . . . , e^v_n}. These embeddings have the same dimension as the text embeddings that we will discuss next.

Instead of designing task-specific architectures, we add different prefixes to the original input text to adapt to different tasks, as shown in Table 1 (top). We show the prefixes for different tasks in Table 1.

Input text x is tokenized as {x_1, . . . , x_{|x|}} and encoded as learned embeddings e^x = {e^x_1, . . . , e^x_{|x|}}. The embedding parameters are shared by the encoder, decoder, and language modeling head (Press & Wolf, 2017). Since the attention layers are permutation-invariant, BART learns a positional embedding (Vaswani et al., 2017; Devlin et al., 2019) for each absolute text position and adds it to the text token embeddings. In contrast, T5 adds a relative position bias to each self-attention layer (Shaw et al., 2018). Our models follow the positional embedding configurations of their text backbone models. At the same time, we use bounding box coordinates to provide position information for visual embeddings, similar to absolute position embeddings for text.

Note that since we use simple prefixes (e.g., "vqa:" for the VQA task), it is likely that engineering the text prompts (Gao et al., 2020) would improve the accuracy of our methods. As this is not the focus of this paper, we leave it as future work.

In addition to the original vocabulary of T5 and BART, we introduce visual sentinel tokens that correspond to the image regions.

Figure 3. Comparison between existing vision-and-language transformers and our framework on visual question answering and referring expression comprehension (visual grounding) tasks. While existing methods use task-specific architectures and objectives, our models use a language modeling head and maximum likelihood estimation on label text for all tasks. (Panels: (a) a VQA head with sigmoid multi-label classification over top-K answer scores on "[CLS] What is the man jumping over?"; (b) a region scoring head with softmax classification for "[CLS] fire hydrant"; existing methods need N heads for N tasks. (c, d) Ours: the same VL transformer with an LM head generates "fire hydrant" for "vqa: What is the man jumping over?" and a region id for "visual grounding: fire hydrant".)
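To make the visual embedding of Fig. 2 (b) concrete, below is a minimal PyTorch sketch of how the four feature types could be combined. The module and dimension names (VisualEmbedding, roi_feat_dim, etc.) are our own illustrative choices and not the released implementation.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Sketch: encode each region as a sum of RoI features, box coordinates,
    image-id embeddings, and region-id embeddings (cf. Fig. 2(b))."""
    def __init__(self, d_model=768, roi_feat_dim=2048, n_regions=36, n_images=2):
        super().__init__()
        self.feat_proj = nn.Linear(roi_feat_dim, d_model)      # (i) RoI object features
        self.box_proj = nn.Linear(4, d_model)                   # (ii) box coordinates (x1, y1, x2, y2)
        self.img_id_emb = nn.Embedding(n_images, d_model)       # (iii) image ids
        self.region_id_emb = nn.Embedding(n_regions, d_model)   # (iv) region ids (0-based lookup)

    def forward(self, roi_feats, boxes, img_ids, region_ids):
        # roi_feats: (B, n, roi_feat_dim), boxes: (B, n, 4), img_ids / region_ids: (B, n) long tensors
        return (self.feat_proj(roi_feats) + self.box_proj(boxes)
                + self.img_id_emb(img_ids) + self.region_id_emb(region_ids))

# Usage: 36 regions from a single image
B, n = 2, 36
emb = VisualEmbedding()
e_v = emb(torch.randn(B, n, 2048), torch.rand(B, n, 4),
          torch.zeros(B, n, dtype=torch.long),
          torch.arange(n).expand(B, n))
print(e_v.shape)  # torch.Size([2, 36, 768])
```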
For both pretraining tasks (Sec. 4) and downstream tasks (Sec. 5), we train our model parameters θ by minimizing the negative log-likelihood of label text y tokens given input text x and image v (Eq. 1):

L^{GEN}_θ = − \sum_{j=1}^{|y|} \log P_θ(y_j | y_{<j}, x, v)    (1)

Table 1. Input-output formats for pretraining (Sec. 4) and downstream tasks (Sec. 5). ^a We use different prefixes ("vqa:", "gqa:", "visual7w:") for questions from different datasets. ^b NLVR2 takes two images as visual input; for brevity, we only show one here.

Tasks | Input image | Input text | Target text
Pretraining tasks (Sec. 4):
Multimodal LM (VL-T5) | | span prediction: A ... | ...
Grounded captioning | | caption region: ... | [caption]
Downstream tasks (Sec. 5):
VQA | | vqa: [Q] | [A]
GQA | | gqa: [Q] | [A]
NLVR2^b | | nlvr: [text] | true/false
VCR Q→A | | vcr qa: question: [Q] answer: [A] | true/false
VCR Q→AR | | vcr qar: question: [Q] answer: [A] rationale: [R] | true/false
RefCOCOg | | visual grounding: [referring expression] | [region id]
COCO captioning | | caption: | [caption]
COCO captioning (w/ object tags) | | caption with tags: [Tag1 Tag2 ..] | [caption]
Multi30K En-De translation | | translate English to German: [English text] | [German text]

In this subsection, we compare our unified framework with existing vision-and-language transformers on two popular tasks: visual question answering (Goyal et al., 2019) and referring expression comprehension (Mao et al., 2016). We illustrate this comparison in Fig. 3.

The visual question answering task requires a model to answer a question about a given context image. As shown in Fig. 3 (a), existing methods (Tan & Bansal, 2019; Lu et al., 2019; Chen et al., 2020) typically formulate this task as a discriminative task, i.e., multi-label classification over a predefined set of K frequent answer candidates {a_1, . . . , a_K}. Specifically, they introduce a multi-layer perceptron (MLP) sigmoid scorer head on top of h^x_{[CLS]} to learn the likelihood of each answer candidate being correct: P^{VQA}_θ(correct | a, x, v) = sigmoid(MLP^{VQA}(h^x_{[CLS]})). This VQA scorer head is trained end-to-end with the transformer encoder through a binary cross-entropy loss, using the VQA score (Goyal et al., 2019) as a soft target distribution (Eq. 2):

L^{VQA}_θ = − \sum_{k=1}^{K} score(a_k, x, v) \log P^{VQA}_θ(correct | a_k, x, v)    (2)

Here, score(a, x, v) = min(#(humans that provided a as the answer) × 0.3, 1).

Referring expression comprehension requires models to localize a target region in an image that is described by a given referring expression. Previous methods tackle this task as multi-class (Chen et al., 2020) or binary (Lu et al., 2019) classification over image regions. For example, UNITER (Chen et al., 2020) introduces a region scoring head (an MLP layer) on top of the output representations of regions, as shown in Fig. 3 (b). This region scoring head is jointly trained with the encoder by minimizing the negative log-likelihood of the target region r*:

L^{REF}_θ = − \log P^{REF}_θ(r* | x, v)    (3)

In contrast to existing methods that develop task-specific architectures and objectives (e.g., Eq. 2, 3), our unified framework is free from extra model designs for new tasks. As shown in Fig. 3 (c, d) and Table 1, we formulate the task labels as corresponding text, and we learn these different tasks by predicting label text with the same language modeling objective (Eq. 1).

4. Pretraining

In this section, we describe how we pretrain our VL-T5 and VL-BART models (Sec. 3). We start with the details of the pretraining data and illustrate how we formulate diverse vision-and-language pretraining tasks as multimodal conditional text generation.
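Since every task in Table 1 reduces to the same objective, a rough illustration of Eq. 1 with a HuggingFace T5 backbone is given below: the text embeddings and the visual embeddings from the previous sketch are concatenated and fed through inputs_embeds. This is a simplified stand-in for the actual VL-T5 implementation (no extra special tokens and no special handling of relative positions for visual tokens).

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generative_loss(input_text, target_text, visual_embeds):
    """Eq. 1: -sum_j log P(y_j | y_<j, x, v), averaged over target tokens."""
    enc = tokenizer(input_text, return_tensors="pt")
    text_embeds = model.shared(enc.input_ids)               # (1, |x|, d_model)
    # Multimodal encoder input: text embeddings followed by visual embeddings
    inputs_embeds = torch.cat([text_embeds, visual_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    out = model(inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                labels=labels)                                # teacher forcing on label text
    return out.loss

# Example: VQA expressed as text generation (cf. Table 1)
visual_embeds = torch.randn(1, 36, model.config.d_model)     # placeholder for e^v
loss = generative_loss("vqa: what is the man jumping over?", "fire hydrant", visual_embeds)
loss.backward()
```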
We aggregate pretraining data from MS COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (VG; Krishna et al. (2016)) images. The captioning data from these two datasets are used in the multimodal language modeling task. The COCO captions are also used in the image-text matching task to learn cross-modal alignment. Besides the captions, we also use three visual question answering datasets (VQA v2.0 (Goyal et al., 2019), GQA balanced version (Hudson & Manning, 2019), and Visual7W (Zhu et al., 2016)) as in Tan & Bansal (2019), but only use them in the visual question answering task. Details of these pretraining tasks are in Sec. 4.2.

Overall, our pretraining dataset contains 9.18M image-text pairs on 180K distinct images. We carefully split our pretraining data to avoid any intersection between our training data and the evaluation sets of downstream tasks (e.g., COCO Captioning, RefCOCOg). In this process, around 10K images are excluded from the training sets of COCO and VG. We then take the COCO Karpathy val split (Karpathy & Fei-Fei, 2015) with 5,000 images as our validation set to monitor pretraining performance.

We pretrain our models under a multi-task setup with diverse pretraining tasks, including multimodal language modeling, visual question answering, image-text matching, visual grounding, and grounded captioning. Table 1 shows input and output examples of our pretraining tasks. The training data for each of these tasks are summarized in Table 11. In the rest of this section, we explain these tasks in detail. (Note that existing vision-and-language transformers are trained with different datasets and computational budgets, so their results may not be directly comparable to each other; we show the number of their pretraining images in Table 2.)

Multimodal language modeling. We follow Raffel et al. (2019) and Lewis et al. (2020) to construct the language modeling pretraining task. The basic idea is to recover the masked input text based on both visual and textual context (while the original methods are based only on textual context). For VL-T5, we mask 15% of input text tokens and replace contiguous text spans with sentinel tokens.

Visual question answering. Similar to Tan & Bansal (2019), we include visual question answering in our pretraining tasks. The task requires models to answer a question about a given context image. While previous methods (Tan & Bansal, 2019; Lu et al., 2019; Chen et al., 2020) tackle the task as classification over predefined answer candidates (illustrated in Fig. 3), we directly generate answers in their original text format.

Image-text matching. In this task, the model needs to verify whether an input text corresponds to the given input image. We consider an image and its captions as positive pairs. With a probability of 50%, we create a negative pair by randomly sampling another image from the training set and taking its caption. The model then predicts the correspondence between the input image and text with "true" or "false", as shown in Table 1.

Visual grounding. Besides the above image-text matching task, we also develop an object-text matching task to endow the model with grounding ability, which is required in several tasks (e.g., referring expression comprehension and VCR). Previous vision-and-language transformers (Tan & Bansal, 2019; Lu et al., 2019; Chen et al., 2020) predict the properties of masked objects to indirectly learn object-text alignment. To explicitly learn this important grounding ability, we give the model a region description and let it predict the id of the related object region. With the help of the visual sentinel tokens, this prediction is done simply by generating the token of the corresponding region.
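To make the VL-T5 masking recipe above concrete, here is a small sketch of T5-style span corruption applied to a caption. The sentinel-token strings and the 15% masking ratio follow the description above; the helper name and the exact span-sampling rule (single-token spans that are merged when adjacent) are simplifications of ours.

```python
import random

def corrupt_spans(tokens, mask_ratio=0.15, seed=0):
    """T5-style span corruption: replace ~15% of tokens with sentinel tokens
    and emit the removed spans as the generation target."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inp, tgt, sid, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            sentinel = f"<extra_id_{sid}>"      # T5 sentinel token
            inp.append(sentinel)
            tgt.append(sentinel)
            while i in masked:                  # merge contiguous masked tokens into one span
                tgt.append(tokens[i])
                i += 1
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    return " ".join(["span prediction:"] + inp), " ".join(tgt)

src = "a man is jumping over a yellow fire hydrant".split()
print(corrupt_spans(src))
# one token is replaced by <extra_id_0> in the input; the target generates the removed span
```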
Grounded captioning. To teach the model object-level information, we also use an inverse task of the aforementioned visual grounding, called grounded captioning. As shown in Table 1, given a visual sentinel token (which indicates a region in the image) as text input, the model is asked to generate a corresponding textual description of this input region. Our grounded captioning task can be seen as a simplified dense captioning (Johnson et al., 2016) task, where only one object is asked to be described at a time.

For both VL-T5 and VL-BART, it takes 4 days for 30-epoch pretraining with mixed precision training (Narang et al., 2018) on 4 RTX 2080 Ti GPUs (4 x 11GB). We use batch sizes of 320 and 600 for VL-T5 and VL-BART, respectively. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with (β1, β2) = (0.9, 0.999) and learning rate 1e-4 with a 5% linear warmup schedule. We use the VQA validation score to track the progress of pretraining. Our code is based on PyTorch (Paszke et al., 2017) and Huggingface Transformers (Wolf et al., 2019).

5. Downstream Tasks and Results

In this section, we evaluate our generative architectures VL-T5 and VL-BART on a diverse set of 7 downstream tasks, including two image question answering tasks (Goyal et al., 2019; Hudson & Manning, 2019), referring expression comprehension (Mao et al., 2016), natural language visual reasoning (Suhr et al., 2019), visual commonsense reasoning (Zellers et al., 2019), image captioning (Chen et al., 2015), and multimodal machine translation (Elliott et al., 2016). We summarize the statistics of the datasets used in downstream tasks in Table 12. We compare our models with strong vision-and-language pretrained transformers: LXMERT (Tan & Bansal, 2019), ViLBERT (Lu et al., 2019), UNITER (Chen et al., 2020), Unified VLP (Zhou et al., 2020), Oscar (Li et al., 2020b), and XGPT (Xia et al., 2020).

Table 2. Single model performance on downstream tasks. Note that the baseline models adopt task-specific objectives and architectures, whereas our models tackle all tasks, including discriminative tasks (e.g., RefCOCOg), as text generation with a single architecture and objective. ⋆ See our discussion in Sec. 5.3. † Submitted to the leaderboard (the result will be updated).

Method | #Pretrain images | VQA test-std Acc | GQA test-std Acc | NLVR2 test-P Acc | RefCOCOg test-d Acc | VCR Q→AR test Acc | COCO Cap Karpathy-test CIDEr | Multi30K En-De test2018 BLEU
LXMERT | 180K | 72.5 | 60.3 | 74.5 | - | - | - | -
ViLBERT | 3M | 70.9 | - | - | - | 54.8 | - | -
UNITER_Base | 4M | 72.9 | - | 77.9 | 74.5 | 58.2 | - | -
Unified VLP | 3M | 70.7 | - | - | - | - | 117.7 | -
Oscar_Base | 4M | 73.4 | 61.6 | 78.4 | - | - | 123.7 | -
XGPT | 3M | - | - | - | - | - | 120.1 | -
MeMAD | - | - | - | - | - | - | - | 38.5
VL-T5 | 180K | 70.3 | 60.8 | 73.6 | 71.3 | 58.9 | 116.5 | 38.6
VL-BART | 180K | 71.3 | 60.5 | 70.3 | 22.4⋆ | - | | †

As summarized in Table 2, our models achieve similar results to most of the baselines. We highlight that our unified generative modeling approach (with the input-output format shown in Table 1) is close to the performance of the heavily developed task-specific discriminative models. Note that different vision-and-language transformers are trained with different setups (e.g., pretraining data, objectives, feature extractor, hyperparameters, computational budget), thus the results might not be directly comparable. For example, UNITER and Oscar use around 4M extra images from SBU captions (Ordonez et al., 2011) and Conceptual Captions (Sharma et al., 2018) for pretraining.
The closest baseline to our models is LXMERT, as both are pretrained on the same datasets and use the same visual features. See Table 10 in the appendix for a detailed comparison between baselines and our models. We tune the hyperparameters based on the validation set of each downstream task; see Table 13 for details. In the rest of this section, we provide a detailed comparison between our models and the baselines, as well as elaborating the details of the evaluated tasks.

The visual question answering task requires models to answer a question about a given context image. In this work, we evaluate our models on the VQA (Goyal et al., 2019) and GQA (Hudson & Manning, 2019) datasets. Each question in VQA and GQA typically has multiple answers; at each training step, we randomly sample one answer from the ground-truth answer set and use it as the text generation target.

Table 2 compares our models VL-T5 and VL-BART with existing methods on the visual question answering tasks VQA and GQA. For both tasks, our models achieve comparable performance to existing approaches. Note that in addition to the Visual Genome and COCO Captions data that we use, UNITER and Oscar also use around 4M extra images from SBU captions (Ordonez et al., 2011), Conceptual Captions (Sharma et al., 2018), and Flickr30K (Young et al., 2014) (Oscar only) for pretraining. Chen et al. (2020) have shown that adding these extra data during pretraining improves model performance across various downstream tasks.

Table 3. VQA Karpathy-test split accuracy using generative and discriminative methods. We break down the questions into two subsets in terms of whether the best-scoring answer a* for each question is included in the top-K answer candidates A_topk. In-domain: a* ∈ A_topk; Out-of-domain: a* ∉ A_topk.

Method | In-domain | Out-of-domain | Overall
Discriminative: UNITER_Base | | |
Discriminative: VL-T5 | 70.2 | 7.1 | 66.4
Discriminative: VL-BART | 69.4 | 7.0 | 65.7
Generative: VL-T5 | 71.4 | 13.1 | 67.9
Generative: VL-BART | 72.1 | |

Generative vs. Discriminative models. Modern approaches (Tan & Bansal, 2019; Lu et al., 2019; Chen et al., 2020; Zhou et al., 2020; Li et al., 2020b) are discriminative models that tackle visual question answering tasks as multi-label classification over a predefined set of answer candidates. For example, LXMERT and UNITER train a two-layer MLP classifier with sigmoid activation and use soft target scores on the 3,129 answers that appear 9 or more times in the VQA train2014 split (see Sec. 3.4 and Fig. 3). While this strategy has achieved strong performance, it may not generalize to a real-world scenario where answers may not exist in this fixed answer set. In contrast, our models VL-T5 and VL-BART directly generate answers as free-form text, allowing a truly open-ended setup.

To quantitatively compare existing discriminative approaches and our generative approaches, we evaluate their performance on questions with rare answers, i.e., out-of-domain answers (for discriminative approaches).

Figure 4. Different encoding settings for NLVR2: (a) Triplet jointly encodes (Image 1, Image 2, text) with one model; (b) Pair encodes (Image 1, text) and (Image 2, text) separately and concatenates the two embeddings; (c) Pair-biattn adds multi-head attention between the two pair encodings; each setting outputs true/false. Pair and Pair-biattn approximately double the computational cost over Triplet, which our models are based on.
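The text-generation target construction for VQA described above (randomly sampling one of the ground-truth answers at each training step) might look like the following sketch; the example-dict layout is an assumption of ours rather than the official VQA loader format.

```python
import random

def vqa_example_to_text(example, rng=random):
    """Turn one VQA annotation into an (input text, target text) pair.
    `example` is assumed to look like:
      {"question": "what is the man jumping over?",
       "answers": ["fire hydrant", "fire hydrant", "hydrant"]}"""
    input_text = "vqa: " + example["question"]        # task prefix from Table 1
    target_text = rng.choice(example["answers"])      # resampled at every training step
    return input_text, target_text

ex = {"question": "what is the man jumping over?",
      "answers": ["fire hydrant", "fire hydrant", "hydrant"]}
print(vqa_example_to_text(ex))
```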
We break down the VQA questions in the Karpathy-test split in terms of whether the best-scoring answer a* for each question is included in the top-K (K = 3,129) answer candidates A_topk. The questions with a* ∉ A_topk can be treated as out-of-domain questions, since their best-scoring answers are rare answers and have been excluded in standard discriminative approaches. After this split, the in-domain subset contains 24,722 questions, and the out-of-domain subset contains 1,558 questions. For discriminative baselines, we introduce a two-layer MLP classifier with sigmoid on top of the decoder representation of the start-of-sequence token.

Table 4. NLVR2 performance comparison under different encoding settings. Note that Triplet takes lower computational cost than Pair and Pair-biattn (see also Fig. 4).

Method | Setting | dev | test-P
UNITER_Base | Triplet | 73.0 | 73.9
UNITER_Base | Pair | 75.9 | 75.8
UNITER_Base | Pair-biattn | |
LXMERT | Pair | 74.9 | 74.5
Oscar_Base | Pair | 78.1 | 78.4
VL-T5 | Triplet | 74.6 | 73.6
VL-BART | Triplet | 71.7 | 70.3

Table 5. Referring expression comprehension performance comparison on RefCOCOg.

Method | V&L PT | val-d | test-d
MAttNet | | 66.9 | 67.3
UNITER_Base | ✓ | |
VL-T5 | | 63.4 | 62.9
VL-T5 | ✓ | |

Table 4 compares models under different encoding settings (see Fig. 4): (a) Triplet: joint encoding of the image pair and text; (b) Pair: the concatenation of the individual embeddings of each image-text pair; (c) Pair-biattn: bidirectional attention added to Pair. UNITER shows that one can improve performance with a more complex encoding setting, i.e., Pair-biattn achieves better performance than Pair, which is again better than the simplest Triplet. Note that both the Pair and the Pair-biattn settings approximately double the computational cost compared to that of the Triplet setting. While there is a gap between our models and the baselines in the Pair and Pair-biattn settings, VL-T5 shows comparable performance to UNITER in the Triplet setting.

Referring expression comprehension is a visual grounding task: given a natural language referring expression (e.g., 'the car on the left') describing an object in an image, a model needs to correctly localize the object in this image (when object candidates are given, the task is reduced to choosing an object from a set of candidates). In this work, we evaluate models on the RefCOCOg (Mao et al., 2016) dataset. Similar to the visual grounding pretraining task in Sec. 4, we give our model a referring phrase and candidate regions, and the model predicts the id of the region described by the phrase.

Table 6. VCR accuracy. Stage 1 refers to the original vision-and-language generic-domain pretraining and Stage 2 refers to the in-domain pretraining on VCR.

Method | Stage 1 | Stage 2 | VCR val Q→A | VCR val QA→R | VCR val Q→AR | VCR test Q→A | VCR test QA→R | VCR test Q→AR
ViLBERT | | | 69.3 | 71.0 | 49.5 | - | - | -
ViLBERT | ✓ | | | | | | |
VL-BART | | | 65.4 | 68.1 | 44.6 | - | - | -
VL-BART | ✓ | ✓ | | | | | |
Table 7. COCO captioning scores on the Karpathy-test split. All models are trained with cross-entropy loss. PT and FT refer to the use of object tags during pretraining and finetuning, respectively.

Method | V&L PT | Object tags | B | C | M | S
Oscar | ✓ | PT+FT | | | |
VL-T5 | ✓ | FT | 34.5 | 116.5 | 28.7 | 21.9
VL-BART | ✓ | FT | 35.1 | 116.6 | 28.7 | 21.5
Oscar | ✓ | | | | |
Unified VLP | ✓ | | | | |
VL-BART | ✓ | | | | |

On the VCR val split, comparing to the model variants that adopt different pretraining strategies, we find that both Stage 1 generic-domain pretraining and Stage 2 in-domain pretraining help improve the VCR task performance, which is consistent with the findings in UNITER.

We evaluate automatic caption generation performance on the MS COCO Caption dataset (Chen et al., 2015). We use the Karpathy split (Karpathy & Fei-Fei, 2015), which re-splits train2014 and val2014 COCO images (Lin et al., 2014) into 113,287 / 5,000 / 5,000 images for train / validation / test. While some methods use reinforcement learning-based optimization on CIDEr, we only compare with methods using cross-entropy loss. Note that image captioning is the only task in our experiments that does not have meaningful textual context, which results in a notable difference between pretraining and finetuning w.r.t. the input format. Inspired by Oscar (Li et al., 2020b), we also experimented with using object tags as additional text inputs during finetuning. We use BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), METEOR (Banerjee & Lavie, 2005), and SPICE (Anderson et al., 2016) as evaluation metrics, using the COCOEvalCap implementation (https://github.com/tylin/coco-caption).

In Table 7, we compare our models with baselines in different settings: use of vision-and-language pretraining and use of object tags as additional text inputs. With and without vision-and-language pretraining, our models show comparable performance to the baselines. Since the use of object tags requires significant extra computation, we only use them for finetuning. Using tags gives comparable or slightly improved performance for both models, and the improvement is significant (2.5 CIDEr) for VL-BART. We expect tag-augmented pretraining like Oscar would further boost the performance of our models.

We evaluate English-to-German multimodal machine translation performance on the Multi30K dataset (Elliott et al., 2016), which has been used in the WMT multimodal machine translation shared tasks (Barrault et al., 2018). The Multi30K dataset is collected by translating the Flickr30K (Young et al., 2014) dataset (in English) into paired German sentences. We report BLEU scores using the SacreBLEU (Post, 2018) implementation (https://github.com/mjpost/sacrebleu), which produces official WMT BLEU scores. Since no pretrained vision-and-language transformers have been evaluated on the multimodal machine translation task yet, we compare our models with state-of-the-art transformer models: Multimodal self-attention (MSA) (Yao & Wan, 2020) and MeMAD (Grönroos et al., 2018).

Table 8 shows that our T5-based models outperform all single-model baselines on all three test splits of Multi30K, without strong data augmentation (e.g., back-translation, captions from an external image captioning model). Our vision-and-language models outperform their original text-only backbones, but we did not observe notable improvement with vision-and-language pretraining. Vision-and-language pretraining degraded the performance of VL-BART.

Table 8. Multi30K En-De multimodal translation BLEU scores. † and * refer to data augmentation and ensemble, respectively. We use gray color for the ensemble model as it is not fairly comparable.

Method | V&L PT | test2016 | test2017 | test2018
MSA | | 38.7 | - | -
MeMAD | | 38.9 | 32.0 | -
MSA† | | | |
VL-T5 | ✓ | | |
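For reference, computing the reported translation BLEU with the SacreBLEU Python API looks roughly like this; the hypothesis and reference strings are dummy data, and the exact flags used in our experiments are not shown here.

```python
import sacrebleu

# System outputs and references for an En-De test split (dummy examples)
hypotheses = ["ein Mann springt über einen gelben Hydranten ."]
references = [["ein Mann springt über einen gelben Hydranten ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, comparable to official WMT scores
```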
We conjecture the reasons are that (i) the source text in Multi30K contains sufficient information for machine translation without visual inputs, as discussed in Caglayan et al. (2019), and (ii) the visual grounding ability, which VL-BART failed to learn (Sec. 5.3), is important for the multimodal machine translation task.

Table 9. Multi-task finetuning results on VQA and RefCOCOg. With a single set of parameters, our multi-task model achieves similar performance to separately optimized single-task models.

Method | Finetuning tasks | VQA Karpathy-test Acc | RefCOCOg test Acc
VL-T5 | VQA | 67.9 | -
VL-T5 | RefCOCOg | - | 71.3
VL-T5 | VQA + RefCOCOg | 67.0 | 70.1

While our framework has unified the architecture for different downstream tasks, the parameters are separately optimized. To see whether we can go one step further, we train a single model that tackles different kinds of tasks at once with the same set of weights. Specifically, we finetune VL-T5 on two different tasks, VQA (Goyal et al., 2019) and RefCOCOg (Mao et al., 2016), in a multi-task learning setup. At each finetuning step, we sample a mini-batch of examples from one of the two tasks. The existing vision-and-language multi-task learning method (Lu et al., 2020) trains multiple task-specific heads and only shares the backbone encoder, as illustrated in Fig. 3. With the help of our unified encoder-decoder architecture and generative pretraining, we build a unified multi-task model, where only a single shared language modeling head is learned for both tasks.

Table 9 shows the multi-task and single-task finetuning results of VL-T5 on VQA and RefCOCOg. On both tasks, our multi-task model achieves similar performance compared to the single-task models, while using a single set of weights shared by both tasks. Since we did not use advanced multi-task learning strategies such as oversampling or dynamic stop-and-go (Lu et al., 2020), we expect the multi-task performance of our model to be further improved with these orthogonal techniques.

6. Conclusion

In this work, we proposed VL-T5 and VL-BART, which tackle vision-and-language tasks with a unified text generation objective. Experiments show that VL-T5 and VL-BART can achieve comparable performance to state-of-the-art vision-and-language transformers on diverse vision-and-language tasks without hand-crafted architectures and objectives. In particular, we demonstrate that our generative approach is better suited for open-ended visual question answering. In addition, we also showed that it is possible to train two different tasks simultaneously using the same architecture with the same weights while not losing much performance; it would be interesting future work to further explore this direction by adding even more tasks.

Acknowledgments

We thank Hyounghun Kim, Zineng Tang, Swarnadeep Saha, and Xiang Zhou for their comments and suggestions. This work was supported by NSF-CAREER Award 1846185, ARO-YIP Award W911NF-18-1-0336, DARPA MCS Grant N66001-19-2-4031, Google Focused Research Award, and Bloomberg Data Science Ph.D. Fellowship. The views, opinions, and/or findings contained in this article are those of the authors and not of the funding agency.

References

Anderson, P., Fernando, B., Johnson, M., and Gould, S. SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, 2016.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR, 2018.
URL http://arxiv.org/abs/1707.07998.
Banerjee, S. and Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL Workshop, 2005.
Barrault, L., Bougares, F., Specia, L., Lala, C., Elliott, D., and Frank, S. Findings of the Third Shared Task on Multimodal Machine Translation. In WMT, pp. 304-323, 2018. doi: 10.18653/v1/w18-6402.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot Learners. In NeurIPS, 2020. URL http://arxiv.org/abs/2005.14165.
Caglayan, O., Madhyastha, P., Specia, L., and Barrault, L. Probing the Need for Visual Context in Multimodal Machine Translation. In NAACL, 2019. doi: 10.18653/v1/n19-1422.
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollar, P., and Zitnick, C. L. Microsoft COCO Captions: Data Collection and Evaluation Server. 2015. URL http://arxiv.org/abs/1504.00325.
Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In ECCV, 2020. URL https://arxiv.org/abs/1909.11740.
Cho, J., Lu, J., Schwenk, D., Hajishirzi, H., and Kembhavi, A. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers. In EMNLP, 2020. doi: 10.18653/v1/2020.emnlp-main.707.
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019. URL http://arxiv.org/abs/1810.04805.
Elliott, D., Frank, S., Sima'an, K., and Specia, L. Multi30K: Multilingual English-German Image Descriptions. In ACL Workshop, pp. 70-74, 2016.
Gao, T., Fisch, A., and Chen, D. Making Pre-trained Language Models Better Few-shot Learners. 2020. URL http://arxiv.org/abs/2012.15723.
Goyal, Y., Khot, T., Agrawal, A., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 2019. doi: 10.1007/s11263-018-1116-0.
Grönroos, S.-A., Huet, B., Kurimo, M., Laaksonen, J., Merialdo, B., Pham, P., Sjöberg, M., Sulubacak, U., Tiedemann, J., Troncy, R., and Vázquez, R. The MeMAD Submission to the WMT18 Multimodal Translation Task. In WMT, volume 2, pp. 609-617, 2018.
He, K., Gkioxari, G., Dollar, P., and Girshick, R. Mask R-CNN. In ICCV, 2017.
Huang, L., Wang, W., Chen, J., and Wei, X. Y. Attention on attention for image captioning. In ICCV, pp. 4633-4642, 2019. doi: 10.1109/ICCV.2019.00473.
Huang, Z., Zeng, Z., Liu, B., Fu, D., and Fu, J. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. 2020. URL http://arxiv.org/abs/2004.00849.
Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. doi: 10.1109/CVPR.2019.00686.
Johnson, J., Karpathy, A., and Fei-Fei, L. DenseCap: Fully Convolutional Localization Networks for Dense Captioning.
In CVPR, 2016.
Karpathy, A. and Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015. doi: 10.1109/TPAMI.2016.2598339.
Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
Keskar, N. S., McCann, B., Xiong, C., and Socher, R. Unifying Question Answering and Text Classification via Span Extraction. 2019. URL http://arxiv.org/abs/1904.09286.
Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., and Hajishirzi, H. UnifiedQA: Crossing Format Boundaries with a Single QA System. In Findings of EMNLP, 2020.
Kim, J.-H., Jun, J., and Zhang, B.-T. Bilinear Attention Networks. In NeurIPS, pp. 1-12, 2018.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 2016. doi: 10.1007/s11263-016-0981-7.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2020.
Lei, J., Yu, L., Bansal, M., and Berg, T. L. TVQA: Localized, compositional video question answering. In EMNLP, 2018.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL, 2020.
Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In EMNLP, 2020a.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., and Gao, J. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV, 2020b. URL http://arxiv.org/abs/2004.06165.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common Objects in Context. In ECCV, 2014. doi: 10.1007/978-3-319-10602-1_48.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. In ICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS, 2019. URL http://arxiv.org/abs/1908.02265.
Lu, J., Goswami, V., Rohrbach, M., Parikh, D., and Lee, S. 12-in-1: Multi-Task Vision and Language Representation Learning. In CVPR, 2020. URL http://arxiv.org/abs/1912.02315.
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., and Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. In CVPR, 2016.
Mccann, B., Keskar, N. S., Xiong, C., and Socher, R. The Natural Language Decathlon: Multitask Learning as Question Answering. 2018.
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos.
In CVPR, 2020.
Narang, S., Diamos, G., Elsen, E., Micikevicius, P., Alben, J., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed Precision Training. In ICLR, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ.
Nogueira, R., Jiang, Z., Pradeep, R., and Lin, J. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of EMNLP, pp. 1-8, 2020.
Ordonez, V., Kulkarni, G., and Berg, T. L. Im2Text: Describing Images Using 1 Million Captioned Photographs. In NIPS, 2011.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, 2002. doi: 10.3115/1073083.1073135.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Workshop, 2017. URL https://openreview.net/pdf?id=BJJsrmfCZ.
Post, M. A Call for Clarity in Reporting BLEU Scores. In WMT, pp. 186-191, 2018.
Press, O. and Wolf, L. Using the Output Embedding to Improve Language Models. In EACL, 2017.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21:1-67, 2019. URL http://arxiv.org/abs/1910.10683.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015. URL https://arxiv.org/abs/1506.01497.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
Shaw, P., Uszkoreit, J., and Vaswani, A. Self-Attention with Relative Position Representations. In NAACL, 2018.
Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML, 2019. URL http://arxiv.org/abs/1905.02450.
Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi, Y. A Corpus for Reasoning About Natural Language Grounded in Photographs. In ACL, 2019. URL http://arxiv.org/abs/1811.00491.
Sun, C., Baradel, F., Murphy, K., and Schmid, C. Contrastive Bidirectional Transformer for Temporal Representation Learning. 2019a. URL http://arxiv.org/abs/1906.05743.
Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. VideoBERT: A Joint Model for Video and Language Representation Learning. In ICCV, 2019b. URL http://arxiv.org/abs/1904.01766.
Tan, H. and Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP, 2019. URL http://arxiv.org/abs/1908.07490.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. In NIPS, 2017. URL https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Vedantam, R., Zitnick, C. L., and Parikh, D. CIDEr: Consensus-based Image Description Evaluation.
In CVPR, 2015. URL http://arxiv.org/abs/1411.5726.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2018.
Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2017.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace's Transformers: State-of-the-art Natural Language Processing. 2019. URL http://arxiv.org/abs/1910.03771.
Xia, Q., Huang, H., Duan, N., Zhang, D., and Ji, L. XGPT: Cross-modal Generative Pre-Training for Image Captioning. 2020. URL https://arxiv.org/abs/2003.01473.
Xu, J., Mei, T., Yao, T., and Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
Yao, S. and Wan, X. Multimodal Transformer for Multimodal Machine Translation. In ACL, pp. 4346-4350, 2020. doi: 10.18653/v1/2020.acl-main.400.
Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL, 2:67-78, 2014. URL http://nlp.cs.illinois.edu/HockenmaierGroup/Papers/DenotationGraph.pdf.
Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L. Modeling context in referring expressions. In ECCV, 2016.
Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., and Berg, T. L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In CVPR, 2018a. URL https://arxiv.org/abs/1801.08186.
Yu, Y., Kim, J., and Kim, G. A joint sequence fusion model for video question answering and retrieval. In ECCV, 2018b.
Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP, 2018.
Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. In CVPR, 2019. URL http://arxiv.org/abs/1811.10830.
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. VinVL: Making Visual Representations Matter in Vision-Language Models. 2021.
Zhou, L., Xu, C., and Corso, J. J. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J. J., and Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI, 2020. URL http://arxiv.org/abs/1909.11059.
Zhu, L. and Yang, Y. ActBERT: Learning global-local video-text representations. In CVPR, 2020.
Zhu, Y., Groth, O., Bernstein, M., and Fei-Fei, L. Visual7W: Grounded Question Answering in Images. In CVPR, 2016. doi: 10.1109/CVPR.2016.540. URL http://arxiv.org/abs/1511.03416.

Table 10. Summary of baseline vision-and-language transformers. ^a Since not all models report exact parameter numbers, we provide rough estimates compared to BERT_Base (86M; noted as P), where word embedding parameters are excluded. ^b LXMERT and XGPT are not initialized from pretrained language models. LXMERT authors found pretraining from scratch was more effective than initialization from BERT_Base in their experiments.
XGPT uses text pretraining on Conceptual Captions and COCO captions with Masked LM (Devlin et al., 2019) and Masked Seq2Seq (Song et al., 2019) objectives before V&L pretraining. ^c LXMERT (text+visual+cross-modal) and ViLBERT (cross-modal) use dual-stream encoders. ViLBERT uses 768/1024-dim hidden states for text/visual streams respectively. XGPT uses an AoA module (Huang et al., 2019) as visual encoder. The rest of the models use single-stream encoders. ^d For generation tasks, Unified VLP and Oscar use a causal mask and reuse the encoder as decoder, similar to UniLM. ^e XGPT also uses shared parameters for encoder and decoder, but its decoder is right-shifted and predicts next tokens. ^f Unified VLP is initialized from UniLM, which is initialized from BERT_Large. ^g Oscar uses object tags as additional text inputs.

Method | V&L pretraining dataset | #Images | Architecture | Text init | #Layers | #Params^a | Hidden dim | #Regions | Positional emb
LXMERT | COCO+VG | 180K | Encoder^c | -^b | | 2P | 768 | 36 | absolute
ViLBERT | CC | 3M | Encoder^c | BERT_Base | | | 768/1024^c | ~36 | absolute
UNITER_Base | CC+SBU+COCO+VG | 4M | Encoder | BERT_Base | 12 | P | 768 | 10~100 | absolute
Unified VLP | CC | 3M | Encoder^d | UniLM^f | 12 | P | 768 | 100 | absolute
Oscar_Base | CC+SBU+COCO+VG+Flickr30K | 4M | Encoder^d | BERT_Base | 12 | P | 768 | 50^g | absolute
XGPT | CC+COCO | 3M | Enc-Dec^e | -^b,c | 12+12 | P | 768 | 100 | absolute
VL-T5 | COCO+VG | 180K | Enc-Dec | T5_Base | | | | |
VL-BART | COCO+VG | 180K | Enc-Dec | BART_Base | | | | |

Table 11. Pretraining tasks used in our vision-and-language pretraining. The images that have any intersection with the evaluation sets of downstream tasks (e.g., COCO caption, RefCOCOg) and the held-out validation set for pretraining are excluded.

Task | Image source | Text source

A. Summary of Vision-and-Language Transformers

In Table 10, we compare the baseline vision-and-language transformers and our VL-T5 and VL-BART in detail.

B. Pretraining and Downstream Task Details

In Table 11 and Table 12, we show the detailed statistics of our pretraining and downstream datasets and tasks. In Table 13, we show the hyperparameters that we used in our pretraining and downstream task experiments.
When compared tothe strong discriminative baseline UNITER Base (pretrainedwith 4M extra images), our generative models still showcomparable overall performance while significantly outper-form it on the out-of-domain subset. The task of NLVR (Suhr et al., 2019) is to determinewhether a natural language statement is true about two givenimages. To apply our model to this task, we concatenateregion features from two images and use different image idembeddings to disambiguate which image the features arefrom. Then our model learns to generate “true” or “false”.This is similar to Triplet (Fig. 4(a)) setting described inUNITER (Chen et al., 2020) The best-scoring answer is the ground-truth answer that hasthe best score according to the VQA scoring system.