[PDF] Unifying Vision-and-Language Tasks via Text Generation

Abstract

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. Also, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving similar performance to separately optimized single-task models. Our code is publicly available at: this https URL

Full PDF

UUnifying Vision-and-Language Tasks via Text Generation

Jaemin Cho Jie Lei Hao Tan Mohit Bansal

UNC Chapel Hill { jmincho,jielei,haotan,mbansal } @cs.unc.edu Abstract

Existing methods for vision-and-language learn-ing typically require designing task-speciﬁc archi-tectures and objectives for each task. For example,a multi-label answer classiﬁer for visual questionanswering, a region scorer for referring expres-sion comprehension, and a language decoder forimage captioning, etc. To alleviate these hassles,in this work, we propose a uniﬁed framework thatlearns different tasks in a single architecture withthe same language modeling objective, i.e., mul-timodal conditional text generation, where ourmodels learn to generate labels in text based onthe visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual ques-tion answering, referring expression comprehen-sion, visual commonsense reasoning, most ofwhich have been previously modeled as discrimi-native tasks, our generative approach (with a sin-gle uniﬁed architecture) reaches comparable per-formance to recent task-speciﬁc state-of-the-artvision-and-language models. Moreover, our gen-erative approach shows better generalization abil-ity on answering questions that have rare answers.In addition, we show that our framework allowsmulti-task learning in a single architecture with asingle set of parameters, which achieves similarperformance to separately optimized single-taskmodels.

1. Introduction

Mirroring the success of the pretraining-ﬁnetuning paradigmwith transformer language models (Devlin et al., 2019),recent vision-and-language transformers (Tan & Bansal(2019); Lu et al. (2019); Chen et al. (2020); Li et al. (2020b), inter alia ) have also been adopted in a wide range of vision-and-language tasks. These models are ﬁrst pretrained onthe large image-text corpus (e.g., COCO Caption (Chenet al., 2015)), then ﬁnetuned on downstream tasks (e.g., vi- Our code will be publicly available at: https://github.com/j-min/VL-T5 “vqa: what is the man jumping over?”“image text match: A cat is lying on a bed”“visual grounding: yellow fire hydrant”“span prediction: A is over fire hydrant” “ man jumping”“fire hydrant”“”“false”

Text Input Text Output

MultimodalLMVisual QAVisualGroundingImage-Text Matching

Figure 1.

Our uniﬁed framework for learning vision-and-languagetasks. While existing methods require designing task-speciﬁc ar-chitectures for different tasks, our framework uniﬁes them togetheras generating text labels conditioned on multimodal inputs. sual question answering (Goyal et al., 2019) and referringexpression comprehension (Mao et al., 2016)), which outper-formed many previous non-pretraining-ﬁnetuning methods.For each pretraining or downstream task, existing vision-and-language transformers typically require designing task-speciﬁc, separately-parameterized architectures on top ofthe transformer encoder (e.g., multi-label sigmoid classiﬁerfor visual question answering, and softmax classiﬁer for re-ferring expression comprehension). However, the reasoningskills required by these tasks overlap signiﬁcantly. Considerthe example in Fig. 1, answering the question “What is theman jumping over?” and grounding an image region corre-sponding to the referring phrase “yellow ﬁre hydrant”. Bothrequire models to recognize the object “ﬁre hydrant”.In addition, the labels for these tasks can be easily ex-pressed in text. For instance, we can assign a region id(e.g., “ ”, a special text token) to a speciﬁc region a r X i v : . [ c s . C L ] F e b nifying Vision-and-Language Tasks via Text Generation in the image, and then the referring expression comprehen-sion task can be expressed as generating the correct regionid. For visual question answering, the labels are already intext, although existing approaches (Anderson et al., 2018;Tan & Bansal, 2019; Chen et al., 2020) tackle the task aslearning a multi-label classiﬁer over a ﬁxed set of frequentanswers (See Fig. 3).Hence, in order to alleviate these hassles of designing task-speciﬁc architectures, we propose a uniﬁed framework forvision-and-language learning via generating labels in text .Speciﬁcally, we extend off-the-shelf pretrained languagemodels T5 (Raffel et al., 2019) and BART (Lewis et al.,2020) with visual understanding ability, named ‘VL-T5’and ‘VL-BART’. In contrast to existing vision-and-languagetransformers which train different architectures for differentpretraining and downstream tasks, our models tackle all thetasks with the same language modeling head. To learn anew task, we can simply rewrite its input and output in text ,without the need of adding extra parameters or designingnew architectures and objectives. This enables our models toadapt to different tasks easily. In addition, we can leveragethe text generation ability of pretrained language modelswhen making predictions . This is especially helpful whenwe answer open-ended questions that require non-trivialanswers, where discriminative methods can only answerfrom a predeﬁned set of frequent candidates, while ourmodels can generate open-ended natural language answers.To evaluate the effectiveness of our generative modelingapproach, we compare our models against recent vision-and-language transformers on a diverse set of 7 down-stream benchmarks, including visual question answeringon VQA (Goyal et al., 2019) and GQA (Hudson & Man-ning, 2019), referring expression comprehension on Ref-COCOg (Mao et al., 2016), natural language visual rea-soning on NLVR (Suhr et al., 2019), visual commonsensereasoning on VCR (Zellers et al., 2019), image captioningon COCO Caption (Chen et al., 2015), and multimodal ma-chine translation on Multi30K (Elliott et al., 2016). Ouruniﬁed generative method reaches comparable performanceto recent state-of-the-art vision-and-language pretrainingmethods. This is especially interesting because we use thesame uniﬁed language modeling architecture with the samemaximum likelihood estimation (MLE) objective for allthe tasks, while existing approaches use heavily-engineeredtask-speciﬁc architectures and objective functions. In addi-tion, we found that our generative models have better gener-alization ability compared to the discriminative versions onthe rare-answer scenario in visual question answering, whenground truth answers for given questions are rarely seen dur-ing training. Finally, we also experiment with our uniﬁedframework which allows for multi-task learning via a singleset of parameters. Our multi-task model is jointly ﬁnetunedwith visual question answering and referring comprehen- sion expression tasks and achieves similar performance tosingle-task models ﬁnetuned with separate parameters.

2. Related Works

Vision-and-Language pretraining

Large-scale lan-guage pretraining with transformers (Vaswani et al., 2017;Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Clarket al., 2020; Yang et al., 2019; Raffel et al., 2019) haveachieved remarkable success for a spectrum of naturallanguage understanding tasks (Rajpurkar et al., 2016;Zellers et al., 2018; Wang et al., 2018; Williams et al.,2017). Following this success, in the vision-and-languagedomain, image+text pretraining models (Lu et al., 2019;Tan & Bansal, 2019; Chen et al., 2020; Huang et al., 2020;Li et al., 2020b; Cho et al., 2020; Radford et al., 2021)and video+text pretraining models (Sun et al., 2019b;a; Liet al., 2020a; Zhu & Yang, 2020; Miech et al., 2020) havealso shown to perform better than previous approaches (Yuet al., 2018a; Anderson et al., 2018; Kim et al., 2018; Yuet al., 2018b) without such pretraining, in a wide rangeof discriminative tasks (Goyal et al., 2019; Hudson &Manning, 2019; Lei et al., 2018; Mao et al., 2016; Xu et al.,2016; Zhou et al., 2018) and generative tasks (Chen et al.,2015; Xu et al., 2016; Zhou et al., 2018). In this work, wefocus on image+text tasks.Existing image+text models encode an image as a set ofbounding box region features (Lu et al., 2019; Tan & Bansal,2019; Chen et al., 2020) or grid features (Huang et al., 2020;Cho et al., 2020), analogous to text embeddings. Someworks study using better image encoder (Zhang et al., 2021),or using objects tags as additional text input (Li et al.,2020b). These improvements on stronger visual and textinput representations are orthogonal to ours. We expectthat our models can beneﬁt from using these stronger inputrepresentations.Common pretraining objectives include ( i ) multimodalmasked language modeling (Lu et al., 2019; Tan & Bansal,2019; Chen et al., 2020; Huang et al., 2020; Li et al., 2020b):predict masked words conditioned on the input image andneighboring text context; and ( ii ) image text matching (Luet al., 2019; Tan & Bansal, 2019; Chen et al., 2020; Huanget al., 2020): predict whether an input sentence matcheswith the input image. Besides these two objectives, wealso use visual question answering, visual grounding, andgrounded captioning as additional tasks for pretraining.For each pretraining and downstream task, existing ap-proaches typically train separately-parameterized task-speciﬁc architecture along with the transformer backbone.Though these pretrained models use shared encoders acrossmultiple tasks, the output layers for the downstream tasks,e.g., visual question answering (Goyal et al., 2019; Hudson nifying Vision-and-Language Tasks via Text Generation AutoregressiveText DecoderBidirectionalMultimodal Encoder

RoIfeaturesBox coordinatesImage idsRegion idsPrefix +++ + +1 ++1 ++1 visual grounding : fire hydrant (a) Our vision-and-language framework (b) Visual embedding

Visual embedding

Figure 2.

An illustration of our VL-T5 and VL-BART architectures for visual grounding task. Instead of task-speciﬁc architectures, ourmodels use text preﬁxes to adapt to different tasks. The green block in (a) refers to visual embeddings. (b) shows the components of visualembedding. Note that we reuse the text embeddings of visual sentinel tokens (ex. ) as region id embeddings, which allows ourmodels to tackle many discriminative vision-language tasks as text generation, including visual grounding. & Manning, 2019; Lei et al., 2018), referring expressioncomprehension (Kazemzadeh et al., 2014; Yu et al., 2016;Mao et al., 2016), and image captioning (Chen et al., 2015),are signiﬁcantly different. For example, UNITER (Chenet al., 2020) uses a multi-label sigmoid classiﬁer to regresssoft answer scores for visual question answering (Goyalet al., 2019), while using softmax classiﬁer head on ob-ject representations for referring expression comprehen-sion (Mao et al., 2016). In contrast, our method casts allpretraining and downstream tasks as text generation andalways uses the same uniﬁed architecture, i.e., languagemodeling head, alleviating the hassle of manually designingtask-speciﬁc architectures.

Uniﬁed frameworks

One line of work focus on solvingnatural language processing tasks in a uniﬁed format, asquestion answering (Mccann et al., 2018), span prediction(Keskar et al., 2019), or text generation (Raffel et al., 2019;Brown et al., 2020; Khashabi et al., 2020). These uniﬁedframeworks provide efﬁcient knowledge sharing among dif-ferent tasks and make it easy to leverage pretrained languagemodels. In relation to these works, we propose to unify pre-viously separately modeled vision-and-language tasks in asingle uniﬁed format, via text generation, conditioned onmultimodal inputs from the image and the textual context.

3. Model

We propose a new learning method that uniﬁes vision-and-language problems as multimodal conditional text gener-ation. We introduce VL-T5 and VL-BART based on twopretrained sequence-to-sequence transformer language mod-els: T5

Base (Raffel et al., 2019) and BART

Base (Lewiset al., 2020). Speciﬁcally, we extend their text encodersto multimodal encoders by incorporating image region em-beddings as additional input. The overall architecture ofour framework is shown in Fig. 2. Since the architecturedifferences between VL-T5 and VL-BART are minor, we will use VL-T5 as an example to illustrate our framework indetails in the rest of this section.

We represent an input image v with n object regions fromobject detector. Following previous works, we use the FasterR-CNN (Ren et al., 2015) trained on Visual Genome (Kr-ishna et al., 2016) for object and attribute classiﬁcation, pro-vided by Anderson et al. (2018). Following Tan & Bansal(2019), we use n =36 object regions per image.As shown in Fig. 2 ( b ), each image region is encoded as asum of four types of features: ( i ) RoI (Region of Interest)object features; ( ii ) RoI bounding box coordinates; ( iii )image ids ∈ { , } ; and ( iv ) region ids ∈ { , . . . , n } . RoIfeatures and bounding box coordinates are encoded witha linear layer, while image ids and region ids are encodedwith learned embeddings (Devlin et al., 2019). Image ids areused to discriminate regions from different images, and takeeffect only when multiple images are given to the model(e.g., in NLVR (Suhr et al., 2019), models take two in-put images). The ﬁnal visual embeddings are denoted as e v = { e v , . . . , e vn } . These embeddings have the same dimen-sion as the text embeddings that we will discuss next. Instead of designing task-speciﬁc architectures, we add dif-ferent preﬁxes to the original input text to adapt to differenttasks, as shown in Table. 1 ( top ). We show the preﬁxes fordifferent tasks in Table 1. Input text x is tokenized as { x , . . . , x | x | } and encoded aslearned embedding e x = { e x , . . . , e x | x | } . The embedding Note that since we use simple preﬁxes (e.g., “ vqa: ” for VQAtask), it is likely that engineering in text prompts (Gao et al., 2020)would improve the accuracy of our methods. As this is not thefocus of this paper, we leave it as future works. nifying Vision-and-Language Tasks via Text Generation parameters are shared by the encoder, decoder, and languagemodeling head (Press & Wolf, 2017). Since the attentionlayers are permutation-invariant, BART learns positionalembedding (Vaswani et al., 2017; Devlin et al., 2019) foreach absolute text position and adds them to the text tokenembeddings. In contrast, T5 adds relative position bias toeach self-attention layer (Shaw et al., 2018). Our modelsfollow the positional embedding conﬁgurations of their textbackbone models. At the same time, we use boundingbox coordinates to provide position information for visualembeddings, similar to absolute position embeddings fortext.In addition to the original vocabulary of T5 and BART,we introduce visual sentinel tokens { , . . . , } , which corresponds to image regions. As il-lustrated in Fig. 2, we use the text embeddings of visualsentinel tokens as region id embeddings in Sec. 3.1. Theembedding sharing enables our model to build the corre-spondence among query text, label text, and objects, whichare useful in the grounding tasks (e.g., visual grounding andgrounded captioning pretraining tasks in Sec. 4, referringexpression comprehension in Sec. 5.3). We use transformer encoder-decoder architecture (Vaswaniet al., 2017) to encode visual and text inputs and generatelabel text. Our bidirectional multimodal encoder is a stackof m transformer blocks. Each transformer block consistsof a self-attention layer and a fully-conencted layer withadditional residual connections. As shown in Fig. 2 ( a ),the encoder takes the concatenation of text embeddings andvisual embeddings as input and outputs their contextualizedjoint representations h = { h x , . . . , h x | x | , h v , . . . , h vn } =Enc( e x , e v ) . As discussed in Sec. 3.2, T5 model takesrelative positional embeddings by adding relative positionbias. Since there is no natural order between image regions,we set relative position bias over visual inputs as zeros forVL-T5. At the same time, the features of RoI bounding boxcoordinates in the visual embeddings serve as positionalembedding for the visual object regions.Our decoder is another stack of m transformer blocks sim-ilar to the multimodal encoder. However, each block hasan additional cross attention layer. For autoregressive gen-eration, the decoder’s inputs are shifted right, and a start-of-sequence token ~~is added in the beginning and usedas the decoder’s initial input token y . Likewise, an end-of-sequence token~~ is appended to the end of decoderoutputs to indicate the completion of generation. The de-coder iteratively attends to previously generated tokens y

Region scoringheadVQAhead

Existing methods: N heads for N tasksOurs: LM head for all tasks [CLS] What is the man jumping over? [CLS] fire hydrantTop-K answer scores

Sigmoid

Multi-labelClassification

Softmax “fire hydrant”

Classification

VL TransformerVL Transformer vqa:

What is the man jumping over? visual grounding: fire hydrant

VL Transformer “fire hydrant” “”

LanguageModeling (a) (b)(c) (d)

Figure 3.

Comparison between existing vision-language trans-formers and our framework on visual question answering and re-ferring expression comprehension (visual grounding) tasks. Whileexisting methods use task-speciﬁc architectures and objectives,our models use language modeling head and maximum likelihoodestimation on label text for all tasks. (Sec. 4) and downstream tasks (Sec. 5), we train our modelparameters θ by minimizing the negative log-likelihood oflabel text y tokens given input text x and image v (Eq. 1). L GEN θ = − | y | (cid:88) j =1 log P θ ( y j | y

In this subsection, we compare our uniﬁed framework withexisting vision-and-language transformers on two populartasks: visual question answering (Goyal et al., 2019) andreferring expression comprehension (Mao et al., 2016). Weillustrate this comparison in Fig. 3.Visual question answering task requires a model to an-swer a question to a given context image. As shownin Fig.3 ( a ), existing methods (Tan & Bansal, 2019; Luet al., 2019; Chen et al., 2020) typically formulate thistask as a discriminative task, i.e., multi-label classiﬁca-tion over a predeﬁned set of K frequent answer candidates { a , . . . , a K } . Speciﬁcally, they introduce a multi-layerperceptron (MLP) sigmoid scorer head on top of h x [CLS] tolearn the likelihood of each answer candidate being correct: P VQA θ ( correct | a, x, v ) = sigmoid ( MLP

VQA ( h x [CLS] )) . ThisVQA scorer head is trained end-to-end with the transformerencoder through a binary cross-entropy loss, by using VQA nifying Vision-and-Language Tasks via Text Generation Table 1.

Input-output formats for pretraining (Sec. 4) and downstream tasks (Sec. 5). a We use different preﬁxes (“vqa:”, “gqa:”,“visual7w:”) for questions from different datasets. b NLVR takes two images as visual input, for brevity, we only show one here. Tasks Input image Input text Target text

Pretraning tasks (Sec. 4)

Multimodal LM (VL-T5) span prediction: A is over a ﬁre hydrant. man jumpingMultimodal LM (VL-BART) denoise: A is over a ﬁre hydrant. A man is jumping over a ﬁre hydrant a Visual question answering vqa: what is the color of the man’s shirt? blueImage-text matching image text match: A man with blue shirt is jumping over ﬁre hydrant. trueVisual grounding visual grounding: yellow ﬁre hydrant

Grounded captioning caption region: yellow ﬁre hydrant

Downstream tasks (Sec. 5)

VQA vqa: [Q] [A]

GQA gqa: [Q] [A] b NLVR nlvr: [text] true/falseVCR Q → A vcr qa: question [Q] answer: [A] true/falseVCR Q → AR vcr qar: question [Q] answer: [A] rationale: [R] true/falseRefCOCOg visual grounding: [referring expression] [region id]

COCO captioning caption: [caption]

COCO captioning (w/ object tags) caption with tags: [Tag1 Tag2 ..] [caption]

Multi30K En-De translation translate English to German: [English text] [German text] score (Goyal et al., 2019) as soft target distribution (Eq. 2). L VQA θ = − K (cid:88) k =1 score ( a k , x, v ) log P VQA θ ( correct | a k , x, v ) (2)For referring expression comprehension, it requires modelsto localize a target region in an image that is describedby a given referring expression. Previous methods tacklethis task as multi-class (Chen et al., 2020) or binary (Luet al., 2019) classiﬁcation over image regions. For example,UNITER (Chen et al., 2020) introduces a region scoringhead (an MLP layer) on top of the output representations ofregions, as shown in Fig. 3( b ). This region scoring head isjointly trained with the encoder by minimizing the negativelog-likelihood of the target region r ∗ : L REF θ = − log P REF θ ( r ∗ | x, v ) (3)In contrast to existing methods that develop task-speciﬁcarchitectures and objectives (e.g., Eq. 2, 3), our uniﬁedframework is free from extra model designs for new tasks.As shown in Fig. 3 ( c,d ) and Table 1, we formulate thetask labels to corresponding text, and we learn these differ-ent tasks by predicting label text with the same languagemodeling objective (Eq. 1).

4. Pretraining

In this section, we describe how we pretrain our VL-T5and VL-BART models (Sec. 3). We start with the detailsof the pretraining data and illustrate how we formulate di-verse vision-and-language pretraining tasks as multimodalconditional text generation. score ( a, x, v )= min( ( a as the answer) ∗ . , We aggregate pretraining data from MS COCO (Lin et al.,2014; Chen et al., 2015) and Visual Genome (VG; Krishnaet al. (2016)) images . The captioning data from thesetwo datasets are used in the multimodal language modelingtask. The COCO captions are also used in the image-textmatching task to learn cross-modal alignment. Besidesthe captions, we also use three visual question answeringdatasets (VQA v2.0 Goyal et al. (2019), GQA balancedversion (Hudson & Manning, 2019), and Visual7W (Zhuet al., 2016)) as in Tan & Bansal (2019), but only usedthem in the visual question answering task. Details of thesepretraining tasks are in Sec. 4.2.Overall, our pretraining dataset contains 9.18M image-textpair on 180K distinct images. We carefully split our pre-training data to avoid any intersection between our trainingdata and the evaluation set of downstream tasks (e.g., COCOCaptioning, RefCOCOg). In this process, around 10K im-ages are excluded from the training sets of COCO and VG.We then take the COCO Karpathy val split (Karpathy &Fei-Fei, 2015) with 5,000 images as our validation set tomonitor pretraining performance.

We pretrain our models under a multi-task setup with diversepretraining tasks, including multimodal language model-ing, visual question answering, image-text matching, visualgrounding, and grounded captioning. Table 1 shows inputand output examples of our pretraining tasks. The trainingdata for each of these tasks are summarized in Table 11. Inthe rest of this section, we explain these tasks in detail. Existing vision-and-language transformers are trained withdifferent datasets and computational budget, thus their results maynot be directly comparable to each other. We show the number oftheir pretraining images in Table 2. nifying Vision-and-Language Tasks via Text Generation

Multimodal language modeling

We follow Raffel et al.(2019) and Lewis et al. (2020) to construct the languagemodeling pretraining task. The basic idea is to recover themasked input text based on both visual and textual context(while original methods are only based on textual context).For VL-T5, we mask 15% of input text tokens and replacecontiguous text span with sentinel tokens (e.g., ).Then we let the model predict the masked text spans. ForVL-BART, we mask 30% of input text tokens with tokens, and let the model reconstruct the entire original text.See Table 1 for examples.

Visual question answering

Similar to Tan & Bansal(2019), we include visual question answering in our pre-training tasks. The task requires models to answer a ques-tion to a given context image. While previous methods (Tan& Bansal, 2019; Lu et al., 2019; Chen et al., 2020) tacklethe task as classiﬁcation over predeﬁned answer candidates(illustrate in Fig. 3), we directly generate answers in theiroriginal text format.

Image-text matching

In this task, the model needs toverify whether an input text corresponds to the given inputimage. We consider the image and its captions as positivepairs. With a probability of 50%, we create a negativepair by randomly sampling another image from trainingset and taking its caption. The model then predicts thecorrespondence between the input image and text with “true”or “false” as shown in Table 1. Visual grounding

Besides the above image-text match-ing task, we also develop an object-text matching task toendow the model with grounding ability, which is requiredin several tasks (e.g., referring expression comprehensionand VCR). Previous vision-and-language transformers (Tan& Bansal, 2019; Lu et al., 2019; Chen et al., 2020) predictthe property of masked objects to indirectly learn object-textalignment. To explicitly learn this important grounding abil-ity, we give the model a region description and let it predictthe id of the related object region. With the help of the visualsentinel token (e.g., in Table 1), this task ﬁts natu-rally into our text generation objective. We make the regiondescriptions from the predictions of the object detector thatwe use for visual embeddings (see Sec. 3.1). Concretely, wesample an object region out of n region predictions. Thenwe concatenate its object name and attribute in their originaltext format (e.g., attribute: “yellow” + object: “ﬁre hydrant” → “yellow ﬁre hydrant”). This approach does not need We only use captions from COCO for this task, since manyshort captions from VG and visual questions are nondistinctivedescription of an image (e.g., ‘what is in the image?’). https://github.com/peteanderson80/bottom-up-attention/blob/master/data/genome/1600-400-20 extra annotation and could be extended to images withoutdense annotations (e.g., COCO images). Grounded captioning

To teach the model with object-level information, we also use an inverse task of the afore-mentioned visual grounding, called grounded captioning.As shown in Table 1, given a visual sentinel token (whichindicates a region in the image) as text input, the model isasked to generate a corresponding textual description of thisinput region. For both VL-T5 and VL-BART, it takes 4 days for 30-epochpretraining with mixed precision training (Narang et al.,2018) on 4 RTX 2080 Ti GPUs (4 x 11GB). We use batchsize 320 and 600 for VL-T5 and VL-BART, respectively.We use AdamW optimizer (Loshchilov & Hutter, 2019) with ( β , β ) = (0 . , . and learning rate 1e-4 with 5% lin-ear warmup schedule. We use the VQA validation score totrack the progress of pretraining. Our code is based on Py-Torch (Paszke et al., 2017) and Huggingface Transformers(Wolf et al., 2019).

5. Downstream Tasks and Results

In this section, we evaluate our generative architecturesVL-T5 and VL-BART on a diverse set of 7 downstreamtasks, including two image question answering tasks (Goyalet al., 2019; Hudson & Manning, 2019), referring expres-sion comprehension (Mao et al., 2016), natural languagevisual reasoning (Suhr et al., 2019), visual commonsensereasoning (Zellers et al., 2019), image captioning (Chenet al., 2015), and multimodal machine translation (Elliottet al., 2016). We summarize the statistics of the datasetsused in downstream tasks in Table 12. We compare ourmodels with strong vision-and-language pretrained trans-formers: LXMERT (Tan & Bansal, 2019), ViLBERT (Luet al., 2019), UNITER (Chen et al., 2020), Uniﬁed VLP(Zhou et al., 2020), Oscar (Li et al., 2020b), and XGPT (Xiaet al., 2020).As summarized in Table 2, our models achieve similar re-sults to most of the baselines. We highlight that our uniﬁedgenerative modeling approach (with the input-output for-mat shown in Table 1) is close to the performance of theheavily developed task-speciﬁc discriminative models. Notethat different vision-and-language transformers are trainedwith different setups (e.g., pretraining data, objectives, fea-ture extractor, hyperparameters, computational budget), thusthe results might not be directly comparable. For exam- Our grounded captioning task can be seen as a simpliﬁeddense captioning (Johnson et al., 2016) task, where only one objectis asked to describe at a time. nifying Vision-and-Language Tasks via Text Generation

Table 2.

Single model performance on downstream tasks. Note that the baseline models adopt task-speciﬁc objectives and architectures,whereas our models tackle all tasks, including discriminative tasks (e.g., RefCOCOg), as text generation with a single architecture andobjective. (cid:63)

See our discussion in Sec.5.3. † Submitted to the leaderboard (the result will be updated).

Method RefCOCOg VCR Q → AR COCO Cap Multi30K En-Detest-std test-std test-P test d test Karpathy test test 2018Acc Acc Acc Acc Acc CIDEr BLEULXMERT 180K 72.5 60.3 74.5 - - - -ViLBERT 3M 70.9 - - - 54.8 - -UNITER Base

4M 72.9 - 77.9 74.5 58.2 -Uniﬁed VLP 3M 70.7 - - - - 117.7 -Oscar

Base

4M 73.4 61.6 78.4 - - 123.7 -XGPT 3M - - - - - 120.1 -MeMAD - - - - - - - 38.5VL-T5 180K 70.3 60.8 73.6 71.3 58.9 116.5 38.6VL-BART 180K 71.3 60.5 70.3 22.4 (cid:63) - † ple, UNITER and Oscar use around 4M extra images fromSBU captions (Ordonez et al., 2011) and Conceptual Cap-tions (Sharma et al., 2018) for pretraining. The closestbaseline to our models is LXMERT as both are pretrainedon the same datasets and use the same visual features. SeeTable 10 in the appendix for a detailed comparison betweenbaselines and our models. We tune the hyperparametersbased on the validation set of each downstream task. SeeTable 13 for details. In the rest of this section, we’ll providea detailed comparison w.r.t. our models and the baselines,as well as elaborating the details of the evaluated tasks. The visual question answering task requires models to an-swer a question to a given context image. In this work, weevaluate our models on VQA (Goyal et al., 2019) and GQA(Hudson & Manning, 2019) datasets. Each question in VQAand GQA typically have multiple answers, at each trainingstep, we randomly sample one answer from the ground-truthanswer set and use it as the text generation target.Table 2 compares our models VL-T5 and VL-BART withexisting methods on visual question answering tasks VQAand GQA. For both tasks, our models achieve comparableperformance to existing approaches. Note that in additionto the Visual Genome and COCO Captions data that weuse, UNITER and Oscar also use around 4M extra imagesfrom SBU captions (Ordonez et al., 2011), Conceptual Cap-tions (Sharma et al., 2018), and Flicker30k (Young et al.,2014) (Oscar only) for pretraining. Chen et al. (2020) haveshown that adding these extra data during pretraining im-proves model performance across various downstream tasks.

Table 3.

VQA Karpathy-test split accuracy using generative anddiscriminative methods. We break down the questions into twosubsets in terms of whether the best-scoring answer a ∗ for eachquestion is included in the top-K answer candidates A topk . In-domain : a ∗ ∈ A topk , Out-of-domain : a ∗ / ∈ A topk . Method VQA Karpathy-test Acc.In-domain Out-of-domain Overall

Discriminative

UNITER

Base

VL-T5 70.2 7.1 66.4VL-BART 69.4 7.0 65.7

Generative

VL-T5 71.4 13.1 67.9VL-BART 72.1

Generative vs.

Discriminative model

Modern ap-proaches (Tan & Bansal, 2019; Lu et al., 2019; Chen et al.,2020; Zhou et al., 2020; Li et al., 2020b) are discriminativemodels, where they tackle visual question answering tasksas multi-label classiﬁcation over a predeﬁned set of answercandidates. For example, LXMERT and UNITER train atwo-layer MLP classiﬁer with sigmoid activation and usesoft target scores on 3,129 answers that appear 9 or moretimes in the VQA train2014 split (See Sec. 3.4 and Fig. 3).While this strategy has achieved strong performance, it maynot generalize to a real-world scenario where answers maynot exist in this ﬁxed answer set. In contrast, our models VL-T5 and VL-BART directly generate answers as free-formtext, allowing a truly open-ended setup.To quantitatively compare existing discriminative ap- nifying Vision-and-Language Tasks via Text Generation

Model

Image 1 TextTrue / False

Model Model

Multi-head Attention(a) Triplet (b) Pair (c) Pair-biattn

Image 2 Image 1 TextImage 2Text

Model Model

Image 1 TextImage 2TextTrue / False True / False

Figure 4.

Different encoding settings for NLVR . Pair and

Pair-biattn approximately double the computational cost over

Triplet which our models are based on. proaches and our generative approaches, we evaluate theirperformance on questions with rare answers, i.e., out-of-domain answers (for discriminative approaches). We breakdown VQA questions in Karpathy-test split, in terms ofwhether the best-scoring answer a ∗ for each question is in-cluded in the top-K ( K = 3 , ) answer candidates A topk .The questions with a ∗ / ∈ A topk can be treated as out-of-domain questions since their best-scoring answers are rareanswers and have been excluded in standard discriminativeapproaches. After this split, the in-domain subset contains24,722 questions, and the out-of-domain subset contains1,558 questions. For discriminative baselines, we intro-duce a two-layer MLP classiﬁer with sigmoid on top ofthe decoder representation of start-of-sequence token ,following LXMERT and UNITER.In Table 3, we show the performance of the models on thesetwo subsets. Comparing models with the same backbone us-ing different modeling approaches, we notice the generativemodels improve upon the discriminative baselines acrossall the subsets. This improvement is more signiﬁcant whenlooking at the out-of-domain subset, where the generativeVL-T5 and VL-BART achieve 6 and 6.2 points improvementover their discriminative counterparts, showing the effec-tiveness of using generative modeling. When compared tothe strong discriminative baseline UNITER Base (pretrainedwith 4M extra images), our generative models still showcomparable overall performance while signiﬁcantly outper-form it on the out-of-domain subset. The task of NLVR (Suhr et al., 2019) is to determinewhether a natural language statement is true about two givenimages. To apply our model to this task, we concatenateregion features from two images and use different image idembeddings to disambiguate which image the features arefrom. Then our model learns to generate “true” or “false”.This is similar to Triplet (Fig. 4(a)) setting described inUNITER (Chen et al., 2020) The best-scoring answer is the ground-truth answer that hasthe best score according to the VQA scoring system.

Table 4.

NLVR performance comparison under different encod-ing settings. Note that Triplet takes lower computational cost than

Pair and

Pair-biattn (See also Fig. 4).

Method Setting dev test-PUNITER

Base

Triplet 73.0 73.9UNITER

Base

Pair 75.9 75.8UNITER

Base

Pair-biattn

LXMERT Pair 74.9 74.5Oscar

Base

Pair 78.1 78.4VL-T5 Triplet 74.6 73.6VL-BART Triplet 71.7 70.3

Table 5.

Referring expression comprehension performance com-parison on RefCOCOg.

Method V&L PT val d test d MattNet 66.9 67.3UNITER

Base (cid:88)

VL-T5 63.4 62.9VL-T5 (cid:88) (cid:88) under different encoding settings (See Fig. 4): (a) Triplet : joint encoding of image pairs and text; (b)

Pair : theconcatenation of individual embedding of each image-textpair; (c)

Pair-biattn : bidirectional attention added to

Pair .UNITER shows that one can improve performance with amore complex encoding setting, i.e.,

Pair-biattn achievesbetter performance than

Pair , which is again better thanthe simplest

Triplet . Note that both the

Pair and the

Pair-biattn settings approximately double the computational costcompared to that of the

Triplet setting. While there’s the gapbetween our models and baselines in

Pair and

Pair-biattn setting, VL-T5 shows comparable performance to UNITERin

Triplet setting.

Referring expression comprehension is a visual groundingtask, where given a natural language referring expression(e.g., ‘the car on the left’) describing an object in an image,a model needs to correctly localize the object in this image(when object candidates are given, the task is reduced tochoose an object from a set of candidates). In this work,we evaluate models on the RefCOCOg (Mao et al., 2016)dataset. Similar to the visual grounding pretraining task inSec. 4, we give our model a referring phrase and candidate nifying Vision-and-Language Tasks via Text Generation

Table 6.

VCR accuracy.

Stage 1 refers to the original vision-and-language generic-domain pretraining and

Stage 2 refers to the in-domainpretraining on VCR.

Method V&L PT VCR val VCR testStage 1 Stage 2 Q → A QA → R Q → AR Q → A QA → R Q → ARViLBERT 69.3 71.0 49.5 - - -ViLBERT (cid:88)

Base

Base (cid:88)

Base (cid:88) (cid:88) (cid:88) (cid:88) (cid:88)

VL-BART 65.4 68.1 44.6 - - -VL-BART (cid:88) (cid:88) (cid:88) ) of the region corre-sponding to the phrase. Following previous works UNITERand MAttNet (Yu et al., 2018a), we use detected regionsfrom Mask R-CNN (He et al., 2017) as candidates andmark a selected region to be correct if its intersection overunion (IoU) with the ground truth region is greater than 0.5.Table 5 compares our generative models with discriminativebaselines. With pretraining, VL-T5 signiﬁcantly outper-forms the strong modular model MAttNet, and achieves areasonable performance compared to the UNITER modelthat has been pretrained on a much larger corpus. Whileour method did not achieve state-of-the-art performance,these results suggest that referring expression comprehen-sion/visual grounding can be effectively formulated as atext-generation task, rather than previously (Yu et al., 2018a;Chen et al., 2020) formulated classiﬁcation task over a setof visual regions, allowing more ﬂexible architecture design.We hope our work would inspire future works in this direc-tion. We also observe that our experiments with VL-BARTon RefCOCOg diverges. One reason might be the differencein positional encoding methods of T5 and BART. Duringtraining, BART adds learned absolute positional embeddingto text token embedding, whereas T5 uses relative positionbiases in self-attention layers instead. We hypothesize thatVL-BART found strong correspondence by memorizing thepositions of each training object (we observe high trainingaccuracy, but low validation accuracy). We are activelyinvestigating this interesting phenomenon and looking forpotential solutions. https://github.com/lichengunc/MAttNet Visual Commonsense Reasoning (VCR) (Zellers et al.,2019) is a multiple-choice question answering task thatrequires commonsense reasoning beyond object or actionrecognition. Each VCR question (Q) has 4 answers (A)and 4 rationales (R), it can be decomposed into two mul-tiple choices sub-tasks: question answering (Q → A), andanswer justiﬁcation (QA → R). The overall task (Q → AR)requires a model to not only select the correct answer tothe question, but also the correct rationale for choosing theanswer. Similar to Nogueira et al. (2020) that leverageslanguage model for document ranking, we concatenate con-text (image+question) with each candidate choice, and letour models generate “true” for the correct choice and gen-erate “false” otherwise. As shown in Table 1, for Q → A,we use “vcr qa: question: [Q] answer: [A] ” as text input.For QA → R, we use “vcr qar: question: [Q] answer: [A] rationale: [R] ” as text input. During inference, we use P ( true ) P ( true )+ P ( false ) to rank the choices and select the one withthe highest score as prediction. UNITER (Chen et al., 2020)has shown that a second-stage in-domain pretraining (withthe same pretraining objectives as generic-domain pretrain-ing) on the VCR dataset would signiﬁcant improve VCRtask performance. This is likely due to the domain differencebetween VCR and the generic-domain pretraining corpus(e.g., COCO Captions), e.g., the input text (concatenationof multiple sentences: [Q]+[A]+[R] ) in VCR is muchlonger than in generic-domain pretraining. We thus alsoexperimented with a second stage pretraining on VCR.The results are shown in Table 6. On the VCR test split,We notice that our best model VL-T5 achieves a compara-ble (slightly better) performance to UNITER, while signif-icantly higher performance when compared to ViLBERT. nifying Vision-and-Language Tasks via Text Generation Table 7.

COCO captioning scores on Karparthy-test split. Allmodels are trained with cross-entropy loss. PT and FT refer to theuse of object tags during pretraining and ﬁnetuning, respectively.

Method V&L PT Object tags COCO CaptioningB C M SOscar (cid:88)

PT+FT

VL-T5 (cid:88)

FT 34.5 116.5 28.7 21.9VL-BART (cid:88)

FT 35.1 116.6 28.7 21.5Oscar (cid:88)

Uniﬁed VLP (cid:88) (cid:88) (cid:88)

VL-BART (cid:88)

On the VCR val split, comparing to the model variants thatadapt different pretraining strategies, we ﬁnd that both Stage1 generic-domain pretraining and Stage 2 in-domain pre-training help improve the VCR task performance, which isconsistent with the ﬁndings in UNITER.

We evaluate automatic caption generation performance onMS COCO Caption dataset (Chen et al., 2015). We use

Karparthy split (Karpathy & Fei-Fei, 2015), which re-splitstrain2014 and val2014 COCO images (Lin et al., 2014) into113,287 / 5000 / 5000 for train / validation / test. While somemethods use reinforcement learning-based optimization onCIDEr, we only compare with methods using cross-entropyloss. Note that image captioning is the only task in our exper-iments that do not have meaningful textual contexts, whichresults in a notable difference in pretraining and ﬁnetuningw.r.t. the input format. Inspired by Oscar (Li et al., 2020a),we also experimented with using object tags as additionaltext inputs during ﬁnetuning. We use BLEU (Papineni et al.,2002), CIDEr (Vedantam et al., 2015), METEOR (Banerjee& Lavie, 2005), SPICE (Anderson et al., 2016) as evaluationmetrics using COCOEvalCap implementation. In Table 7, we compare our models with baselines in dif-ferent settings: use of vision-and-language pretraining anduse of object tag as additional text inputs. With and withoutvision-and-language pretraining, our models show compara-ble performance to baselines. Since the use of object tagsrequires signiﬁcant extra computation, we only use it forﬁnetuning. Using tags gives a comparable or slightly im-proved performance for both models, and the improvementis signiﬁcant (2.5) in CIDEr for VL-BART. We expect tag- https://github.com/tylin/coco-caption Table 8.

Multi30K En-De multimodal translation BLEU scores. † and * refer to data augmentation and ensemble, respectively. Weuse gray color for the ensemble model it is not fairly comparable. Method V&L PT test2016 test2017 test2018MSA 38.7 - -MeMAD 38.9 32.0 -MSA † † †∗ VL-T5 (cid:88) (cid:88) augmented pretraining like Oscar would further boost theperformance of our models.

We evaluate English to German multimodal machine transla-tion performance on Multi30K dataset (Elliott et al., 2016),which have been used in WMT multimodal machine transla-tion shared tasks (Barrault et al., 2018). Multi30K dataset iscollected by translating the Flickr30K (Young et al., 2014)dataset (in English) with paired German sentences. We re-port BLEU score using SacreBLEU (Post, 2018) implemen-tation , which produces ofﬁcial WMT BLEU scores. Sinceno pretrained vision-and-language transformers have beenevaluated on the multimodal machine translation task yet,we compare our models with state-of-the-art transformermodels: Multimodal self-attention (MSA) (Yao & Wan,2020), MeMAD (Gr¨onroos et al., 2018).Table 8 shows that our T5-based models outperformed allsingle-model baselines on all three test splits of Multi30K,without strong data augmentation (e.g., back-translation,captions from external image captioning model). Our vision-and-language models outperformed their original text-onlybackbones, but we did not observe notable improvementwith vision-and-language pretraining. Vision-and-languagepretraining degraded performance of VL-BART. We conjec-ture the reasons as ( i ) the source text in Multi30K containssufﬁcient information for machine translation without visualinputs as discussed in Caglayan et al. (2019). ( ii ) the visualgrounding ability which VL-BART failed to learn (Sec.5.3)is important for multimodal machine translation task. https://github.com/mjpost/sacrebleu nifying Vision-and-Language Tasks via Text Generation Table 9.

Multi-task ﬁnetuning results on VQA and RefCOCOg.With a single set of parmeters, our multi-task model achievessimilar performance to separately optimized single-task models.

Method Finetuning tasks VQA RefCOCOgKarpathy test testAcc AccVL-T5 VQA 67.9 -VL-T5 RefCOCOg - 71.3VL-T5 VQA + RefCOCOg 67.0 70.1

While our framework has uniﬁed the architecture for dif-ferent downstream tasks, the parameters are separately op-timized. To see whether we can go one step further, wetrain a single model that tackles different kinds of tasks atonce with the same set of weights. Speciﬁcally, we ﬁnetuneVL-T5 on two different tasks, VQA (Goyal et al., 2019)and RefCOCOg (Mao et al., 2016), in a multi-task learningsetup. At each ﬁnetuning step, we sample a mini-batch ofexamples from one of the two tasks. The existing vision-and-language multi-task learning method (Lu et al., 2020) trainsmultiple task-speciﬁc heads and only shares the backboneencoder, as illustrated in Fig. 3. With the help of our uniﬁedencoder-decoder architecture and generative pretraining, webuild a uniﬁed multi-task model, where only a single sharedlanguage modeling head is learned for both tasks.Table 9 shows the multi-task and single-task ﬁnetuning re-sults of VL-T5 on VQA and RefCOCOg. On both tasks, ourmulti-task model achieves similar performance compared tothe single-task models, while using a single set of weightsshared by both tasks. Since we did not use advanced multi-task learning strategies such as oversampling or dynamicstop-and-go (Lu et al., 2020), we expect the multi-task per-formance of our model to be further improved with theseorthogonal techniques.

6. Conclusion

In this work, we proposed VL-T5 and VL-BART whichtackle vision-and-language tasks with a uniﬁed text gener-ation objective. Experiments show VL-T5 and VL-BARTcan achieve comparable performance with state-of-the-artvision-and-language transformers on diverse vision-and-language tasks without hand-crafted architectures and objec-tives. Especially, we demonstrate our generative approach isbetter suited for open-ended visual question answering. Inaddition, we also showed it is possible to train two differenttasks simultaneously using the same architecture with thesame weight while not losing much performance, it wouldbe an interesting future work to further explore this directionby adding even more tasks.

Acknowledgments

We thank Hyounghun Kim, Zineng Tang, Swarnadeep Saha,Xiang Zhou for their comments and suggestions. Thiswork was supported by NSF-CAREER Award 1846185,ARO-YIP Award W911NF-18-1-0336, DARPA MCS GrantN66001-19-2-4031, Google Focused Research Award, andBloomberg Data Science Ph.D. Fellowship. The views,opinions, and/or ﬁndings contained in this article are thoseof the authors and not of the funding agency.

References

Anderson, P., Fernando, B., Johnson, M., and Gould, S.SPICE: Semantic Propositional Image Caption Evalua-tion. In

ECCV , 2016.Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M.,Gould, S., and Zhang, L. Bottom-Up and Top-DownAttention for Image Captioning and Visual Question An-swering. In

CVPR , 2018. URL http://arxiv.org/abs/1707.07998 .Banerjee, S. and Lavie, A. METEOR : An Automatic Met-ric for MT Evaluation with Improved Correlation withHuman Judgments. In

ACL Workshop , 2005.Barrault, L., Bougares, F., Specia, L., Lala, C., Elliott, D.,and Frank, S. Findings of the Third Shared Task onMultimodal Machine Translation. In

WMT , pp. 304–323,2018. doi: 10.18653/v1/w18-6402.Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G.,Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu,J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M.,Gray, S., Chess, B., Clark, J., Berner, C., McCandlish,S., Radford, A., Sutskever, I., and Amodei, D. LanguageModels are Few-Shot Learners. In

NeurIPS , 2020. URL http://arxiv.org/abs/2005.14165 .Caglayan, O., Madhyastha, P., Specia, L., and Barrault,L. Probing the Need for Visual Context in Multi-modal Machine Translation. In

NAACL , 2019. ISBN9781950737130. doi: 10.18653/v1/n19-1422.Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S.,Dollar, P., and Zitnick, C. L. Microsoft COCO Captions:Data Collection and Evaluation Server. apr 2015. URL http://arxiv.org/abs/1504.00325 .Chen, Y.-c., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan,Z., Cheng, Y., and Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In

ECCV , 2020. URL https://arxiv.org/abs/1909.11740 . nifying Vision-and-Language Tasks via Text Generation Cho, J., Lu, J., Schwenk, D., Hajishirzi, H., and Kembhavi,A. X-LXMERT: Paint, Caption and Answer Questionswith Multi-Modal Transformers. In

EMNLP , 2020. doi:10.18653/v1/2020.emnlp-main.707.Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Elec-tra: Pre-training text encoders as discriminators ratherthan generators. In

ICLR , 2020.Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.BERT: Pre-training of Deep Bidirectional Transformersfor Language Understanding. In

NAACL , oct 2019. URL http://arxiv.org/abs/1810.04805 .Elliott, D., Frank, S., Sima’an, K., and Specia, L. Multi30K: Multilingual English-German Image Descriptions. In

ACL Workshop , pp. 70–74, 2016.Gao, T., Fisch, A., and Chen, D. Making Pre-trained Lan-guage Models Better Few-shot Learners. 2020. URL http://arxiv.org/abs/2012.15723 .Goyal, Y., Khot, T., Agrawal, A., Summers-Stay, D., Ba-tra, D., and Parikh, D. Making the V in VQA Matter:Elevating the Role of Image Understanding in VisualQuestion Answering.

International Journal of Com-puter Vision , 2019. ISSN 15731405. doi: 10.1007/s11263-018-1116-0.Gr¨onroos, S.-A., Huet, B., Kurimo, M., Laaksonen, J., Meri-aldo, B., Pham, P., Sj¨oberg, M., Sulubacak, U., Tiede-mann, J., Troncy, R., and V´azquez, R. The MeMADSubmission to the WMT18 Multimodal Translation Task.In

WMT , volume 2, pp. 609–617, 2018.He, K., Gkioxari, G., Dollar, P., and Girshick, R. MaskR-CNN.

ICCV , 2017.Huang, L., Wang, W., Chen, J., and Wei, X. Y. Attention onattention for image captioning. In

ICCV , pp. 4633–4642,2019. ISBN 9781728148038. doi: 10.1109/ICCV.2019.00473.Huang, Z., Zeng, Z., Liu, B., Fu, D., and Fu, J. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. 2020. URL http://arxiv.org/abs/2004.00849 .Hudson, D. A. and Manning, C. D. GQA: A new dataset forreal-world visual reasoning and compositional questionanswering. In

CVPR , 2019. ISBN 9781728132938. doi:10.1109/CVPR.2019.00686.Johnson, J., Karpathy, A., and Fei-Fei, L. DenseCap: FullyConvolutional Localization Networks for Dense Caption-ing. In

CVPR , 2016. Karpathy, A. and Fei-Fei, L. Deep Visual-Semantic Align-ments for Generating Image Descriptions. In

CVPR ,2015. ISBN 9781467369640. doi: 10.1109/TPAMI.2016.2598339.Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T.Referitgame: Referring to objects in photographs of natu-ral scenes. In

EMNLP , 2014.Keskar, N. S., McCann, B., Xiong, C., and Socher, R. Unify-ing Question Answering and Text Classiﬁcation via SpanExtraction. 2019. URL http://arxiv.org/abs/1904.09286 .Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord,O., Clark, P., and Hajishirzi, H. Uniﬁed QA : CrossingFormat Boundaries with a Single QA System. In

Findingsof EMNLP , 2020.Kim, J.-h., Jun, J., and Zhang, B.-t. Bilinear AttentionNetworks. In

NeurIPS , pp. 1–12, 2018.Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K.,Kravitz, J., Chen, S., Kalantidis, Y., Jia-Li, L., Shamma,D. A., Michael Bernstein, and Fei-Fei, L. VisualGenome: Connecting Language and Vision Using Crowd-sourced Dense Image Annotations.

International Jour-nal of Computer Vision , 2016. ISSN 15731405. doi:10.1007/s11263-016-0981-7.Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P.,and Soricut, R. Albert: A lite bert for self-supervisedlearning of language representations. In

ICLR , 2020.Lei, J., Yu, L., Bansal, M., and Berg, T. L. Tvqa: Localized,compositional video question answering. In

EMNLP ,2018.Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mo-hamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L., andBart, P.-t. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation,and Comprehension. In

ACL , 2020.Li, L., Chen, Y.-C., Yu Cheng, Z. G., Yu, L., and Liu, J.HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In

EMNLP , 2020a.Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L.,Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., andGao, J. Oscar: Object-Semantics Aligned Pre-trainingfor Vision-Language Tasks. In

ECCV , 2020b. URL http://arxiv.org/abs/2004.06165 .Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-manan, D., Doll´ar, P., and Zitnick, C. L. Microsoft COCO:Common Objects in Context. In

ECCV , 2014. ISBN 978-3-319-10601-4. doi: 10.1007/978-3-319-10602-1 48. nifying Vision-and-Language Tasks via Text Generation

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 , 2019.Loshchilov, I. and Hutter, F. Decoupled Weight De-cay Regularization. In

ICLR , 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7 .Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pre-training Task-Agnostic Visiolinguistic Representationsfor Vision-and-Language Tasks. In

NeurIPS , 2019. URL http://arxiv.org/abs/1908.02265 .Lu, J., Goswami, V., Rohrbach, M., Parikh, D., and Lee,S. 12-in-1: Multi-Task Vision and Language Repre-sentation Learning. In

CVPR , 2020. URL http://arxiv.org/abs/1912.02315 .Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., andMurphy, K. Generation and Comprehension of Unam-biguous Object Descriptions. In

CVPR , 2016.Mccann, B., Keskar, N. S., Xiong, C., and Socher, R. TheNatural Language Decathlon : Multitask Learning asQuestion Answering. 2018.Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J.,and Zisserman, A. End-to-end learning of visual repre-sentations from uncurated instructional videos. In

CVPR ,2020.Narang, S., Diamos, G., Elsen, E., Micikevicius, P., Alben,J., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O.,Venkatesh, G., and Wu, H. Mixed Precision Training.In

ICLR , 2018. URL https://openreview.net/forum?id=r1gs9JgRZ .Nogueira, R., Jiang, Z., Lin, J., Mar, I. R., Pradeep, R., andLin, J. Document Ranking with a Pretrained Sequence-to-Sequence Model. In

Findings of EMNLP , pp. 1–8,2020.Ordonez, V., Kulkarni, G., and Berg, T. L. Im2Text : De-scribing Images Using 1 Million Captioned Photographs.In

NIPS , 2011.Papineni, K., Roukos, S., Ward, T., and Zhu, W.W.-j. BLEU: a Method for Automatic Evaluationof Machine Translation. In

ACL , 2002. ISBN 1-55860-883-4. doi: 10.3115/1073083.1073135. URL http://portal.acm.org/citation.cfm?doid=1073083.1073135http://dl.acm.org/citation.cfm?id=1073135 .Paszke, A., Gross, S., Chintala, S., Chana, G., Yang, E.,DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., andLerer, A. Automatic differentiation in PyTorch. In

NIPS Workshop , 2017. URL https://openreview.net/pdf?id=BJJsrmfCZ .Post, M. A Call for Clarity in Reporting BLEU Scores. In

WMT , volume 1, pp. 186–191, 2018.Press, O. and Wolf, L. Using the Output Embedding toImprove Language Models. In

EACL , 2017.Radford, A., Wook, J., Chris, K., Aditya, H., Gabriel, R.,Sandhini, G., Sastry, G., Askell, A., Mishkin, P., Clark, J.,Krueger, G., and Sutskever, I. Learning Transferable Vi-sual Models From Natural Language Supervision. 2021.Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploringthe Limits of Transfer Learning with a Uniﬁed Text-to-Text Transformer.

JMLR , 21:1–67, 2019. URL http://arxiv.org/abs/1910.10683 .Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad:100,000+ questions for machine comprehension of text.In

EMNLP , 2016.Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Re-gion Proposal Networks. In

NIPS , 2015. URL https://arxiv.org/abs/1506.01497 .Sharma, P., Ding, N., Goodman, S., and Soricut, R. Con-ceptual captions: A cleaned, hypernymed, image alt-textdataset for automatic image captioning. In

ACL , 2018.ISBN 9781948087322. URL .Shaw, P., Uszkoreit, J., and Vaswani, A. Self-Attention withRelative Position Representations. In

NAACL , 2018.Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. MASS:Masked Sequence to Sequence Pre-training for LanguageGeneration. In

ICML , 2019. URL http://arxiv.org/abs/1905.02450 .Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi,Y. A Corpus for Reasoning About Natural LanguageGrounded in Photographs. In

ACL , 2019. URL http://arxiv.org/abs/1811.00491 .Sun, C., Baradel, F., Murphy, K., and Schmid, C. Con-trastive Bidirectional Transformer for Temporal Represen-tation Learning. 2019a. URL http://arxiv.org/abs/1906.05743 .Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid,C. VideoBERT: A Joint Model for Video and LanguageRepresentation Learning. In

ICCV , 2019b. URL http://arxiv.org/abs/1904.01766 . nifying Vision-and-Language Tasks via Text Generation Tan, H. and Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers.In

EMNLP , 2019. URL http://arxiv.org/abs/1908.07490 .Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,Jones, L., Gomez, A. N., Kaiser, L., and Polo-sukhin, I. Attention Is All You Need. In

NIPS ,2017. URL https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf .Vedantam, R., Zitnick, C. L., and Parikh, D. CIDEr:Consensus-based Image Description Evaluation. In

CVPR , nov 2015. URL http://arxiv.org/abs/1411.5726 .Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., andBowman, S. R. Glue: A multi-task benchmark and analy-sis platform for natural language understanding. In

ICLR ,2018.Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understandingthrough inference. In

NAACL , 2017.Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue,C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,M., and Brew, J. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. 2019. URL http://arxiv.org/abs/1910.03771 .Xia, Q., Huang, H., Duan, N., Zhang, D., and Ji, L. XGPT :Cross-modal Generative Pre-Training for Image Caption-ing. 2020. URL https://arxiv.org/abs/2003.01473 .Xu, J., Mei, T., Yao, T., and Rui, Y. Msr-vtt: A large videodescription dataset for bridging video and language. In

CVPR , 2016.Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,R. R., and Le, Q. V. Xlnet: Generalized autoregres-sive pretraining for language understanding. In

NeurIPS ,2019.Yao, S. and Wan, X. Multimodal Transformer for Mul-timodal Machine Translation. In

ACL , pp. 4346–4350,2020. doi: 10.18653/v1/2020.acl-main.400.Young, P., Lai, A., Hodosh, M., and Hockenmaier,J. From Image Descriptions to Visual Denotations:New Similarity Metrics for Semantic Inferenceover Event Descriptions.

TACL , 2(April):67–78,2014. ISSN 2307-387X. URL http://nlp.cs.illinois.edu/HockenmaierGroup/Papers/DenotationGraph.pdf . Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L.Modeling context in referring expressions. In

ECCV ,2016.Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., andBerg, T. L. MAttNet : Modular Attention Network forReferring Expression Comprehension. In

CVPR , 2018a.URL https://arxiv.org/abs/1801.08186 .Yu, Y., Kim, J., and Kim, G. A joint sequence fusion modelfor video question answering and retrieval. In

ECCV ,2018b.Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. Swag:A large-scale adversarial dataset for grounded common-sense inference. In

EMNLP , 2018.Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. From Recog-nition to Cognition: Visual Commonsense Reasoning.In

CVPR , 2019. URL http://arxiv.org/abs/1811.10830 .Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L.,Choi, Y., and Gao, I. VinVL: Making Visual Representa-tions Matter in Vision-Language Models. 2021.Zhou, L., Xu, C., and Corso, J. J. Towards automatic learn-ing of procedures from web instructional videos. In

AAAI ,2018.Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J. J., andGao, J. Uniﬁed Vision-Language Pre-Training for ImageCaptioning and VQA. In

AAAI , 2020. URL http://arxiv.org/abs/1909.11059 .Zhu, L. and Yang, Y. Actbert: Learning global-local video-text representations. In

CVPR , 2020.Zhu, Y., Groth, O., Bernstein, M., and Fei-Fei, L. Visual7W:Grounded Question Answering in Images. In

CVPR ,2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.540. URL http://arxiv.org/abs/1511.03416 . nifying Vision-and-Language Tasks via Text Generation Table 10.

Summary of baseline vision-and-language transformers. a Since not all models report exact parameter numbers, we providerough estimates compared to BERT

Base (86M; noted as P), where word embedding parameters are excluded. b LXMERT and XGPT arenot initialized from pretrained language models. LXMERT authors found pretraining from scratch was more effective than initializationfrom BERT

Base in their experiments. XGPT uses text pretraining on Conceptual captions and COCO captions with Masked LM (Devlinet al., 2019) and Masked Seq2Seq (Song et al., 2019) objectives before V&L pretraining. c LXMERT (text+visual+cross-modal) andViLBERT (cross-modal) use dual-stream encoders. ViLBERT uses 768/1024-dim hidden states for text/visual streams respectively. XGPTuses AoA module (Huang et al., 2019) as visual encoder. Rest of the models use single-stream encoders. d For generation tasks, UniﬁedVLP and Oscar use causal mask and reuse encoder as decoder similar to UniLM. e XGPT also uses shared parameters for encoder anddecoder, but its decoder is right-shifted and predicts next tokens. f Uniﬁed VLP is initialized from UniLM, which is initialized fromBERT

Large . g Oscar uses object tags as additional text inputs.

V&L Pretraining HyperparametersDataset a Hidden dim b c

2P 768 36 absoluteViLBERT CC 3M Encoder BERT

Base c c ∼

36 absoluteUNITER

Base

CC+SBU+COCO+VG 4M Encoder BERT

Base

12 P 768 10 ∼

100 absoluteUniﬁed VLP CC 3M Encoder d UniLM f

12 P 768 100 absoluteOscar

Base

CC+SBU+COCO+VG+Flickr30K 4M Encoder d BERT

Base

12 P 768 50 g absoluteXGPT CC+COCO 3M Enc-Dec e - b c +12+12 P 768 100 absoluteVL-T5 COCO+VG 180K Enc-Dec T5 Base

Base

Table 11.

Pretraining tasks used in our vision-and-language pretraining. The images that have any intersection with evaluation set ofdownstream tasks (e.g., COCO caption, RefCOCOg) and the held-out validation set for pretraining are excluded.

Task Image source Text source

A. Summary of Vision-and-LanguageTransformers

In Table 10, we compare the baseline vision-and-languagetransformers and our VL-T5 and VL-BART in detail.

B. Pretraining and Downstream Task Details

In Table 11 and Table 12, we show the detailed statisticsof our pretraining and downstream datasets and tasks. InTable 13, we show the hyperparameters that we used in ourpretraining and downstream task experiments. nifying Vision-and-Language Tasks via Text Generation

Table 12.

Statistics of the datasets used in downstream tasks. The data that are not used for training/validation (e.g., COCO test2015images) and data for leaderboard submissions (e.g., test-dev/test-std for VQA, test for GQA) are excluded.

Datasets Image source Web Crawled 238K (206K) 100K (86K) AccuracyRefCOCOg COCO 26K (21K) 95K (80K) AccuracyVCR Movie Clips 110K (80K) 290K (212K) AccuracyCOCO Caption COCO 123K (113K) 616K (566K) BLEU,CIDEr,METEOR,SPICEMulti30K En-De Flickr30K 31K (29K) 31K (29K) BLEU

Table 13.

Hyperparameters for pretraining and downtream tasks

Model Task Learning rate Batch size EpochsVL-T5 Pretraining 1e-4 320 30VCR Pretraining 5e-5 80 20VQA 5e-5 320 20GQA 1e-5 240 20NLVR2

Related Researches

Leveraging cross-platform data to improve automated hate speech detection

by John D Gallacher

Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

by Wenmeng Yu

BembaSpeech: A Speech Recognition Corpus for the Bemba Language

by Claytone Sikasote

Decontextualization: Making Sentences Stand-Alone

by Eunsol Choi

AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation

by Jonáš Kulhánek

NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

by Chuhan Wu

Bootstrapping Relation Extractors using Syntactic Search by Examples

by Matan Eyal

Transfer Learning Approach for Arabic Offensive Language Detection System -- BERT-Based Model

by Fatemah Husain

Broader terms curriculum mapping: Using natural language processing and visual-supported communication to create representative program planning experiences

by Rogério Duarte

Quality Estimation without Human-labeled Data

by Yi-Lin Tuan

Generate and Revise: Reinforcement Learning in Neural Poetry

by Andrea Zugarini

Effects of Layer Freezing when Transferring DeepSpeech to New Languages

by Onno Eberhard

How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases

by Hannah Kirk

SLUA: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning

by Di Wu

In-Order Chart-Based Constituent Parsing

by Yang Wei

A study of text representations in Hate Speech Detection

by Chrysoula Themeli

A Hybrid Task-Oriented Dialog System with Domain and Task Adaptive Pretraining

by Boliang Zhang

Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration

by Betty van Aken

Wake Word Detection with Streaming Transformers

by Yiming Wang

The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point

by Magnus Sahlgren

OntoEnricher: A Deep Learning Approach for Ontology Enrichment from Unstructured Text

by Lalit Mohan Sanagavarapu

CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models

by Yusheng Su

Spoiler Alert: Using Natural Language Processing to Detect Spoilers in Book Reviews

by Allen Bao

An open access NLP dataset for Arabic dialects : Data collection, labeling, and model construction

by ElMehdi Boujou

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

by Yunyang Xiong

«

1

2

3

4

»

Submitted on 4 Feb 2021 (v1), last revised 23 May 2021 (this version, v2) Updated

arXiv.org Original Source

NASA ADS

Google Scholar

Semantic Scholar