VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Abstract

Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question and answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of explanations synthesized by our method. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.


Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, Jiebo Luo
University of Science and Technology of China, Nanyang Technological University, NVIDIA AI Technology Center, University of Rochester


Keywords: Visual Question Answering, Model with Explanation

In recent years, visual question answering (VQA) has been widely studied by researchers in both the computer vision and natural language processing communities [2,34,8,27,31,11]. Most existing works perform VQA by utilizing an attention mechanism and combining features from the two modalities to predict answers. Although promising performance has been reported, there is still a huge gap for humans to truly understand the model decisions without any explanation for them. A popular way to explain the predicted answers is to visualize attention maps to indicate 'where to look'. The attended regions are used to trace the predicted answer back to the image content. However, the visual justification through attention visualization is implicit, and it cannot entirely reveal what the model captures from the attended regions for answering the questions. There could be many cases where the model attends to the right regions but predicts wrong answers. What's worse, the visual justification is not accessible to visually impaired people, who are potential users of VQA techniques. Therefore, in this paper we intend to explore textual explanations to compensate for these weaknesses of visual attention in VQA.

Fig. 1. VQA-E provides insightful information that can explain, elaborate or enhance predicted answers compared with the traditional VQA task. Q=Question, A=Answer, E=Explanation. (Left) From the answer, there is no way to trace the corresponding visual content to tell the name of the hotel. The explanation clearly points out where to look for the answer. (Middle) The explanation provides a real answer to the aspect asked. (Right) The word "anything" in the question refers to a vague concept without specific indication. The answer is enhanced by the "madonna shirt" in the explanation.

Another crucial advantage of textual explanation is that it elaborates and enhances the predicted answer with more relevant information. As shown in Fig. 1, a textual explanation can be a clue to justify the answer, or a complementary delineation that elaborates on the context of the question and answer, or a detailed specification about abstract concepts mentioned in the QA to enhance the short answer. Such textual explanations are important for effective communication since they provide feedback that enables the questioners to extend the conversation. Unfortunately, although textual explanations are desired for both model interpretation and effective communication in natural contexts, little progress has been made in this direction, partly because almost all the public datasets, such as VQA [2,8], COCO-QA [22], and Visual7W [34], do not provide explanations for the annotated answers.

In this work, we aim to address the above limitations of existing VQA systems by introducing a new task called VQA-E (VQA with Explanations). In VQA-E, the models are required to provide a textual explanation for the predicted answer. We conduct our research in two steps. First, to foster research in this area, we construct a new dataset with textual explanations for the answers. The VQA-E dataset is automatically derived from the popular VQA v2 dataset [8] by synthesizing an explanation for each image-question-answer triple. The VQA v2 dataset is one of the largest VQA datasets, with over 650k question-answer pairs, and more importantly, each image in the dataset is coupled with five descriptions from MSCOCO captions [4]. Although these captions were written without considering the questions, they do include some QA-related information, and thus exploiting these captions could be a good starting point for obtaining explanations free of cost. We further explore several simple but effective techniques to synthesize an explanation from the caption and the associated question-answer pair. To relieve concern about the quality of the synthesized explanations, we conduct a comprehensive user study to evaluate a randomly selected subset of the explanations. The user study results show that the explanation quality is good for most question-answer pairs while being a little inadequate for the questions asking for a subjective response or requiring common sense (pragmatic knowledge). Overall, we believe the newly created dataset is good enough to serve as a benchmark for the proposed VQA-E task.

To show the advantages of learning with textual explanations, we also propose a novel VQA-E model, which addresses both the answer prediction and the explanation generation in a multi-task learning architecture. Our dataset enables us to train and evaluate the VQA-E model, which goes beyond a short answer by producing a textual explanation to justify and elaborate on it. Through extensive experiments, we find that the additional supervision from explanations can help the model better localize the important image regions and lead to an improvement in the accuracy of answer prediction. Our VQA-E model outperforms the state-of-the-art methods on the VQA v2 dataset.

Attention in Visual Question Answering.

The attention mechanism was first used in machine translation [3] and was then brought into vision-to-language tasks [29,32,28,31,18,15,19,33,10,9,30]. Visual attention in vision-to-language tasks is used to address the problem of "where to look" [25]. In VQA, the question is used as a query to search for the relevant regions in the image. [31] proposes a stacked attention model which queries the image multiple times to infer the answer progressively. Beyond visual attention, Lu et al. [18] exploit a hierarchical question-image co-attention strategy to attend to both related regions in the image and crucial words in the question. [19] proposes the dual attention network, which refines the visual and textual attention via multiple reasoning steps. The attention mechanism can find the question-related regions in the image, which can account for the answer to some extent. [6] has studied how well the visual attention is aligned with the human gaze. The results show that when answering a question, current attention-based models do not seem to be "looking" at the same regions of the image as humans do. Although attention is a good visual explanation for the answer, it is not accessible to visually impaired people and is somewhat limited in real-world applications.

Model with Explanations.

Recently, a number of works [14,20,17] have been done on explaining the decisions of deep learning models, which are typically black boxes due to the end-to-end training procedure. [14] proposes a novel explanation model for bird classification. However, their class relevance metrics are not applicable to VQA since there is no pre-defined semantic category for the questions and answers. Therefore, we build a reference dataset to directly train and evaluate models for VQA with explanations. The most similar work to ours is Multimodal Explanations [20], which proposes a multimodal explanation dataset for VQA that is human-annotated and of high quality. In contrast, our dataset focuses on textual explanations, is built free of cost, and is over six times bigger (269,786 vs. 41,817) than theirs.

Fig. 2. An example of the pipeline to fuse the question (Q), the answer (A) and the relevant caption (C) into an explanation (E). Each question-answer pair is converted into a statement (S). The statement and the most relevant caption are both parsed into constituency trees. These two trees are then aligned by the common node. The subtree including the common node in the statement is merged into the caption tree to obtain the explanation.

We now introduce our VQA-E dataset. We begin by describing the process of synthesizing explanations from image descriptions for question-answer pairs, followed by dataset analysis and a user study to assess the quality of our dataset.

The first step is to find the caption most relevant to the question and answer. Given an image caption $C$, a question $Q$ and an answer $A$, we tokenize and encode them into GloVe word embeddings [21]: $W_c = \{w_1, ..., w_{T_c}\}$, $W_q = \{w_1, ..., w_{T_q}\}$, $W_a = \{w_1, ..., w_{T_a}\}$, where $T_c$, $T_q$, $T_a$ are the numbers of words in the caption, question, and answer, respectively. We compute the similarity between the caption and the question-answer pair as follows:

$$s(w_i, w_j) = \frac{1}{2}\left(1 + \frac{w_i^T w_j}{\|w_i\| \cdot \|w_j\|}\right) \tag{1a}$$

$$S(Q, C) = \frac{1}{T_q}\sum_{w_i \in W_q} \max_{w_j \in W_c} s(w_i, w_j) \tag{1b}$$

$$S(A, C) = \frac{1}{T_a}\sum_{w_i \in W_a} \max_{w_j \in W_c} s(w_i, w_j) \tag{1c}$$

$$S(\langle Q, A \rangle, C) = \frac{1}{2}\left(S(Q, C) + S(A, C)\right) \tag{1d}$$
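The mean-max similarity of Eq. (1) can be illustrated with a minimal sketch. It assumes each word is already mapped to a non-zero embedding vector; the function names and the toy 2-dimensional vectors are hypothetical and stand in for real GloVe embeddings.

```python
import math


def word_similarity(u, v):
    """Eq. (1a): cosine similarity rescaled from [-1, 1] to [0, 1].

    Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 + dot / (norm_u * norm_v))


def mean_max_similarity(words, caption_words):
    """Eq. (1b)/(1c): best caption match per word, averaged over the words."""
    best_matches = [
        max(word_similarity(w, c) for c in caption_words) for w in words
    ]
    return sum(best_matches) / len(words)


def qa_caption_similarity(question, answer, caption):
    """Eq. (1d): average of the question and answer similarities."""
    return 0.5 * (
        mean_max_similarity(question, caption)
        + mean_max_similarity(answer, caption)
    )


# Toy 2-d "embeddings" (hypothetical values, not GloVe vectors).
caption = [[1.0, 0.0], [0.0, 1.0]]
question = [[1.0, 0.0], [-1.0, 0.0]]
answer = [[0.0, 2.0]]

score = qa_caption_similarity(question, answer, caption)
print(score)
```

In this toy case the first question word matches a caption word exactly (similarity 1), the second is at best orthogonal to a caption word (similarity 0.5), and the answer word is parallel to a caption word (similarity 1), so the overall score is 0.5 * (0.75 + 1) = 0.875.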

Fig. 3. Top: similarity score distribution. Bottom: illustration of VQA-E examples at different similarity levels.

For each question-answer pair, we find the most relevant caption, coupled with a similarity score. We have tried more complex techniques, such as using Term Frequency and Inverse Document Frequency to adjust the weights of different words, but we find that the simple mean-max formula in Eq. (1) works better.

To generate a good explanation, we intend to fuse the information from both the question-answer pair and the most relevant caption. First, the question and answer are merged into a declarative statement. We achieve this by designing simple merging rules based on the question types and the answer types. Similar rule-based methods have been explored in NLP to generate questions from declarative statements [13] (i.e., the opposite direction). We then fuse this QA statement with the caption by aligning and merging their constituency parse trees. We further refine the combined sentence with a grammar check and correction tool to obtain the final explanation, and compute its similarity to the question-answer pair with Eq. (1). An example of our pipeline is shown in Fig. 2.

Similarity distribution.

Due to the large size and diversity of questions, and the limited sources of captions for each image, it is not guaranteed that a good explanation can be generated for each Q&A pair. The explanations with low similarity scores are removed from the dataset to reduce noise. We present some examples in Fig. 3, which show a gradual improvement in explanation quality as the similarity scores increase. After some empirical investigation, we select a similarity threshold of 0.6 to filter out noisy explanations. We also plot the similarity score histogram in Fig. 3. Interestingly, we observe a clear trough at 0.6, so the explanations are well separated by this threshold.
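The caption selection and threshold filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores are taken as already computed with Eq. (1), the example sentences and scores are made up, the function names are hypothetical, and keeping scores equal to the threshold (`>=`) is an assumption, since the text does not say how ties are handled.

```python
SIMILARITY_THRESHOLD = 0.6  # threshold reported in the text


def most_relevant_caption(scored_captions):
    """Pick the (caption, score) pair with the highest similarity score."""
    return max(scored_captions, key=lambda pair: pair[1])


def filter_explanations(scored_explanations, threshold=SIMILARITY_THRESHOLD):
    """Keep explanations whose similarity reaches the threshold.

    Using >= (rather than >) is an assumption of this sketch.
    """
    return [
        (text, score)
        for text, score in scored_explanations
        if score >= threshold
    ]


# Hypothetical captions of one image, scored against one QA pair.
captions = [
    ("a man riding a wave on a surfboard", 0.82),
    ("a beach on a sunny day", 0.55),
    ("people near the ocean", 0.61),
]
best = most_relevant_caption(captions)

# Hypothetical synthesized explanations with their similarity scores.
explanations = [
    ("the man is surfing on a wave", 0.78),
    ("a plate of food on a table", 0.41),
]
kept = filter_explanations(explanations)
print(best, kept)
```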

Fig. 4. Distribution of synthesized explanations by different question types. (Bar chart; x-axis: question types, from 'are' to 'why is the'; legend: good explanation, bad explanation.)

Table 1. Statistics for our VQA-E dataset.

Dataset  Split  #Images  #Q&A     #E       #Unique Q  #Unique A  #Unique E
VQA-E    Train   72,680  181,298  181,298     77,418      9,491    115,560
VQA-E    Val     35,645   88,488   88,488     42,055      6,247     56,916
VQA-E    Total  108,325  269,786  269,786    108,872     12,450    171,659
VQA-v2   Train   82,783  443,757        0    151,693     22,531          0
VQA-v2   Val     40,504  214,354        0     81,436     14,008          0
VQA-v2   Total  123,287  658,111        0    215,076     29,332          0

In this section, we analyze our VQA-E dataset, particularly the automatically synthesized explanations. Out of 658,111 existing question-answer pairs in the original VQA v2 dataset, our approach generates relevant explanations with high similarity scores for 269,786 QA pairs (41%). More statistics about the dataset are given in Table 1.

We plot the distribution of the number of synthesized explanations for each question type in Fig. 4. The percentage of relevant explanations varies from one question type to another.

Abstract questions vs. Specific questions.

It is observed that the percentage of relevant explanations is generally higher for 'is/are' and 'what' questions than for 'how', 'why' and 'do' questions. This is because 'is/are' and 'what' questions tend to be related to specific visual contents, which are more likely to be described by image captions. In addition, a more specific question type could further help in the explanation generation. For example, for 'what sport is' and 'what room is' questions, our approach successfully generates explanations for 90% and 87% of the question-answer pairs, respectively. The rates of having good explanations for these types of questions are much higher than for the general 'what' questions (40%).


Fig. 5. Subjective examples: our method cannot handle the questions involving emotional feeling (left), commonsense knowledge (middle) or behavioral reasoning (right).

Subjective questions: Do you/Can you/Do/Could?

The existing VQA datasets involve some questions that require subjective feeling, logical thinking or behavioral reasoning. These questions often fall in the question types starting with 'do you', 'can you', 'do', 'could', etc. For these questions, there may be underlying clues in the image contents, but the evidence is usually opaque and indirect, and thus it is hard to synthesize a good explanation. We illustrate examples of such questions in Fig. 5; the generated explanations are generally inadequate to provide relevant details regarding the questions and answers.

Due to the inadequacy in handling the above-mentioned cases, we only achieve small percentages of good explanations for these question types. The percentages for 'do you', 'can you', 'do' and 'could' questions are 4%, 5%, 13% and 6%, respectively, which are far below the average of 41%.

It is not easy to use quantitative metrics to evaluate whether the synthesized explanations can provide valid, relevant and complementary information to the answers of the visual questions. Therefore, we conduct a user study to assess our VQA-E dataset from a human perspective. In particular, we measure the explanation quality from four aspects: fluent, correct, relevant, and complementary.

Fluent measures the fluency of the explanation. A fluent explanation should be correct in grammar and idiomatic in wording. The correct metric indicates whether the explanation is correct according to the image content. The relevant metric assesses the relevance of an explanation to the question and answer pair. If an explanation is relevant, users should be able to infer the answer from the explanation. This metric is important to measure whether the proposed word embedding similarity can effectively select and filter the explanations. Through the user study, we evaluate the relevance of explanations from human understanding to verify whether the synthesized explanations are closely tied to their corresponding QA pairs. Last but not least, we evaluate whether an explanation is complementary to the answer. It is essential that the explanation can provide complementary details to the abbreviated answers so that the visual accordance between the answer and the image can be enhanced.

Table 2. User assessment results for the synthesized explanation, the most similar caption, the random caption, and the generated explanation. To avoid bias, they are evaluated jointly, and in each sample their order is shuffled and unknown to users. They are assessed by the human evaluators on a 1-5 scale: 1-very poor, 2-poor, 3-barely acceptable, 4-good, 5-very good. Here we show the average scores over 2,000 questions.

                         Fluent  Correct  Relevant  Complementary
Synthesized Explanation  4.89    4.78
Most Similar Caption

Evaluation results summary.

We show the human evaluation results in Table 2. Since the synthesized explanations are derived from existing human-annotated captions, their average fluency and correctness scores are both close to 5. More importantly, their relevance and complementariness scores are both above 4, which indicates that the overall quality of the explanations is good from a human perspective. These two metrics differentiate a general caption of an image from our specific explanation dedicated to a visual question-answer pair.

Fig. 6. An overview of the multi-task VQA-E network. First, an image is represented by a pre-trained CNN, while the question is encoded via a single-layer GRU. Then the image features and question features are input to the Attention module to obtain image features for question-guided regions. Finally, the question features and attended image features are used to simultaneously predict an answer and generate an explanation.

Based on the well-constructed VQA-E dataset, in this section, we introduce the proposed multi-task VQA-E model. Fig. 6 gives an overview of our model. Given an image $I$ and a question $Q$, our model can simultaneously predict an answer $A$ and generate a textual explanation $E$.

We adopt a pre-trained convolutional neural network (CNN) to extract a high-level representation $\phi$ of the input image $I$:

$$\phi = \mathrm{CNN}(I) = \{v_1, ..., v_P\} \tag{2}$$

where $v_i$ is the feature vector of the $i$-th image patch and $P$ is the total number of patches. We experiment with three types of image features:

– Global. We extract the outputs of the final pooling layer ('pool5') of ResNet-152 [12] as global features of the image. For these image features, $P = 1$, and visual attention is not applicable.
– Grid. We extract the outputs of the final convolutional layer ('res5c') of ResNet-152 as the feature map of the image, which corresponds to a uniform grid of equally-sized image patches. In this case, $P = 7 \times$ …
– Bottom-up. [1] proposes a new type of image features based on object detection techniques. They utilize Faster R-CNN to propose salient regions, each with an associated feature vector from ResNet-101. The bottom-up image features provide a more natural basis at the object level for attention to be considered. We choose $P = 36$ in this case.

The question $Q$ is tokenized and encoded into word embeddings $W_q = \{w_1, ..., w_{T_q}\}$. Then the word embeddings are fed into a gated recurrent unit [5]: $q = \mathrm{GRU}(W_q)$. We use the final state of the GRU as the representation of the question.

We use the classical question-guided soft attention mechanism, similar to most modern VQA models. For each patch in the image, the feature vector $v_i$ and the question embedding $q$ are first projected by non-linear layers to the same dimension. Next, we use the Hadamard product (i.e., element-wise multiplication) to combine the projected representations and input the result to a linear layer to obtain a scalar attention weight associated with that image patch. The attention weights $\boldsymbol{\tau}$ are normalized over all patches with a softmax function. Finally, the image features from all patches are weighted by the normalized attention weights and summed into a single vector $v$ as the representation of the attended image. The formulas are as follows, and we omit the bias terms for simplicity:

$$\tau_i = w^T\left(\mathrm{Relu}(W_v v_i) \odot \mathrm{Relu}(W_q q)\right), \qquad \boldsymbol{\alpha} = \mathrm{softmax}(\boldsymbol{\tau}), \qquad v = \sum_{i=1}^{P} \alpha_i v_i \tag{3}$$

Note that we adopt a simple one-glimpse, one-way attention, as opposed to the complex schemes proposed by recent works [31,16,18].

Next, the representations of the question $q$ and the image $v$ are projected to the same dimension by non-linear layers and then fused by a Hadamard product:

$$h = \mathrm{Relu}(W_{qh} q) \odot \mathrm{Relu}(W_{vh} v) \tag{4}$$

where $h$ is a joint representation of the question and the image, which is then fed to the subsequent modules for answer prediction and explanation generation.

We formulate the answer prediction task as a multi-label regression problem, instead of a single-label classification problem as in many other works. A set of candidate answers is pre-determined from all the correct answers in the training set that appear more than 8 times. This leads to $N = 3129$ candidate answers. Each question in the dataset has $K = 10$ human-annotated answers, which are sometimes not the same, especially when the question is ambiguous or subjective and has multiple correct or synonymous answers. To fully exploit the disagreement between annotators, we adopt soft accuracies as the regression targets.
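The question-guided attention of Eq. (3) and the fusion of Eq. (4) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the weight matrices are tiny identity matrices chosen only so the numbers are easy to follow, and bias terms are omitted as in the text.

```python
import math


def relu(x):
    return [max(0.0, a) for a in x]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * a for m, a in zip(row, x)) for row in matrix]


def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(patches, q, W_v, W_q, w):
    """Eq. (3): one-glimpse soft attention over image patches."""
    q_proj = relu(matvec(W_q, q))
    tau = []
    for v_i in patches:
        joint = hadamard(relu(matvec(W_v, v_i)), q_proj)
        tau.append(sum(w_k * j_k for w_k, j_k in zip(w, joint)))
    alpha = softmax(tau)
    dim = len(patches[0])
    v_att = [sum(a * v_i[d] for a, v_i in zip(alpha, patches)) for d in range(dim)]
    return alpha, v_att


def fuse(q, v_att, W_qh, W_vh):
    """Eq. (4): joint question-image representation h."""
    return hadamard(relu(matvec(W_qh, q)), relu(matvec(W_vh, v_att)))


# Toy setup (hypothetical): two patches, 2-d features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
alpha, v_att = attend(patches, q, identity, identity, w=[1.0, 1.0])
h = fuse(q, v_att, identity, identity)
print(alpha, v_att, h)
```

Here the first patch agrees with the question vector, so it receives the larger attention weight, and the attended feature is a weighted mix of both patches.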
The accuracy for each answer is computed as:

$$\mathrm{Accuracy}(a) = \frac{1}{K}\sum_{k=1}^{K}\min\left(\frac{\sum_{1 \le j \le K,\, j \ne k} \mathbb{1}(a = a_j)}{3},\ 1\right) \tag{5}$$

Such soft targets provide more information for training and are also in line with the evaluation metric.

The joint representation $h$ is input into a non-linear layer and then passed through a linear mapping to predict a score for each answer candidate:

$$\hat{s} = \mathrm{sigmoid}\left(W_o\,\mathrm{Relu}(W_f h)\right) \tag{6}$$

The sigmoid function squeezes the scores into $(0, 1)$ as the probability of the answer candidate. Our loss function is similar to the binary cross-entropy loss while using soft targets:

$$L_{vqa} = -\sum_{i=1}^{M}\sum_{j=1}^{N}\left[s_{ij}\log\hat{s}_{ij} + (1 - s_{ij})\log(1 - \hat{s}_{ij})\right] \tag{7}$$

where $M$ is the number of training samples and $s$ is the soft target computed in Eq. (5). This final step can be seen as a regression layer that predicts the correctness of each answer candidate.

To generate an explanation, we adopt an LSTM-based language model that takes the joint representation $h$ as input. Given the ground-truth explanation $E = \{w_0, w_1, ..., w_{T_e}\}$, the loss function is:

$$L_{vqe} = -\log p(E \mid h) = -\sum_{t=0}^{T_e}\log p(w_t \mid h, w_0, ..., w_{t-1}) \tag{8}$$

The final loss of multi-task learning is the sum of the VQA and VQE losses:

$$L = L_{vqa} + L_{vqe} \tag{9}$$

We use 300-dimensional word embeddings, initialized with pre-trained GloVe vectors [21]. For the question embedding, we use a single-layer GRU with 1024 hidden units. For explanation generation, we use a single-layer forward LSTM with 1024 hidden units. The question embedding and the explanation generation share the word embedding matrix to reduce the number of parameters. We use the Adam solver with a fixed learning rate of 0.01 and a batch size of 512. We use weight normalization [24] to accelerate the training. Dropout and early stopping (15 epochs) are used to reduce overfitting.
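The soft targets of Eq. (5) and the losses of Eqs. (7) to (9) can be worked through on a small example. The sketch below is an assumption-laden illustration rather than the training code: the function names are hypothetical, the ten annotator answers are made up, the losses are computed for a single sample, and the `eps` clipping inside the logarithm is an added numerical safeguard that is not part of the formulas.

```python
import math


def soft_accuracy(candidate, human_answers):
    """Eq. (5): average over annotators k of min(#matches among the others / 3, 1)."""
    K = len(human_answers)
    total = 0.0
    for k in range(K):
        matches = sum(
            1 for j, a in enumerate(human_answers) if j != k and a == candidate
        )
        total += min(matches / 3.0, 1.0)
    return total / K


def soft_bce(targets, predictions, eps=1e-12):
    """Eq. (7) for one sample: binary cross-entropy with soft targets.

    The eps clipping is an assumption of this sketch to avoid log(0).
    """
    loss = 0.0
    for s, s_hat in zip(targets, predictions):
        s_hat = min(max(s_hat, eps), 1.0 - eps)
        loss -= s * math.log(s_hat) + (1.0 - s) * math.log(1.0 - s_hat)
    return loss


def explanation_nll(step_probabilities):
    """Eq. (8): negative log-likelihood of the ground-truth words."""
    return -sum(math.log(p) for p in step_probabilities)


def multitask_loss(l_vqa, l_vqe):
    """Eq. (9): unweighted sum of the two task losses."""
    return l_vqa + l_vqe


# Hypothetical annotations: 2 annotators said "yes", 8 said "no".
answers = ["yes"] * 2 + ["no"] * 8
target_yes = soft_accuracy("yes", answers)
target_no = soft_accuracy("no", answers)

l_vqa = soft_bce([target_yes, target_no], [0.5, 0.5])
l_vqe = explanation_nll([0.5, 0.25])  # made-up per-word probabilities
total_loss = multitask_loss(l_vqa, l_vqe)
print(target_yes, target_no, total_loss)
```

With these annotations, "yes" receives a soft target of 0.6 and "no" a target of 1.0, so a minority answer still contributes a non-zero regression target instead of being treated as simply wrong.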

Model variants.

We experiment with the following model variants:

– Q-E: generating explanation from question only.
– I-E: generating explanation from image only.
– QI-E: generating explanation from question and image and only training the branch of explanation generation.
– QI-A: predicting answer from question and image and only training the branch of answer prediction.
– QI-AE: predicting answer and generating explanations, training both branches.
– QI-AE(relevant): predicting answer and generating explanation and training both branches. The explanation used in this variant is the relevant caption obtained in the process of explanation synthesis in Section 3.1.
– QI-AE(random): predicting answer and generating explanation and training both branches. The explanation is randomly selected from the ground-truth captions for the same image except the relevant caption.

In this section, we evaluate the task of explanation generation. Table 3 shows the performance of all model variants on the validation split of the VQA-E dataset. First, the I-E model outperforms Q-E. This implies that it is easier to generate an explanation from only the image than from only the question, and this image bias is contrary to the well-known language bias in VQA, where it is easier to predict an answer from only the question than from only the image. Second, the QI-E models outperform both I-E and Q-E by a large margin, which means that both the question and the image are critical for generating good explanations. The attention mechanism is helpful for the performance, and bottom-up image features are consistently better than grid image features. Finally, the QI-AE model using bottom-up image features improves the performance further and achieves the best performance across all evaluation metrics. This shows that the supervision on the answer side is helpful for the explanation generation task, thus demonstrating the effectiveness of our multi-task learning scheme.

Table 3. Performance of the explanation generation task on the validation split of the proposed VQA-E dataset, where B-N, M, R, and C are short for BLEU-N, METEOR, ROUGE-L, and CIDEr-D. All scores are reported in percentage (%).

Model  Image Features  B-1    B-2    B-3    B-4   M      C      R
Q-E    -               26.80  10.90   4.20  1.80   7.98  13.42  24.90
I-E    Global          32.50  17.20   9.30  5.20  12.38  48.58  29.79
QI-E   Global          34.70  19.30  11.00  6.50  14.07  61.55  31.87
QI-E   Grid            36.30  21.10  12.50  7.60  15.50  73.70  34.00
QI-E   Bottom-up       38.00  22.60  13.80  8.60  16.57  84.07  34.92
QI-AE  Global          35.10  19.70  11.30  6.70  14.40  64.62  32.39
QI-AE  Grid            38.30  22.90  14.00  8.80  16.85  87.04  35.16
QI-AE  Bottom-up

In this section, we evaluate the task of answer prediction, as shown in Table 4. Overall, the QI-AE models consistently outperform QI-A models across all question types. This indicates that forcing the model to explain can help it predict a more accurate answer. We argue that the supervision on explanation in QI-AE models can alleviate the language bias in the QI-A models, because in order to generate a good explanation, the model has to fully exploit the image content, learn to attend to important regions, and explicitly interpret the attended regions in the context of questions. In contrast, during the training of QI-A models without explanations, when an answer can be guessed from the question itself, the model can easily get the loss down to zero by understanding the question only, regardless of the image content. In this case, the training sample is not fully exploited to help the model learn how to attend to the important regions. Another observation from Table 4 further supports our argument: the additional supervision on explanation produces a much bigger improvement on the attention-based models (Grid and Bottom-up) than on the models without attention (Global).

QI-AE(random)-Bottom-up produces a much lower accuracy than QI-AE-Bottom-up, even lower than QI-A-Bottom-up. This implies that low-quality or irrelevant explanations might confuse the model, thus leading to a big drop in the performance. It also relieves the concern that the improvement is brought by learning to describe the image, rather than explaining the answer. This further substantiates the effectiveness of the additional supervision on explanation.

Table 5 presents the performance of our method and the state-of-the-art approaches on the test-standard split of the VQA v2 dataset. Our method outperforms the state-of-the-art methods over the answer types 'Yes/No' and 'Other' as well as in the overall accuracy, while producing a slightly lower accuracy over the answer type 'Number' than BUTD [26,1].

Table 4. Performance of the answer prediction task on the validation split of the VQA v2 dataset. Accuracies in percentage (%) are reported.

Model            Image features  All    Yes/No  Number  Other
QI-A             Global          57.26  77.19   39.73   46.74
QI-A             Grid            59.25  76.31   39.99   51.38
QI-A             Bottom-up       61.78  78.63   41.30   52.54
QI-AE            Global          57.92  78.01   40.46   47.25
QI-AE            Grid            60.57  78.35   39.36   52.66
QI-AE            Bottom-up
QI-AE(random)    Bottom-up       58.74  78.75   40.79   48.26
QI-AE(relevant)  Bottom-up       62.18  79.02   41.07   53.26

Table 5. Performance comparison with the state-of-the-art VQA methods on the test-standard split of the VQA v2 dataset. BUTD-ensemble is an ensemble of 30 models and is excluded from the ranking. Accuracies in percentage (%) are reported.

Method             All    Yes/No  Number  Other
Prior [8]          25.98  61.20    0.36    1.17
Language-only [8]  44.26  67.01   31.55   27.37
d-LSTM+n-I [8]     54.22  73.46   35.18   41.83
MCB [7,8]          62.27  78.82   38.28   53.36
BUTD [26,1]        65.67  82.20

In this section, we show qualitative examples to demonstrate the strength of our multi-task VQA-E model, as shown in Fig. 7. Overall, the QI-AE model can generate relevant and complementary explanations for the predicted answers. For example, in Fig. 7(a), the QI-AE model not only predicts the correct answer 'Yes', but also provides more details about the 'kitchen', i.e., 'fridge', 'sink', and 'cabinets'. Besides, the QI-AE model can better localize the important regions than the QI-A model. As shown in Fig. 7(b), the QI-AE model gives the biggest attention weight to the person's hand and thus predicts the right answer 'Feeding giraffe', while the QI-A model focuses more on the giraffe, leading to a wrong answer 'Standing'. In Fig. 7(c), both the QI-AE and QI-A models attend to the right region, but the two models predict opposite answers. This interesting contrast implies that the QI-AE model, which has to fully exploit the image content to generate an explanation, can better understand the attended region than the QI-A model, which only needs to predict a short answer.

Fig. 7. Qualitative comparison between the QI-A and QI-AE models (both using bottom-up image features). We visualize the attention by rendering a red box over the region that has the biggest attention weight.

In this work, we have constructed a new dataset and proposed the task of VQA-E to promote research on justifying answers for visual questions. Explanations in our dataset are of high quality for visually specific questions, but inadequate for subjective ones whose evidence is indirect. For subjective questions, we will need extra knowledge bases to find good explanations. We have also proposed a novel multi-task learning architecture for the VQA-E task. The additional supervision from explanations not only enables our model to generate reasons to justify predicted answers, but also brings a big improvement in the accuracy of answer prediction. Our VQA-E model is able to better localize and understand the important regions in images than the original VQA model. In the future, we will adopt more advanced approaches to train our model, such as the reinforcement learning used in image captioning [23].
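The idea of adding explanation supervision to answer prediction can be sketched as a joint objective. The following is a minimal illustration only, assuming a binary cross-entropy loss over candidate answers, a word-level negative log-likelihood for the explanation, and a simple weighted sum of the two; the function names and the weight `lam` are hypothetical and are not claimed to match the exact formulation used in our model.

```python
import math
from typing import Sequence


def answer_loss(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Binary cross-entropy between predicted answer probabilities and soft targets."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
        for p, t in zip(probs, targets)
    )


def explanation_loss(word_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the ground-truth explanation words."""
    return -sum(math.log(p) for p in word_probs)


def joint_loss(probs, targets, word_probs, lam: float = 1.0) -> float:
    """Assumed multi-task objective: answer loss plus a weighted explanation loss."""
    return answer_loss(probs, targets) + lam * explanation_loss(word_probs)


# Toy example (hypothetical values): two candidate answers, a three-word explanation.
loss = joint_loss([0.9, 0.2], [1.0, 0.0], [0.8, 0.6, 0.7])
print(round(loss, 4))
```

Setting `lam` to zero recovers answer-only training, which corresponds to the setting without explanation supervision.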

Acknowledgements.

We thank Qianyi Wu and others for helpful feedback on the user study. This research is partially supported by an NTU-CoE Grant and the Data Science & Artificial Intelligence Research Centre@NTU (DSAIR). Jiebo Luo would like to thank the support of Adobe and NSF Award


References

1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. CVPR (2018)
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: ICCV (2015)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. ICLR (2014)
4. Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: Data collection and evaluation server. CoRR (2015)
5. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
6. Das, A., Agrawal, H., Zitnick, L., Parikh, D., Batra, D.: Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 90–100 (2017)
7. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP (2016)
8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in visual question answering. CVPR (2017)
9. Gu, J., Cai, J., Wang, G., Chen, T.: Stack-captioning: Coarse-to-fine learning for image captioning. AAAI (2018)
10. Gu, J., Wang, G., Cai, J., Chen, T.: An empirical study of language CNN for image captioning. In: ICCV (2017)
11. Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: VizWiz grand challenge: Answering visual questions from blind people. CVPR (2018)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
13. Heilman, M., Smith, N.A.: Good question! Statistical ranking for question generation. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 609–617. HLT ’10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1857999.1858085
14. Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., Darrell, T.: Generating visual explanations. In: ECCV. pp. 3–19. Springer (2016)
15. Ilievski, I., Yan, S., Feng, J.: A focused dynamic attention model for visual question answering. ECCV (2016)
16. Kazemi, V., Elqursh, A.: Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162 (2017)
17. Li, Q., Fu, J., Yu, D., Mei, T., Luo, J.: Tell-and-answer: Towards explainable visual question answering using attributes and captions. arXiv preprint arXiv:1801.09041 (2018)
18. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS. pp. 289–297 (2016)
19. Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. CVPR (2017)
20. Park, D.H., Hendricks, L.A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal explanations: Justifying decisions and pointing to the evidence. In: CVPR (2018)
21. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: EMNLP. pp. 1532–1543 (2014)
22. Ren, M., Kiros, R., Zemel, R.: Image question answering: A visual semantic embedding model and a new dataset. NIPS1