Question Type Guided Attention in Visual Question Answering
Yang Shi, Tommaso Furlanello, Sheng Zha, Animashree Anandkumar
Yang Shi*¹, Tommaso Furlanello², Sheng Zha³, and Animashree Anandkumar³,⁴
¹ University of California, Irvine — [email protected]
² University of Southern California — [email protected]
³ Amazon AI — {zhasheng}, {anima}@amazon.com
⁴ California Institute of Technology
* Work partially done while the author was working at Amazon AI
Abstract.
Visual Question Answering (VQA) requires the integration of feature maps with drastically different structures. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. Many previous works use complex models to extract feature representations but neglect to use high-level information summaries such as question types in learning. In this work, we propose Question Type-guided Attention (QTA). It utilizes the information of the question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures with extensive input ablation studies over the TDIUC dataset and show that QTA systematically improves the performance by more than 5% across multiple question type categories such as "Activity Recognition", "Utility" and "Counting" compared to the state of the art. By adding QTA to the state-of-the-art model MCB, we achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task extension to predict question types, which generalizes QTA to applications that lack question type labels, with minimal performance loss.
Keywords:
Visual question answering, Attention, Question type, Feature selection, Multi-task
1 Introduction

The relative maturity and flexibility of deep learning allow us to build upon the success of computer vision [17] and natural language processing [13, 20] to face new complex and multimodal tasks. Visual Question Answering (VQA) [4] focuses on providing a natural language answer given any image and any free-form natural language question. To achieve this goal, information from multiple modalities must be integrated. Visual and lexical inputs are first processed using specialized encoding modules and then integrated through differentiable operators. Image features are usually extracted by convolutional neural networks [7], while recurrent neural networks [13, 26] are used to extract question features. Additionally, attention mechanisms [30–32] force the system to look at informative regions in both text and vision. The attention weight is calculated from the correlation between language and vision features and is then multiplied with the original feature.

Previous works explore new features to represent vision and language. Pre-trained ResNet [12] and VGG [24] are commonly used in VQA vision feature extraction. The authors in [27] show that post-processing CNN outputs with region-specific image features [3] such as Faster R-CNN [22] can lead to an improvement of VQA performance. Along with generating language features at either the sentence level or the word level using an LSTM [13] or word embeddings, Lu et al. [19] propose to model the question at the word level, phrase level, and entire-question level in a hierarchical fashion.

Through extensive experimentation and ablation studies, we notice that the roles of "raw" visual features from ResNet and processed region-specific features from Faster R-CNN are complementary and lead to improvements over different subsets of question types. However, we also notice that a readily available piece of information in VQA datasets, the question/answer type, is omitted in training. Generally, each sample in any VQA dataset contains one image file, one natural language question/answer, and sometimes an answer type. A lot of work uses the answer type to analyze accuracy per type in the results [4] but neglects to use it during learning. TDIUC [15] is a recently released dataset that contains a question type for each sample. Compared to the answer type, the question type has less variety and is easier to interpret when we only have the question.

The focus of this work is the development of an attention mechanism that exploits high-level semantic information on the question type to guide the visual encoding process. This procedure introduces information leakage between modalities before the classical integration phase that improves the performance on the VQA task. Specifically, we introduce a novel VQA architecture,
Question Type-guided Attention (QTA), which dynamically gates the contribution of ResNet and Faster R-CNN features based on the question type. Our results with QTA allow us to integrate the information from multiple visual sources and obtain gains across all question types. A general VQA network with our QTA is shown in Figure 1.
Fig. 1: General VQA network with QTA. The vision and text feature extractors feed a combination module and an answer predictor; QTA guides the visual features. Example: Q: "What's her mustache made of?", A: "Banana", question type: "Subordinate Object Recognition".
The contributions of this paper are: (1) We propose question type-guided attention to balance between bottom-up and top-down visual features, which are respectively extracted from ResNet and Faster R-CNN networks. Our results show that QTA systematically improves the performance by more than 5% across multiple question type categories such as "Activity Recognition", "Utility" and "Counting" on the TDIUC dataset. By adding QTA to the state-of-the-art model MCB, we achieve a 3% improvement in overall accuracy. (2) We propose a multi-task extension that is trained to predict question types from the lexical inputs during training time and does not require ground truth labels during inference. We get more than 95% accuracy for the question type prediction while keeping the VQA task accuracy almost the same as before. (3) Our analysis reveals some problems in the TDIUC VQA dataset. Though the "Absurd" questions are intended to help reduce bias, they contain too many similar questions, specifically questions regarding color. This misleads the model into predicting the wrong question types. Our QTA model gets a 17% improvement in simple accuracy compared to the baseline in [15] when we exclude absurd questions in training.
2 Related Work

The VQA task was first proposed in [4]. It focuses on providing a natural language answer given any image and any free-form natural language question. Collecting data and solving the task are equally challenging, as they require understanding the joint relation between image and language without any bias.
Datasets
VQA dataset v1 was first released by Antol et al. [4]. The dataset consists of two subsets: real images and abstract scenes. However, the inherent structure of our world is biased, and this results in a biased dataset. In other words, a specific question tends to have the same answer regardless of the image. For example, when people ask about the color of the sky, the answer is most likely blue or black; an answer such as yellow is unusual. This becomes a bottleneck when we show the model a yellow sky and ask it to answer the question. Goyal et al. [10] released VQA dataset v2. This dataset pairs the same question with similar images that lead to different answers, to reduce the sample bias. Agrawal et al. [2] also noticed that every question type has a different prior distribution of answers. Based on that, they propose GVQA and new splits of VQA v1/v2. In the new splits, the distribution of answers per question type is different in the test data compared to the training data. Zhang et al. [33, 34] also propose a method to reduce bias in the abstract scenes dataset at the question level. By extracting representative word tuples from questions, they can identify and control the balance for each question. VizWiz [11] is another recently released dataset that uses pictures taken by blind people. Some pictures are of poor quality, and the questions are spoken. These data collection methods help reduce bias in the dataset.

Johnson et al. [14] introduce the Compositional Language and Elementary Visual Reasoning (CLEVR) diagnostic dataset that focuses on reasoning. Strub et al. [25] propose a two-player guessing game: guess a target in a given image with a sequence of questions and answers. This requires both visual question reasoning and spatial reasoning. The Task Driven Image Understanding Challenge dataset (TDIUC) [15] contains a total of over 1.6 million questions in 12 different types. It contains images and annotations from MSCOCO [18] and Visual Genome [16].
The key difference between TDIUC and the previous VQA v1/v2 datasets is the categorization of questions: each question belongs to one of 12 categories. This allows a task-oriented evaluation such as per question-type accuracies. They also include an "Absurd" question category, in which questions are irrelevant to the image contents, to help balance the dataset.
Feature Selection
VQA requires solving several tasks at once involving both visual and textual input: visual perception, question understanding, and reasoning. Usually, features are extracted with convolutional neural networks [7] from the image and with recurrent neural networks [13, 26] from the text.

Pre-trained ResNet and VGG are commonly used in VQA vision feature extraction. The authors in [27] show that post-processing CNN outputs with region-specific image features [3] can lead to an improvement of VQA performance. Specifically, they use a pre-trained Faster R-CNN model to extract image features for the VQA task. They won the VQA challenge 2017.

On the language side, pre-trained word embeddings such as Word2Vec [20] are used for text feature extraction. There is a discussion about the sufficiency of the language input for the VQA task. Agrawal et al. [1] have shown that state-of-the-art VQA models converge to the same answer even if given only half of the question instead of the whole sentence.
Generic Methods
Information from both modalities is used jointly through means of combination, such as concatenation, product, or sum. In [4], the authors propose a baseline that combines the LSTM embedding of the question and the CNN embedding of the image via a point-wise multiplication followed by a multi-layer perceptron classifier.
Pooling Methods
Pooling methods are widely used in visual tasks to combine information from various streams into one final feature representation. Common pooling methods such as average pooling and max pooling bring the property of translation invariance and robustness to elastic distortions at the cost of spatial locality. Bilinear pooling can preserve spatial information; it is performed with the outer product between two feature maps. However, this operation entails a high output dimension ($O(MN)$ for feature maps of dimension $M$ and $N$). This growth with respect to the feature map dimensions renders it too costly to apply to large real image datasets. There have been several proposals for new pooling techniques to address this problem:

– Count sketch [5] is applied as a feature hashing operator to avoid dimension expansion in bilinear pooling. Given a vector $a \in \mathbb{R}^n$, a random hash function $f : [n] \to [b]$ and a random sign function $s : [n] \to \{\pm 1\}$, the count sketch [5] operator $cs(a, f, s) \in \mathbb{R}^b$ is:
$$cs(a, f, s)[j] = \sum_{i:\, f[i] = j} s[i]\, a[i], \qquad j \in \{1, \dots, b\} \quad (1)$$
Gao et al. [9] use convolution layers from two different neural networks as the local descriptor extractors of the image and combine them using count sketch. "α-pooling" [23] allows the network to learn the pooling strategy: a continuous transition between linear and polynomial pooling. They show that a higher α gives a larger gain for fine-grained image recognition tasks. However, as α goes up, the computational complexity increases polynomially.

– Fukui et al. [8] use count sketch as a pooling method in VQA tasks and obtain the best results on VQA dataset v1 in the VQA challenge 2016. They compute count sketch approximations of the visual and textual representations at each spatial location. Given a text feature $v \in \mathbb{R}^L$ and image features $I \in \mathbb{R}^{C \times H \times W}$, Fukui et al. [8] propose MCB as:
$$MCB(I[:,h,w] \otimes v)[t,h,w] = \big(cs(I[:,h,w], f, s) \star cs(v, f, s)\big)[t,h,w] = IFFT\big(FFT(cs(I[:,h,w], f, s))[t,h,w] \circ FFT(cs(v, f, s))[t]\big),$$
$$h \in \{1, \dots, H\}, \quad w \in \{1, \dots, W\}, \quad t \in \{1, \dots, b\} \quad (2)$$
Here $\otimes$ denotes the outer product, $\circ$ denotes the element-wise product, and $\star$ denotes the convolution operator. This procedure preserves spatial information in the image feature.
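To make the two operations above concrete, the following minimal NumPy sketch implements the count sketch operator of Eq. (1) and the FFT-based MCB combination of Eq. (2) at a single spatial location; the feature sizes and sketch dimension b below are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

def count_sketch(a, f, s, b):
    """Count sketch cs(a, f, s) of Eq. (1): project an n-dim vector to b dims
    using a hash function f: [n] -> [b] and random signs s: [n] -> {-1, +1}."""
    out = np.zeros(b)
    for i in range(len(a)):
        out[f[i]] += s[i] * a[i]
    return out

def mcb_single_location(img_vec, txt_vec, f_img, s_img, f_txt, s_txt, b):
    """MCB of Eq. (2) at one spatial location: the circular convolution of the
    two count sketches, computed as an element-wise product in the FFT domain."""
    cs_img = count_sketch(img_vec, f_img, s_img, b)
    cs_txt = count_sketch(txt_vec, f_txt, s_txt, b)
    return np.real(np.fft.ifft(np.fft.fft(cs_img) * np.fft.fft(cs_txt)))

# Toy example with placeholder sizes (C: image channels, L: text feature dim).
rng = np.random.default_rng(0)
C, L, b = 512, 300, 1024
f_img, s_img = rng.integers(0, b, C), rng.choice([-1, 1], C)
f_txt, s_txt = rng.integers(0, b, L), rng.choice([-1, 1], L)
pooled = mcb_single_location(rng.normal(size=C), rng.normal(size=L),
                             f_img, s_img, f_txt, s_txt, b)
print(pooled.shape)  # (1024,)
```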
Attention

Focusing on the objects in the image that are related to the question is key to understanding the correlation between the image and the question. Attention mechanisms are used to address this problem. There are soft attention and hard attention [31], depending on whether the attention term/loss function is differentiable or not. Yang et al. [32] and Xu et al. [30] propose word-guided spatial attention specifically for the VQA task. The attention weight at each spatial location is calculated from the correlation between the embedded question feature and the embedded visual features; the attended locations are those with maximum correlation. Wang et al. [28] explore mechanisms of triplet attention that interact between the image, question, and candidate answers based on image-question pairs.
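The question-guided spatial attention described above can be summarized by the following sketch (our own simplified illustration, not the exact formulation of [30, 32]): each location is scored by its correlation with the question embedding, the scores are normalized with a softmax, and the feature map is aggregated with those weights. The shared embedding size is a placeholder; real systems typically add learned projections before the correlation.

```python
import numpy as np

def spatial_attention(image_feat, question_feat):
    """Question-guided spatial attention.
    image_feat: (H, W, D) spatial feature map; question_feat: (D,) embedding.
    Returns the attention-weighted visual feature and the (H, W) weight map."""
    H, W, D = image_feat.shape
    scores = image_feat.reshape(-1, D) @ question_feat            # correlation per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                       # softmax over H*W locations
    attended = (weights[:, None] * image_feat.reshape(-1, D)).sum(axis=0)
    return attended, weights.reshape(H, W)

# Toy example: both modalities embedded into a shared 256-dim space (placeholder size).
rng = np.random.default_rng(0)
att, w = spatial_attention(rng.normal(size=(14, 14, 256)), rng.normal(size=256))
print(att.shape, w.shape)  # (256,) (14, 14)
```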
3 Question Type Guided Attention

The question type is very important for predicting the answer, regardless of whether we have the corresponding image or not. For example, questions starting with "how many" will mostly lead to numerical answers. Agrawal et al. [1] have shown that state-of-the-art VQA models converge to the same answer even if given only half of the question instead of the whole sentence. Besides that, inspired by [27], we are interested in combining bottom-up and top-down visual features in the VQA task. To get a deeper understanding of the visual feature preference of different questions, we seek an attention mechanism between these two feature sources. Since the question type represents the question, we propose Question Type-guided Attention (QTA).

Given several independent image features $F_1, F_2, \dots, F_k$, such as features from ResNet, VGG, or Faster R-CNN, we concatenate them into one image feature $F = [F_1, F_2, \dots, F_k] \in \mathbb{R}^M$. Assume there are $N$ different question types. QTA is defined as $F \circ WQ$, where $Q \in \mathbb{R}^N$ is the one-hot encoding of the question type and $W \in \mathbb{R}^{M \times N}$ is a hidden weight matrix. We can learn the weight by backpropagation through the network. In other words, we learn a question type embedding and use it as an attention weight. QTA can be used in both generic and complex pooling models. In Figure 2, we show a simple concatenation model with question type as input; we describe it in detail in Section 4.
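A minimal sketch of the QTA operation $F \circ WQ$: because $Q$ is one-hot, $WQ$ simply selects one column of the learned matrix $W$, which then gates the concatenated image feature element-wise. The feature dimensions below are placeholders; $W$ would be learned by backpropagation as described above.

```python
import numpy as np

def qta(features, question_type, W):
    """Question Type-guided Attention F o (W Q) of Section 3.
    features:      concatenated image features F, shape (M,)
    question_type: integer index of the question type (one-hot Q, implicitly)
    W:             weight matrix of shape (M, N) for N question types."""
    attention = W[:, question_type]        # W Q with a one-hot Q == column selection
    return features * attention            # element-wise gating of F

# Toy example: two visual descriptors concatenated; dimensions are placeholders,
# and W would be learned rather than random. TDIUC has N = 12 question types.
rng = np.random.default_rng(0)
F = np.concatenate([rng.normal(size=2048), rng.normal(size=2048)])  # M = 4096 (placeholder)
W = rng.normal(size=(F.size, 12))
print(qta(F, question_type=3, W=W).shape)  # (4096,)
```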
Fig. 2: Concatenation model with QTA structure for the VQA task (CATL-QTA W in Section 4)

Fig. 3: Concatenation model with QTA structure for multi-task learning (CATL-QTA-M W in Section 4)

To fully exploit image features in different channels and preserve spatial information, we also propose MCB with question type-guided image attention in Figure 4.

One obvious limitation of QTA is that it requires the question type label. In a real-world scenario, the question type for each question may not be available. In this case, it is still possible to predict the question type from the text and use it as input to the QTA network. Thus, we propose a multi-task model that addresses the VQA task along with the prediction of the question type, shown in Figure 3. This model operates in the setting where the true question type is available only at training time. In Section 5, we also show through experiments that predicting the question type from the question text is a relatively easy task, which makes our method generalizable to VQA settings that lack question types.

Fig. 4: MCB model with QTA structure (MCB-QTA in Section 4)

4 Experiments

In this section, we describe the dataset in Section 4.1, evaluation metrics in Section 4.2, model features in Section 4.3, and model structures in Section 4.4.
4.1 Dataset

Our experiments are conducted on the Task Driven Image Understanding Challenge dataset (TDIUC) [15], which contains over 1.6 million questions in 12 different types. This dataset includes VQA v1 and Visual Genome, with a total of 122,429 training images and 57,565 test images. The annotation sources are MSCOCO (VQA v1), Visual Genome annotations, and manual annotations. TDIUC introduces absurd questions that force an algorithm to determine whether a question is valid for a given image. There are 1,115,299 training questions and 538,543 test questions in total. The total number of samples is 3 times larger than that of the VQA v1 dataset.
4.2 Evaluation Metrics

There are 12 different question types in the TDIUC dataset, as mentioned in Section 2. We calculate the simple accuracy for each type separately and also report the arithmetic and harmonic means across all per question-type (MPT) accuracies.
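For reference, the two summary metrics can be computed from the per-question-type accuracies as in the following sketch; the harmonic MPT penalizes models that perform poorly on any single question type.

```python
import numpy as np

def mean_per_type(per_type_accuracy):
    """Arithmetic and harmonic means over per-question-type accuracies (MPT)."""
    acc = np.asarray(per_type_accuracy, dtype=float)
    arithmetic_mpt = acc.mean()
    harmonic_mpt = len(acc) / np.sum(1.0 / acc)
    return arithmetic_mpt, harmonic_mpt

# Toy example with made-up per-type accuracies (one value per question type).
print(mean_per_type([0.93, 0.60, 0.31, 0.86]))
```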
4.3 Model Features

Image feature We use the output of the "pool" layer of a 152-layer ResNet as the image feature baseline; the output is a three-dimensional (channel × height × width) feature map. Faster R-CNN [22] focuses on object detection and classification. Teney et al. [27] use it to extract object-oriented features for the VQA dataset and show better performance compared to models using the ResNet feature. We fix the number of detected objects to 36 and extract the image features based on their pre-trained Faster R-CNN model. As a result, the extracted image feature is a 36-row matrix with one feature vector per detected object. To fit the MCB model, which requires a spatial representation, we reshape the 36 object features into a 6 × 6 spatial grid.
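The following sketch illustrates, under our own assumptions, how the two visual sources can be prepared: the 36 Faster R-CNN object features are arranged on a 6 × 6 grid when MCB needs a spatial map, and pooled into a single vector for the concatenation models. The feature dimensions and the use of mean pooling for the vector form are placeholders, not necessarily the exact procedure used in our experiments.

```python
import numpy as np

def prepare_visual_features(resnet_vec, frcnn_objs):
    """resnet_vec: (D1,) global ResNet descriptor (placeholder dimension).
    frcnn_objs:  (36, D2) Faster R-CNN features, one row per detected object."""
    spatial = frcnn_objs.reshape(6, 6, -1)                            # pseudo-spatial map for MCB
    vector = np.concatenate([resnet_vec, frcnn_objs.mean(axis=0)])    # assumed pooling for MLP baselines
    return spatial, vector

# Placeholder dimensions for illustration only.
rng = np.random.default_rng(0)
spatial, vector = prepare_visual_features(rng.normal(size=2048),
                                          rng.normal(size=(36, 2048)))
print(spatial.shape, vector.shape)  # (6, 6, 2048) (4096,)
```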
Text feature We use a common word embedding library, 300-dimensional Word2Vec [20], as a pre-trained text feature: we sum the word embeddings over all words in the sentence. A two-layer LSTM is used as an end-to-end text feature extractor. We also use the encoder of the Google neural machine translation (NMT) system [29] as a pre-trained text feature and compare it with Word2Vec. The pre-trained NMT model is trained on the UN parallel corpus 1.0 in MXNet [6]; its BLEU score is 34.

4.4 Model Structures

Baseline models Our baselines are based on a one-layer MLP: a fully connected network classifier with one hidden layer with ReLU non-linearity, followed by a softmax layer. The input is the concatenation of the image and text features. There are 8192 units in the hidden layer. To compare different image and text features, we define CAT1, CAT1L, and CATL. To check the complementarity of the ResNet and Faster R-CNN features and show how they perform differently across question types, we set up the baseline CAT2.
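As an illustration of the simplest of these baselines (a CAT2-style model), the sketch below sums pre-trained word embeddings into a question feature, concatenates it with a visual descriptor, and classifies answers with a one-hidden-layer ReLU MLP followed by a softmax. The vocabulary, visual dimension, hidden size, and answer set here are tiny placeholders; the real models use an 8192-unit hidden layer and 300-dim embeddings.

```python
import numpy as np

def question_feature(tokens, word2vec):
    """Bag-of-embeddings question feature: the sum of the 300-dim Word2Vec
    vectors of all words found in the vocabulary."""
    vecs = [word2vec[t] for t in tokens if t in word2vec]
    return np.sum(vecs, axis=0)

def mlp_baseline(image_vec, question_vec, W1, b1, W2, b2):
    """One-hidden-layer MLP classifier (ReLU hidden layer, softmax output),
    applied to the concatenation of image and question features."""
    x = np.concatenate([image_vec, question_vec])
    h = np.maximum(0.0, W1 @ x + b1)            # ReLU hidden layer (8192 units in the paper)
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()                          # softmax over candidate answers

# Toy run with random weights and a tiny vocabulary (all sizes are placeholders).
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=300) for w in ["what", "color", "sky"]}
q = question_feature("what color is the sky".split(), w2v)
img = rng.normal(size=512)                      # concatenated visual descriptor (placeholder)
hidden, n_answers = 256, 10
W1, b1 = 0.01 * rng.normal(size=(hidden, img.size + 300)), np.zeros(hidden)
W2, b2 = 0.01 * rng.normal(size=(n_answers, hidden)), np.zeros(n_answers)
print(mlp_baseline(img, q, W1, b1, W2, b2).shape)  # (10,)
```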
Table 1: Baseline models
Name | Image feature | Text feature | Model
CAT1 | ResNet/Faster R-CNN vector feature | Skipthought/NMT/Word2Vec pre-trained feature | MLP
CAT1L | ResNet/Faster R-CNN vector feature | End-to-end 2-layer LSTM's last hidden state | MLP
CATL | Concatenation of ResNet and Faster R-CNN vector features | End-to-end 2-layer LSTM's last hidden state | MLP
CAT2 | Concatenation of ResNet and Faster R-CNN vector features | Skipthought/NMT/Word2Vec pre-trained feature | MLP
For the LSTM, the hidden state length is 1024 and the word embedding dimension is 300. Detailed definitions are in Table 1.

To further examine and explain our QTA proposal, we use more sophisticated feature integration operators as strong baselines to compare against.
MCB-A, as we mentioned in Section 2, is proposed in [8].
RAU [21] is a framework that combines the embedding, attention, and prediction operations together inside a recurrent network. We reference the results of these two models from [15].
QTA models
From the baseline analysis, we observe that ResNet and Faster R-CNN features are complementary to each other. Using the question type as guidance for image feature selection is the key to making the image feature stronger. Therefore, we propose QTA networks in an MLP model (CATL-QTA) and an MCB model (MCB-QTA). The output dimension of the count sketch in MCB is 8000. The structures are shown in Figures 2 and 4, and the descriptions are in Table 2.

To check whether the model benefits from the QTA mechanism or from the added question type information itself, we design networks that only use the question type embedding without attention. CAT-QT and CATL-QT are the two such networks, using the Word2Vec and LSTM lexical features respectively.

As mentioned in Section 3, we propose a multi-task network for QTA for the case where the question type label is not available at inference. CATL-QTA-M is a multi-task model based on CATL-QTA. The output of the LSTM is connected to a one-layer MLP that predicts the question type of the input question. The prediction result is then fed into the QTA part through argmax. The multi-task MLP is shown in Figure 3.
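A condensed PyTorch-style sketch of the multi-task idea behind CATL-QTA-M: the LSTM encodes the question, an auxiliary head predicts the question type, and its argmax is converted to the one-hot vector that drives QTA. Layer sizes, the vocabulary, the number of candidate answers, and the training losses are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskQTA(nn.Module):
    """Sketch of CATL-QTA-M: question-type prediction feeds QTA through argmax."""
    def __init__(self, vocab_size, img_dim=4096, emb_dim=300, hid_dim=1024,
                 n_types=12, n_answers=1000, mlp_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.type_head = nn.Linear(hid_dim, n_types)         # auxiliary question-type head
        self.qta = nn.Linear(n_types, img_dim, bias=False)   # W: one-hot type -> feature gate
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + hid_dim, mlp_hidden), nn.ReLU(),
            nn.Linear(mlp_hidden, n_answers))

    def forward(self, question_tokens, image_feat):
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                             # last hidden state, (B, hid_dim)
        type_logits = self.type_head(q)
        # Hard selection of the predicted type; during training the true type label
        # supervises type_logits through a separate cross-entropy loss.
        one_hot = F.one_hot(type_logits.argmax(dim=1), type_logits.size(1)).float()
        gated_img = image_feat * self.qta(one_hot)            # QTA: F * (W Q)
        answer_logits = self.classifier(torch.cat([gated_img, q], dim=1))
        return answer_logits, type_logits

# Toy forward pass with placeholder sizes.
model = MultiTaskQTA(vocab_size=5000)
ans, typ = model(torch.randint(0, 5000, (2, 10)), torch.randn(2, 4096))
print(ans.shape, typ.shape)  # torch.Size([2, 1000]) torch.Size([2, 12])
```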
5 Results

We first focus in Sections 5.1 and 5.2 on results concerning the complementarity of different features across question category types. For the visual domain, we explore the use of Faster R-CNN and ResNet features, while for the lexical domain we use NMT, LSTM, and pre-trained Word2Vec features. We then analyze the effect of question type, both as input and with QTA, on VQA tasks in Section 5.3. Finally, in the remaining subsections, we extend the basic concatenation QTA model to MCB-style pooling; introduce question type as both input and output during training such that the network can produce predicted question types during inference; and study in more depth the effect of the question category "Absurd" on the overall model performance across categories.
Table 2: QTA models
Name | Image feature | Text feature | Model
CATL-QTA | QTA-weighted pre-trained vector features from ResNet and Faster R-CNN | End-to-end 2-layer LSTM's last hidden state | MLP
MCB-QTA | QTA-weighted pre-trained spatial features from ResNet and Faster R-CNN | End-to-end 2-layer LSTM's last hidden state | MCB
CAT-QT | Concatenation of ResNet and Faster R-CNN vector features | Concatenation of Word2Vec pre-trained feature and a 1024-dim question type embedding | MLP
CATL-QT | Concatenation of ResNet and Faster R-CNN vector features | Concatenation of end-to-end 2-layer LSTM's last hidden state and a 1024-dim question type embedding | MLP
CATL-QTA-M | QTA-weighted pre-trained spatial features from ResNet and Faster R-CNN | End-to-end 2-layer LSTM's last hidden state | Multi-task MLP
Table 3 reports our extensive ablation analysis of simple concatenation models using multiple visual and lexical feature sources. From the results in the second and third columns, we see that overall the model with Faster R-CNN features outperforms the one using ResNet features when using NMT features. We show in column 4 that the feature sources are complementary, and their combination is better across most categories (in bold) with respect to the single-source models in columns 2 and 3. In columns 5–6, 7–8, and 9–10 we replicate the same comparison between ResNet and Faster R-CNN features using more sophisticated models to embed the lexical information. We reach more than a 10% accuracy increase, from 69.53% to 80.16%, using a simple concatenation model with an accurate selection of the feature type.
The first four columns in Table 3 show the results of models with text features from NMT. To fully explore the text feature extractor in the VQA system, we substitute the NMT pre-trained language feature extractor with a jointly trained two-layer LSTM. The improved performance of jointly training the text feature extractor can be appreciated by comparing the four left-most and right-most columns in Table 3. For example, comparing the second and fifth columns in Table 3, we get a 6% improvement using the LSTM while keeping the image feature and network the same.

We obtain the best model by concatenating the output of the LSTM and the pre-trained NMT/Word2Vec feature, as shown in Table 3. It gives us an improvement for "Utility and Affordances" when we look at the fifth and seventh columns. We find that Word2Vec is better than the NMT feature in the last four columns of Table 3. We think the better performance of Word2Vec with respect to the NMT encoder might be due to the single-sentence samples in the Word2Vec training set being structurally more similar to those in classical VQA datasets than the sentences used for training NMT models.
Table 3: Benchmark results of concatenation models on the TDIUC dataset using different image features and pre-trained language features. 1: ResNet feature and Skip-Gram feature. 2: ResNet feature and NMT feature. 3: Faster R-CNN feature and NMT feature. 4: ResNet feature and end-to-end LSTM feature. 5: Faster R-CNN feature and end-to-end LSTM feature. N denotes that an additional NMT embedding is concatenated to the LSTM output. W denotes that an additional Word2Vec embedding is concatenated to the LSTM output (the following tables use the same notation).
Accuracy (%); columns 1–10: CAT1 [15], CAT1, CAT1, CAT2, CAT1L, CAT1L, CAT1L N, CAT1L N, CAT1L W, CAT1L W
Scene Recognition 72.19 68.51 68.81
Sport Recognition 85.16 89.67 92.36
Color Attributes 43.69 32.90 34.35
Other Attributes 42.89 38.05
Activity Recognition 24.16 39.34 45.75
Positional Reasoning 25.15 25.63 27.16
Sub. Object Recognition 80.92 83.94 85.67
Absurd 96.96 94.98 94.77
Object Presence 69.43 77.21 77.90
Counting 44.82 48.46 52.18
Sentiment Understanding 53.00 43.45 46.49
Overall (Arithmetic MPT) 55.25 55.67 57.57 58.31 59.05 62.15 58.96 60.66 59.38
Overall (Harmonic MPT) 44.13 45.37 47.99 48.44 44.09
Table 4: QTA in concatenation models on the TDIUC dataset

Accuracy (%) | CATL | CATL-QTA | CATL W | CATL-QTA W
Scene Recognition | 93.18 | 93.45 | 93.31 | 93.80
Sport Recognition | 94.69 | 95.45 | 94.96 | 95.55
Color Attributes | 54.66 | 56.08 | 57.59 | 60.16
Other Attributes | 48.52 | 50.30 | 52.25 | 54.36
Activity Recognition | 53.36 | 58.43 | 54.59 | 60.10
Positional Reasoning | 32.73 | 31.94 | 33.63 | 34.71
Sub. Object Recognition | 86.56 | 86.76 | 86.52 | 86.98
Absurd | 95.03 | 100.00 | 98.01 | 100.00
Utility and Affordances | 29.01 | 23.46 | 29.01 | 31.48
Object Presence | 93.34 | 93.48 | 94.13 | 94.55
Counting | 50.08 | 49.93 | 52.97 | 53.25
Sentiment Understanding | 56.23 | 56.87 | 62.62 | 64.38
Overall (Arithmetic MPT) | 65.62 | 66.34 | 67.46 | 69.11
Overall (Harmonic MPT) | 55.95 | 54.60 | 57.83 | 60.08
Overall Accuracy | 82.23 | 83.62 | 83.92 | 85.03

Fig. 5: Evaluation of different ways to utilize information from the question type
We use QTA in concatenation models to study its effect. The framework is shown in Figure 2. In Table 4, we compare the network using the QTA-weighted feature with the same network using the unweighted concatenated image feature. The model using the weighted feature is more powerful than the one using the unweighted feature: 9 out of 12 categories get improved results, and "Color" and "Activity Recognition" gain around 2% and 6% in accuracy, respectively.

To ensure that the improvement comes from the attention mechanism using the question type, rather than from the added question type information itself, we compare QTA with QT in Figure 5. With the same text and image features and approximately the same number of parameters in the network, QTA is 3–5% better than QT.

We show the effect of QTA on image feature norms in Figure 6. By weighting the image features by question type, we find that our model relies more on Faster R-CNN features for "Absurd" question samples, while it relies more on ResNet features for "Color" questions.

Fig. 6: Effects of weighting by QTA. Top: raw feature norms. Middle: feature norms weighted by QTA. Bottom: differences of norms after weighting vs. before weighting. For color questions, the feature norms shift towards ResNet features, while for absurd questions they shift towards Faster R-CNN features.

The best setting we obtain with a concatenation model uses the weighted image feature concatenated with the output of the LSTM and the Word2Vec feature (CATL-QTA W). It achieves a 5% improvement compared to complicated deep networks such as RAU and MCB-A in Table 5.

To show how to combine QTA with a more complicated feature integration operator, we propose the MCB-QTA structure. Even though MCB-QTA in Table 5 does not win on simple accuracy, it shows great performance in many categories such as "Object Recognition" and "Counting". Accuracy in "Utility and Affordances" is improved by 6% compared to our CATL-QTA model. It gets an 8% improvement in "Activity Recognition" compared to the state-of-the-art model MCB-A and also achieves the best Arithmetic and Harmonic MPT values.
Table 5: Results of QTA models on the TDIUC dataset compared to state-of-the-art models
Accuracy (%) | CATL-QTA W | MCB-QTA | MCB-A [15] | RAU [15]
Scene Recognition | 93.80 | 93.56 | 93.06 |
Sport Recognition | 95.55 | | |
In this part, we discuss how we use QTA for questions without given question types. It is quite easy to predict the question type from the question itself: we use a 2-layer LSTM followed by a classifier, and the test accuracy is 96% after 9 epochs. The question is whether we can predict the question type while keeping the same performance on the VQA task. As described in Figure 3, we use the predicted question type as input to the QTA network in a multi-task setting. We get 84.33% test simple accuracy on the VQA task, as shown in Table 9. When we compare it to MCB-A or RAU in Table 5, although accuracy is slightly affected for most of the categories, we still get a 2% improvement in "Sports Recognition" and "Counting".

We fine-tune our model on VQA v1 using a pre-trained multi-task model that was trained on TDIUC. We use the question type predictor in the multi-task model as the input of QTA. Our model's performance is better than MCB in Table 6, with approximately the same number of parameters in the network.

Table 6: Results of test-dev accuracy on VQA v1. Models are trained on the VQA v1 train split and tested on test-dev
Accuracy (%)
Element-wise Sum [8] | 56.50
Concatenation [8] | 57.49
Concatenation + FC [8] | 58.40
Element-wise Product [8] | 58.57
Element-wise Product + FC [8] | 56.44
MCB (2048 × → ) [8] |

To further analyze the effect of the question type prediction part in this multi-task framework, we list the confusion matrix for the question type prediction results in Table 7. "Color" and "Absurd" question type predictions are most often bi-directionally confused. The reason for this is that, among all absurd questions, more than 60% are questions starting with "What color". To avoid this bias, we remove all absurd questions and run our multi-task model again. In this setting, our question type prediction does much better than before; almost all categories reach 99% accuracy, as shown in Table 8. We also compare our QTA models' performance without absurd questions in Table 9. In the CATL-QTA network, removing absurd questions does not help much, because at test time we feed in the true question type labels. But it is useful when we consider the multi-task model. From the third and fourth columns, we see that without absurd questions, we get improved performance in all categories. This is because we remove the absurd questions that may mislead the network into predicting the "color" question type at test time.

Table 7: Confusion matrix for test question type prediction in CATL-QTA-M using the TDIUC dataset. 1. Other Attributes 2. Sentiment Understanding 3. Sports Recognition 4. Position Reasoning 5. Object Utilities/Affordances 6. Activity Recognition 7. Scene Classification 8. Color 9. Object Recognition 10. Object Presence 11. Counting 12. Absurd
Target \ Predicted | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 (overall accuracy 95.66%)
1 | 77.76 | 0.00 | 0.89 | 3.20 | 0.00 | 0.08 | 0.42 | 1.15 | 0.12 | 0.00 | 0.00 | 16.38
2 | 0.80 | 60.51 | 1.77 | 8.83 | 0.00 | 2.25 | 2.57 | 0.00 | 1.44 | 0.96 | 0.16 | 20.71
3 | 0.31 | 0.00 | 73.08 | 0.37 | 0.00 | 0.17 | 0.00 | 0.03 | 0.02 | 0.00 | 0.01 | 26.01
4 | 2.95 | 0.02 | 0.01 | 89.52 | 0.00 | 0.01 | 0.02 | 0.19 | 1.88 | 0.03 | 0.03 | 5.35
5 | 12.50 | 0.63 | 3.12 | 45.62 | 0.00 | 0.00 | 3.12 | 0.00 | 11.25 | 0.00 | 0.00 | 23.75
6 | 0.79 | 0.00 | 14.56 | 1.76 | 0.00 | 13.18 | 0.00 | 0.00 | 2.21 | 0.00 | 0.07 | 67.43
7 | 0.04 | 0.00 | 0.04 | 0.40 | 0.00 | 0.01 | 99.40 | 0.02 | 0.00 | 0.00 | 0.06 | 0.03
8 | 0.32 | 0.00 | 0.18 | 0.13 | 0.00 | 0.00 | 0.00 | 86.10 | 0.00 | 0.00 | 0.00 | 13.28
9 | 0.01 | 0.00 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 98.96 | 0.01 | 0.00 | 0.71
10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 | 0.00 | 0.00
11 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.05 | 99.90 | 0.00
12 | 0.35 | 0.00 | 0.18 | 0.41 | 0.00 | 0.03 | 0.00 | 3.18 | 0.40 | 0.00 | 0.00 | 95.46
6 Conclusion

We propose a question type-guided visual attention (QTA) network. We show empirically that, with the question type information, models can balance between bottom-up and top-down visual features and achieve state-of-the-art performance. Our results show that QTA systematically improves the performance by more than 5% across multiple question type categories, such as "Activity Recognition", "Utility" and "Counting", on the TDIUC dataset.
Table 8: Confusion matrix for test question type prediction in CATL-QTA-M using the TDIUC dataset without absurd questions. Numbers represent the same categories as in Table 7
Target \ Predicted | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 (overall accuracy 99.50%)
1 | 98.39 | 0.00 | 0.07 | 0.15 | 0.00 | 0.13 | 0.08 | 0.63 | 0.55 | 0.00 | 0.00 | N/A
2 | 0.16 | 84.03 | 3.67 | 0.00 | 0.00 | 3.35 | 5.59 | 0.00 | 0.48 | 0.00 | 2.72 | N/A
3 | 0.00 | 0.08 | 97.31 | 0.00 | 0.00 | 2.37 | 0.01 | 0.00 | 0.10 | 0.02 | 0.11 | N/A
4 | 1.01 | 0.00 | 0.00 | 98.07 | 0.00 | 0.01 | 0.00 | 0.51 | 0.41 | 0.00 | 0.00 | N/A
5 | 8.64 | 3.70 | 14.81 | 0.00 | 0.00 | 59.26 | 7.41 | 1.23 | 4.94 | 0.00 | 0.00 | N/A
6 | 0.45 | 0.15 | 31.42 | 0.00 | 0.00 | 67.39 | 0.04 | 0.04 | 0.45 | 0.00 | 0.07 | N/A
7 | 0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.03 | 99.86 | 0.02 | 0.00 | 0.00 | 0.04 | N/A
8 | 0.06 | 0.00 | 0.00 | 0.13 | 0.00 | 0.04 | 0.07 | 99.70 | 0.00 | 0.00 | 0.00 | N/A
9 | 0.06 | 0.00 | 0.13 | 0.01 | 0.00 | 0.02 | 0.00 | 0.00 | 99.76 | 0.01 | 0.00 | N/A
10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 | 0.00 | N/A
11 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 99.98 | N/A
12 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Table 9: Results of test accuracy when the question type is hidden, with/without absurd questions in training. We compare them with similar QTA models. * denotes training and testing without absurd questions
Accuracy (%) | CATL-QTA W | CATL W* | CATL-QTA W* | CATL-QTA-M | CATL-QTA-M* | CAT1* [15]
Scene Recognition | 93.80 | 93.46 | 93.62 | 93.74 | 93.82 | 72.75
Sport Recognition | 95.55 | 94.97 | 95.47 | 94.80 | 95.31 | 89.40
Color Attributes | 60.16 | 57.84 | 58.63 | 57.62 | 59.73 | 50.52
Other Attributes | 54.36 | 53.90 | 53.44 | 52.05 | 56.17 | 51.47
Activity Recognition | 60.10 | 57.38 | 59.43 | 53.13 | 58.61 | 48.55
Positional Reasoning | 34.71 | 33.98 | 34.63 | 33.90 | 34.70 | 27.73
Sub. Object Recognition | 86.98 | 86.62 | 86.74 | 86.89 | 86.80 | 81.66
Absurd | 100.00 | N/A | N/A | 98.57 | N/A | N/A
Utility and Affordances | 31.48 | 27.78 | 34.57 | 24.07 | 35.19 | 30.99
Object Presence | 94.55 | 93.87 | 94.22 | 94.57 | 94.60 | 69.50
Counting | 53.25 | 52.33 | 52.20 | 53.59 | 55.30 | 44.84
Sentiment Understanding | 64.38 | 64.06 | 65.81 | 60.06 | 61.31 | 59.94
Overall (Arithmetic MPT) | 69.11 | 65.11 | 66.25 | 66.92 | 66.88 | 57.03
Overall (Harmonic MPT) | 60.08 | 55.89 | 58.51 | 55.77 | 58.82 | 50.30
Simple Accuracy | 85.03 | 79.79 | 80.13 | 84.33 | 80.95 | 63.30

We also consider the case where we do not have question types at test time and propose a multi-task model to overcome this limitation by adding a question type prediction task to the VQA task. We get around 95% accuracy for the question type prediction while keeping the VQA task accuracy almost the same as before.

Acknowledgements
We thank Amazon AI for providing computing resources. Yang Shi is supported by Air Force Award FA9550-15-1-0221.
References
1. Agrawal, A., Batra, D., Parikh, D.: Analyzing the behavior of visual question answering models. EMNLP 2016 (2016)
2. Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don't just assume; look and answer: Overcoming priors for visual question answering. CVPR 2018
3. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and VQA. arXiv:1707.07998 (2017), http://arxiv.org/abs/1707.07998
4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. In: International Conference on Computer Vision (ICCV) (2015)
5. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of ICALP'02 (2002)
6. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Neural Information Processing Systems, Workshop on Machine Learning Systems (2015)
7. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531 (2013), http://arxiv.org/abs/1310.1531
8. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP 2016 (2016)
9. Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. Computer Vision and Pattern Recognition (CVPR) (2016)
10. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
11. Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: VizWiz grand challenge: Answering visual questions from blind people. arXiv:1802.08218 (2018)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
14. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv:1612.06890 (2016), http://arxiv.org/abs/1612.06890
15. Kafle, K., Kanan, C.: An analysis of visual question answering algorithms. In: ICCV (2017)
16. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D.A., Bernstein, M.S., Li, F.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv:1602.07332 (2016)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
18. Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
19. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS (2016)
20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
21. Noh, H., Han, B.: Training recurrent answering units with joint loss minimization for VQA. arXiv:1606.03647 (2016)
22. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497 (2015), http://arxiv.org/abs/1506.01497
23. Simon, M., Rodner, E., Gao, Y., Darrell, T., Denzler, J.: Generalized orderless pooling performs implicit salient matching (2017)
24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014), http://arxiv.org/abs/1409.1556
25. Strub, F., de Vries, H., Mary, J., Piot, B., Courville, A.C., Pietquin, O.: End-to-end optimization of goal-driven and visually grounded dialogue systems. In: International Joint Conference on Artificial Intelligence (IJCAI) (2017)
26. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv:1409.3215 (2014), http://arxiv.org/abs/1409.3215
27. Teney, D., Anderson, P., He, X., van den Hengel, A.: Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv:1708.02711 (2017), http://arxiv.org/abs/1708.02711
28. Wang, Z., Liu, X., Chen, L., Wang, L., Qiao, Y., Xie, X., Fowlkes, C.: Structured triplet learning with POS-tag guided attention for visual question answering. IEEE Winter Conference on Applications of Computer Vision (2018)
29. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144 (2016), http://arxiv.org/abs/1609.08144