Learning Rich Image Region Representation for Visual Question Answering
Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu
Microsoft Research Asia
{bei.liu, t-zhihua, v-zhazen, t-zheche, jianf}@microsoft.com

Abstract
We propose to boost VQA performance by improving the representation ability of both visual and text features and by ensembling models. For visual features, we apply several detection techniques to strengthen the object detector used for feature extraction. For text features, we adopt BERT as the language model and find that it significantly improves VQA performance. Our solution won second place in the VQA Challenge 2019.
1. Introduction
The task of Visual Question Answering (VQA) requires a model to answer a text question based on an input image. Most works [6, 9, 16] leverage visual features extracted from the image and text features extracted from the question, and perform classification to obtain the answer. Visual and textual features thus serve as basic components that directly impact the final performance. In this paper, we propose to improve VQA performance by extracting more powerful visual and text features.

For visual features, most existing works [6, 9, 16] adopt the bottom-up-attention features released by [1], whose feature extractor is a Faster R-CNN object detector built upon a ResNet-101 backbone. We adopt more powerful backbones (i.e. ResNeXt-101, ResNeXt-152) to train stronger detectors. Techniques that improve detector accuracy (i.e. FPN, multi-scale training) also help to boost VQA performance.

For text features, we build upon recent state-of-the-art techniques from the NLP community. Large-scale language models such as ELMo [10], GPT [11] and BERT [4] have shown excellent results on various NLP tasks at both the token and sentence level. BERT uses masked language modeling to pre-train deep bidirectional representations, allowing each token representation to fuse both left and right context. In a VQA model, the token-level features need to carry the question's contextual information so that they can be fused with the visual tokens for reasoning, so we adopt BERT as our language model.

Experiments on the VQA 2.0 dataset show the effectiveness of each component of our solution. Our final model achieves 74.89 accuracy on the test-standard split, which won second place in the VQA Challenge 2019.
2. Feature Representation
Existing works [2, 13] show that detection features are more powerful than classification features on the VQA task, so we train object detectors on a large-scale dataset for feature extraction.
Dataset.
We adopt Visual Genome 1.2 [7] as the object detection dataset. Following the setting in [2], we adopt 1,600 object classes and 400 attribute classes as training categories. The dataset is divided into train, val, and test splits. We train detectors on the train split, and use the val and test splits as validation sets to tune parameters.

Detector.
We follow the Faster R-CNN [12] pipeline to build our detectors. We adopt ResNeXt [15] with FPN [8] as the backbone, initialized with parameters pre-trained on ImageNet [3]. We use RoIAlign [5] to warp region features into a fixed size and embed them via two fully connected layers. Similar to [2], we add a classification branch on top of the region feature and utilize attribute annotations as additional supervision. This attribute branch is only used to enhance representation ability during training, and is discarded in the feature-extraction stage.
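To make the attribute supervision concrete, the box head can be sketched as two shared fully connected layers feeding both an object classifier and an auxiliary attribute classifier. The PyTorch sketch below is illustrative only; the module names and dimensions (e.g. a 2048-dim region feature, 1,600 object and 400 attribute classes) are assumptions, not the released implementation.

```python
import torch.nn as nn

class RegionFeatureHead(nn.Module):
    """Box head: RoIAlign output -> two FC layers -> object / attribute logits."""
    def __init__(self, in_dim=256 * 7 * 7, feat_dim=2048,
                 num_objects=1600, num_attributes=400):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
        )
        self.obj_cls = nn.Linear(feat_dim, num_objects + 1)   # +1 for background
        self.attr_cls = nn.Linear(feat_dim, num_attributes)   # auxiliary branch

    def forward(self, roi_feats):              # roi_feats: (N, 256, 7, 7) from RoIAlign
        x = self.fc(roi_feats.flatten(1))      # (N, feat_dim) pooled region feature
        return x, self.obj_cls(x), self.attr_cls(x)
```

In this sketch, only the attribute branch would be discarded at feature-extraction time; the object scores are still needed to select confident boxes.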
Feature.
Given an image, we first feed it into the trained detector and apply non-maximum suppression (NMS) within each category to remove duplicate boxes. We then keep the boxes with the highest object confidence and extract their FC features. These boxes, together with their features, are taken as the representation of the given image.
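A minimal sketch of this selection step, assuming torchvision's batched_nms for per-category suppression; the IoU threshold and the number of kept regions are placeholders, not values from the paper.

```python
from torchvision.ops import batched_nms

def select_regions(boxes, scores, labels, fc_feats, iou_thresh=0.5, num_regions=100):
    """Per-category NMS, then keep the top-scoring boxes and their FC features.

    boxes: (N, 4), scores: (N,), labels: (N,), fc_feats: (N, D)
    """
    # batched_nms only suppresses overlapping boxes within the same category;
    # the kept indices are already sorted by decreasing score.
    keep = batched_nms(boxes, scores, labels, iou_thresh)[:num_regions]
    return boxes[keep], fc_feats[keep]   # these regions represent the image
```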
Split | Backbone | FPN dim | Attribute | Language | Yes/No | Num | Others | Score
test-dev | Bottom-up-attention (ResNet-101) | - | X | Glove | 85.42 | 54.04 | 60.52 | 70.04
test-dev | Facebook Pythia (ResNeXt-101) | 512 | X | Glove | 85.56 | 52.68 | 60.87 | 70.11
test-dev | ResNeXt-101 | 256 |  | Glove | 83.1 | 53.0 | 55.62 | 66.64
test-dev | ResNeXt-101 | 256 | X | Glove | 85.44 | 54.2 | 60.87 | 70.23
test-dev | ResNeXt-152 | 256 | X | Glove | 86.42 | 55.11 | 61.88 | 71.22
test-dev | ResNeXt-152 | 512 | X | Glove | 86.59 | 56.44 | 62.06 | 71.53
test-dev | ResNeXt-152 (ms-train) | 256 | X | Glove | 86.46 | 56.37 | 62.24 | 71.55
test-dev | ResNeXt-152 (ms-train) | 512 | X | Glove | 86.54 | 56.90 | 62.31 | 71.68
test-dev | ResNeXt-152 | 256 | X | BERT | 88.00 | 56.28 | 62.90 | 72.48
test-dev | ResNeXt-152 | 512 | X | BERT | 88.15 | 56.79 | 62.98 | 72.64
test-dev | ResNeXt-152 (ms-train) | 256 | X | BERT | 88.14 | 56.74 | 63.3 | 72.79
test-dev | ResNeXt-152 (ms-train) | 512 | X | BERT | 88.18 | 55.35 | 63.16 | 72.58
test-dev | Ensemble (5 models) | - | - | - | 89.65 | 58.53 | 65.27 | 74.55
test-dev | Ensemble (20 models) | - | - | - | 89.81 | 58.89 | 65.39 | 74.71
test-std | Ensemble (20 models) | - | - | - | 89.81 | 58.36 | 65.69 | 74.89

Table 1. Experiment results on VQA 2.0 test-dev and test-std splits. We adopt BAN as the VQA model in all settings. The first two rows indicate the results of models we train on released features. "ms-train" means using a multi-scale strategy in detector training.

The BERT model, introduced in [4], can be seen as a multi-layer bidirectional Transformer based on [14]. The model consists of embedding layers, Transformer blocks and self-attention heads, and comes in two sizes. The base model has 12 Transformer blocks with a hidden size of 768 and 12 self-attention heads, for a total of about 110M parameters. The large model has 24 Transformer blocks with a hidden size of 1024 and 16 self-attention heads, for a total of about 340M parameters. The model can process a single text sentence or a pair of text sentences (i.e., [Question, Answer]) in one token sequence. To separate a sentence pair, a special token ([SEP]) is added between the two sentences, a learned sentence A embedding is added to every token of the first sentence, and a sentence B embedding is added to every token of the second sentence. For the VQA task there is only one sentence, so we only use the sentence A embedding.

Considering the total parameter budget of the VQA model, we use the base BERT as our language model to extract question features. To get each token's representation, we use only the hidden states of the last attention block over the full sequence. Since pre-trained BERT has been shown to boost many natural language processing tasks, we adopt the uncased base BERT pre-trained weights as our initial parameters.
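As an illustration of extracting token-level question features from the last Transformer block, the sketch below uses the HuggingFace transformers library; this library choice and the maximum question length are assumptions for the example, not details from the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # uncased base weights
bert = BertModel.from_pretrained("bert-base-uncased")

def question_features(question, max_len=14):
    """Return one feature vector per token from the last Transformer block."""
    enc = tokenizer(question, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)                 # single sentence -> sentence A embeddings only
    return out.last_hidden_state          # shape (1, max_len, 768)

feats = question_features("What color is the umbrella?")
```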
3. VQA Model
In recent years, many VQA models have achieved impressive results. We adopt Bilinear Attention Networks [6] (BAN) as our base model. A single eight-glimpse model reaches 70.04 on the VQA 2.0 test-dev split. The BAN model uses Glove and a GRU as the language model, and its language feature is a matrix of shape [question length, feature dimension]. To improve the VQA model performance, we replace the language model with base BERT and modify the dimension of BAN's language input feature. To train BAN with BERT, we keep all settings from BAN but adjust the maximum number of epochs and use a cosine learning rate scheduler. To make good use of the pre-trained base BERT parameters, we apply a separate, smaller learning rate to the BERT module.
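A minimal sketch of this fine-tuning setup follows: two parameter groups so the pre-trained BERT module receives a smaller learning rate than the rest of BAN, driven by a cosine schedule. The modules, learning rates, and epoch count are placeholders, since the exact values are not reproduced here.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

bert_module = nn.Linear(768, 768)     # stand-in for the BERT question encoder
ban_module = nn.Linear(768, 3129)     # stand-in for BAN's fusion and answer classifier

optimizer = torch.optim.Adamax([
    {"params": bert_module.parameters(), "lr": 1e-5},   # small LR preserves pre-training
    {"params": ban_module.parameters(),  "lr": 2e-3},   # default-scale LR for BAN
])
num_epochs = 13                                          # assumed epoch budget
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch over VQA 2.0 would run here ...
    scheduler.step()
```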
4. Experiments
Table 1 shows our ablation study on each component, including the attribute head, the FPN dimension, and the language model. Comparing the 3rd and 4th rows, we find that the attribute head brings more than 3 points of improvement to the final performance, which shows the effectiveness of this module. Comparing the 5th to 8th rows with the 9th to 12th rows, we find that BERT consistently boosts performance by roughly one point. Besides, increasing the FPN dimension and adopting multi-scale training both slightly improve VQA accuracy.

We select BAN trained on Bottom-up-attention and Facebook features as baselines. Our best single model achieves 72.79 accuracy on the test-dev split, which significantly outperforms existing state-of-the-art results. We also ensemble several trained models by averaging their output probabilities. The 20-model ensemble achieves 74.71 and 74.89 accuracy on the VQA test-dev and test-std splits, respectively, which won second place in the VQA Challenge 2019.
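The ensemble simply averages the per-answer probabilities of the individual models before taking the argmax; the sketch below shows this step, with the model list and answer vocabulary left as placeholders.

```python
import torch

def ensemble_predict(models, image_feats, question_tokens):
    """Average the softmax outputs of several VQA models and pick the best answer."""
    probs = None
    for m in models:
        logits = m(image_feats, question_tokens)      # (batch, num_answers)
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    return (probs / len(models)).argmax(dim=-1)       # predicted answer indices
```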
5. Conclusion
We have shown that for the VQA task, the representation capacity of both visual and textual features is critical for the final performance.

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077-6086, 2018.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255. IEEE, 2009.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[6] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems 31, pages 1571-1581, 2018.
[7] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017.
[8] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[9] Gao Peng, Hongsheng Li, Haoxuan You, Zhengkai Jiang, Pan Lu, Steven Hoi, and Xiaogang Wang. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252, 2018.
[10] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.
[11] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[13] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4223-4232, 2018.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[15] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492-1500, 2017.
[16] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering.