Neurocomputing | 2021

Multi visual and textual embedding on visual question answering for blind people


Abstract


The visually impaired community, especially blind people, needs assistance from advanced technologies to understand and answer questions about images. Building on the intersection of vision and language, Visual Question Answering (VQA) predicts an answer to a textual question about an image, making it well suited to helping blind people capture an image and have their questions answered automatically. Traditional approaches rely on the strength of convolutional and recurrent networks, which require considerable effort to train and optimize. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To leverage prior knowledge from different domains, we propose BERT-RG, a careful integration of pre-trained models as feature extractors that relies on the interaction between residual and global features of the image and linguistic features of the question. Moreover, our architecture integrates a stacked attention mechanism that exploits the relationship between textual and visual objects: partial regions of the image interact with partial keywords of the question to enhance the text-vision representation. We also propose a novel perspective by focusing on a specific question type in VQA, which is meaningful enough to justify a specialized system rather than pursuing unlimited and unrealistic general-purpose approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the Yes/No question type.
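The abstract does not include implementation details, but the fusion it describes can be illustrated with a minimal PyTorch-style sketch: a stacked attention module in which a pooled question embedding (e.g. from BERT) repeatedly attends over region-level image features, and the refined query is combined with a global image feature for answer classification. All dimensions, layer names, the two-hop depth, and the Yes/No output size below are illustrative assumptions, not the authors' exact BERT-RG configuration.

```python
# Hypothetical sketch of stacked attention over image regions and a question
# vector, in the spirit of the BERT-RG description in the abstract.
# Dimensions, hop count, and layer names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedAttentionFusion(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512, hops=2, num_answers=2):
        super().__init__()
        self.hops = hops
        # Separate projections of the visual regions and the textual query per hop.
        self.img_proj = nn.ModuleList([nn.Linear(feat_dim, hidden_dim) for _ in range(hops)])
        self.txt_proj = nn.ModuleList([nn.Linear(feat_dim, hidden_dim) for _ in range(hops)])
        self.att_score = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(hops)])
        # Classifier over the refined query concatenated with a global image
        # descriptor (e.g. Yes/No answers, as evaluated in the paper).
        self.classifier = nn.Linear(feat_dim * 2, num_answers)

    def forward(self, region_feats, global_feat, question_feat):
        # region_feats:  (B, R, feat_dim) residual-style region features
        # global_feat:   (B, feat_dim)    global image feature
        # question_feat: (B, feat_dim)    pooled BERT question embedding
        query = question_feat
        for k in range(self.hops):
            # Score each region against the current query.
            joint = torch.tanh(self.img_proj[k](region_feats)
                               + self.txt_proj[k](query).unsqueeze(1))
            weights = F.softmax(self.att_score[k](joint), dim=1)  # (B, R, 1)
            attended = (weights * region_feats).sum(dim=1)        # (B, feat_dim)
            # Refine the query with the attended visual evidence.
            query = query + attended
        fused = torch.cat([query, global_feat], dim=-1)
        return self.classifier(fused)


# Example usage with random tensors standing in for CNN / BERT outputs.
regions = torch.randn(4, 36, 768)
global_img = torch.randn(4, 768)
question = torch.randn(4, 768)
logits = StackedAttentionFusion()(regions, global_img, question)  # (4, 2)
```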

DOI 10.1016/j.neucom.2021.08.117
Language English
Journal Neurocomputing
