2021 International Joint Conference on Neural Networks (IJCNN) | 2021

Attentive Contextual Network for Image Captioning

 
 
 

Abstract


Existing image captioning approaches fail to generate fine-grained captions because they lack a rich encoded representation of the image. In this paper, we present an attentive contextual network (ACN) that learns spatially transformed image features and dense multi-scale contextual information to generate semantically meaningful captions. First, we construct a deformable network on the intermediate layers of a convolutional neural network (CNN) to learn spatially invariant features. Multi-scale contextual features are produced by a contextual network placed on top of the last layers of the CNN. We then apply an attention mechanism to the contextual network to extract dense contextual features. The extracted spatial and contextual features are combined to encode a holistic representation of the image. Finally, a multi-stage caption decoder with a visual attention module generates fine-grained captions. We demonstrate the performance of the proposed approach on the COCO dataset, the largest dataset for image captioning.
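
The abstract does not state how the attention weights or the feature combination are computed. The following is only a minimal sketch of the general idea, assuming dot-product attention over a few multi-scale context vectors and plain concatenation as the combination step. The function names (`softmax`, `attend`, `fuse`), the query vector, and the toy feature values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context_feats, query):
    """Dot-product attention over multi-scale context vectors (assumed form).

    context_feats: one feature vector per scale, all of the same length.
    query: a vector of that same length (hypothetical; e.g. a decoder state).
    Returns the attention-weighted sum of the context vectors and the weights.
    """
    scores = [sum(c * q for c, q in zip(feat, query)) for feat in context_feats]
    weights = softmax(scores)
    dim = len(context_feats[0])
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, context_feats))
        for i in range(dim)
    ]
    return pooled, weights


def fuse(spatial_feat, context_feats, query):
    """Combine spatial and attended contextual features by concatenation.

    Concatenation is an assumption; the abstract only says the two feature
    types are "combined".
    """
    pooled, weights = attend(context_feats, query)
    return spatial_feat + pooled, weights


# Toy inputs: one spatial vector and context vectors at three scales.
spatial = [0.5, -0.5]
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
fused, weights = fuse(spatial, contexts, query)
```

In this toy setting the first and third context vectors align equally well with the query, so they receive equal weights that are larger than the weight of the second. A real implementation would operate on CNN feature maps rather than short lists.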

Pages 1-8
DOI 10.1109/IJCNN52387.2021.9533970
Language English
Journal 2021 International Joint Conference on Neural Networks (IJCNN)
