2021 International Joint Conference on Neural Networks (IJCNN)
Looking Back and Forward: Enhancing Image Captioning with Global Semantic Guidance
Abstract
Image captioning requires models to translate an image into a natural language description. Recent captioning methods are mostly based on the popular encoder-decoder framework, which uses an encoder to map an image into a set of feature vectors and a decoder to generate the caption word by word. However, when generating the next word, the decoder only looks back at the preceding words and cannot foresee the subsequent words yet to be produced, and thus lacks awareness of the image's global semantic information. Moreover, although a pre-generated caption conveys global semantic information about the image, it sometimes contains inconsistent content that needs to be corrected. This paper proposes a novel Temporal-Free Semantic-Guided attention mechanism (TFSG) that uses the raw caption pre-generated by a primary decoder as an extra input, providing global semantic guidance during generation and deepening visual understanding by balancing semantic and visual information. Specifically, we design two attention-based structures for TFSG that selectively distill the essence of the semantic and visual information to produce refined descriptions. We apply TFSG to two popular captioning models and conduct extensive experiments on the public MS COCO benchmark. Results show that our proposed model outperforms many state-of-the-art models.
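To make the draft-then-refine idea behind TFSG concrete, the following is a minimal PyTorch-style sketch of a semantic-guided attention step, not the paper's actual implementation. It assumes the refinement decoder attends over both the visual features and the embedded tokens of the draft caption (treated as an unordered, "temporal-free" set of semantic cues), and that the two attended contexts are fused with a learned gate; the module name `TFSGAttention`, all dimensions, and the gating fusion are hypothetical choices for illustration, since the abstract only states that two attention-based structures are used.

```python
import torch
import torch.nn as nn


class TFSGAttention(nn.Module):
    """Sketch of one semantic-guided attention step (illustrative only).

    At each decoding timestep, the refinement decoder attends over
    (a) the visual feature vectors from the encoder and
    (b) the embeddings of the draft caption produced by a primary decoder,
    then balances the two contexts with a sigmoid gate.
    """

    def __init__(self, hidden_dim: int, feat_dim: int, embed_dim: int):
        super().__init__()
        # Project visual features and draft-caption embeddings into the
        # decoder's hidden space for dot-product attention.
        self.vis_proj = nn.Linear(feat_dim, hidden_dim)
        self.sem_proj = nn.Linear(embed_dim, hidden_dim)
        # Learned gate balancing visual vs. semantic context per timestep
        # (a hypothetical fusion choice, not specified by the abstract).
        self.gate = nn.Linear(3 * hidden_dim, 1)

    def attend(self, query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
        # query: (B, H); keys: (B, N, H) -> attended context: (B, H)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (B, N)
        weights = torch.softmax(scores, dim=1)                    # (B, N)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (B, H)

    def forward(self, h_t, vis_feats, draft_embeds):
        # h_t: decoder hidden state (B, H)
        # vis_feats: encoder outputs (B, N_v, feat_dim)
        # draft_embeds: draft-caption token embeddings (B, N_s, embed_dim);
        # no positional encoding is added, so the draft acts as a
        # temporal-free bag of semantic cues.
        v = self.attend(h_t, self.vis_proj(vis_feats))     # visual context
        s = self.attend(h_t, self.sem_proj(draft_embeds))  # semantic context
        beta = torch.sigmoid(self.gate(torch.cat([h_t, v, s], dim=1)))
        return beta * v + (1.0 - beta) * s                 # fused guidance
```

In this sketch the fused context would be fed to the refinement decoder at every timestep, so each generated word is conditioned on the whole draft caption rather than only on the words emitted so far, which is the property the abstract identifies as missing from standard left-to-right decoders.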