IEEE Access | 2021

Multi-Perspective Attention Network for Fast Temporal Moment Localization


Abstract


Temporal moment localization (TML) aims to retrieve the temporal interval of a moment semantically relevant to a sentence query. This is challenging because it requires understanding a video, a sentence, and the relationship between them. Existing TML methods have shown impressive performance by modeling interactions between videos and sentences with fine-grained techniques. However, these fine-grained techniques incur high computational overhead, making them impractical. This work proposes an effective and efficient multi-perspective attention network for temporal moment localization. Inspired by the way humans understand an image from multiple perspectives and in different contexts, we devise a novel multi-perspective attention mechanism consisting of perspective attention and multi-perspective modal interactions. Specifically, a perspective attention layer based on multi-head attention takes two memory sequences as inputs, one serving as the base memory and the other as the reference memory. Perspective attention assesses the two memories, models the relationship between them, and encourages the base memory to focus on features related to the reference memory, yielding an understanding of the base memory from the perspective of the reference memory. Furthermore, multi-perspective modal interactions model the complex relationship between a video and a sentence query and produce a modal-interacted memory: visual features that have selectively absorbed query-related information. Like heavyweight fine-grained TML methods, the proposed network accurately captures this complex relationship, yet it remains as lightweight as coarse-grained TML methods. We also adopt a fast action recognition network to extract visual features efficiently, which further reduces computational overhead. Through experiments on three TML benchmark datasets, we demonstrate the effectiveness and efficiency of the proposed network.
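The abstract gives no implementation details, but the perspective attention layer it describes maps naturally onto standard cross-attention. Below is a minimal PyTorch sketch, assuming perspective attention is multi-head attention with the base memory as the query and the reference memory as the key and value; the class name PerspectiveAttention, the residual/LayerNorm placement, and all dimensions are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class PerspectiveAttention(nn.Module):
    """Cross-attention sketch: the base memory attends to a reference
    memory, re-weighting base features from the reference's perspective."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Multi-head attention with queries from the base memory and
        # keys/values from the reference memory (cross-attention).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, base: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # base: (batch, len_base, dim), ref: (batch, len_ref, dim)
        attended, _ = self.attn(query=base, key=ref, value=ref)
        # Residual connection keeps the original base features while
        # mixing in reference-conditioned information.
        return self.norm(base + attended)

# Hypothetical modal interaction: view the video from the query's perspective.
video = torch.randn(2, 128, 256)      # (batch, num_clips, dim)
sentence = torch.randn(2, 20, 256)    # (batch, num_words, dim)
v2s = PerspectiveAttention()
modal_memory = v2s(video, sentence)   # (2, 128, 256) query-aware visual memory

Swapping the roles of the two memories gives the complementary perspective (the sentence viewed from the video), which is one plausible reading of the "multi-perspective" modal interactions described above.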

Volume 9
Pages 116962-116972
DOI 10.1109/ACCESS.2021.3106698
Language English
Journal IEEE Access
