IEEE Transactions on Circuits and Systems for Video Technology | 2021

Learning Video Moment Retrieval Without a Single Annotated Video

Abstract


Video moment retrieval, which aims to locate the moment in a video that is most relevant to a given natural language query, has progressed significantly over the past few years. Most existing methods are trained in a fully-supervised or weakly-supervised manner, which requires a time-consuming and expensive manual labeling process. In this work, we propose an alternative approach to video moment retrieval that requires no textual annotations of videos and instead leverages existing visual concept detectors and a pre-trained image-sentence embedding space. Specifically, we design a video-conditioned sentence generator that produces a suitable sentence representation from the visual concepts mined in the video. We then design a GNN-based relation-aware moment localizer to select a portion of video clips under the guidance of the generated sentence. Finally, the pre-trained image-sentence embedding space is adopted to evaluate the matching scores between the generated sentence and moment representations, with knowledge transferred from the image domain. By maximizing these scores, the sentence generator and moment localizer enhance and complement each other to accomplish moment retrieval. Experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of the proposed method.
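
The abstract describes a three-part pipeline: a video-conditioned sentence generator built from mined visual concepts, a GNN-based relation-aware moment localizer, and a pre-trained image-sentence embedding space whose matching score is maximized to train both modules jointly. The PyTorch sketch below is a minimal illustration of that loop under stated assumptions: the module names (ConceptSentenceGenerator, RelationAwareLocalizer), the single message-passing layer, the soft clip-selection weights, and the cosine-similarity stand-in for the pre-trained image-sentence space are all illustrative choices, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConceptSentenceGenerator(nn.Module):
        """Illustrative stand-in: forms a sentence representation from
        concept embeddings mined in the video, conditioned on video features."""
        def __init__(self, concept_dim, video_dim, embed_dim):
            super().__init__()
            self.query = nn.Linear(video_dim, concept_dim)
            self.proj = nn.Linear(concept_dim, embed_dim)

        def forward(self, concept_emb, video_feat):
            # concept_emb: (num_concepts, concept_dim), video_feat: (num_clips, video_dim)
            q = self.query(video_feat.mean(dim=0, keepdim=True))   # video-conditioned query
            attn = F.softmax(concept_emb @ q.t(), dim=0)           # weight concepts by relevance
            sentence = (attn * concept_emb).sum(dim=0)
            return self.proj(sentence)                             # (embed_dim,)

    class RelationAwareLocalizer(nn.Module):
        """Illustrative stand-in for the GNN-based localizer: one round of
        message passing over clips, then sentence-guided soft clip selection."""
        def __init__(self, video_dim, embed_dim):
            super().__init__()
            self.msg = nn.Linear(video_dim, video_dim)
            self.score = nn.Linear(video_dim + embed_dim, 1)
            self.proj = nn.Linear(video_dim, embed_dim)

        def forward(self, video_feat, sentence_emb):
            # Fully-connected message passing among clips (simplified GNN layer).
            adj = F.softmax(video_feat @ video_feat.t(), dim=-1)   # (num_clips, num_clips)
            h = video_feat + adj @ self.msg(video_feat)
            # Score each clip against the generated sentence and pool softly.
            s = sentence_emb.expand(h.size(0), -1)
            w = torch.sigmoid(self.score(torch.cat([h, s], dim=-1)))
            moment = (w * h).sum(dim=0) / w.sum().clamp(min=1e-6)
            return self.proj(moment), w.squeeze(-1)                # moment embedding, clip weights

    def matching_loss(sentence_emb, moment_emb):
        """Stand-in for scoring in the pre-trained image-sentence space:
        maximize cosine similarity between sentence and moment embeddings."""
        return 1.0 - F.cosine_similarity(sentence_emb, moment_emb, dim=-1)

    # Toy forward/backward pass; random tensors stand in for detector
    # concept embeddings and clip-level visual features.
    num_clips, num_concepts = 16, 20
    video_feat = torch.randn(num_clips, 512)
    concept_emb = torch.randn(num_concepts, 300)

    generator = ConceptSentenceGenerator(concept_dim=300, video_dim=512, embed_dim=256)
    localizer = RelationAwareLocalizer(video_dim=512, embed_dim=256)

    sentence_emb = generator(concept_emb, video_feat)
    moment_emb, clip_weights = localizer(video_feat, sentence_emb)
    loss = matching_loss(sentence_emb, moment_emb)
    loss.backward()

In the paper, the matching score comes from a frozen, pre-trained image-sentence embedding space that transfers knowledge from the image domain; the shared 256-d space and cosine similarity above merely stand in for it to keep the sketch self-contained.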

Pages 1-1
DOI 10.1109/TCSVT.2021.3075470
Language English
Journal IEEE Transactions on Circuits and Systems for Video Technology
