Multim. Tools Appl. | 2021

Temporal video scene segmentation using deep-learning


Abstract


Automatic temporal video scene segmentation (also known as video story segmentation) is still an open problem, without definitive solutions in most cases. Among the available techniques, the ones that show the best results are multimodal, using features extracted from multiple modalities. Multimodal fusion may be performed by combining all modalities into a single representation (early fusion) or by fusing the segmentations produced from each modality (late fusion), the latter being widely adopted due to its simplicity. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been successfully employed to extract features from multiple data sources, easing the development of early fusion methods. However, CNNs cannot adequately learn cues that are temporally distributed along a video, because they have difficulty modeling temporal dependencies in the feature data. A deep learning approach that can learn such cues is the recurrent neural network (RNN). Successfully employed in text processing, RNNs are suited to analyzing data sequences of variable length and may better capture the temporal relationships among low-level features of video segments, potentially yielding more accurate scene boundary detection. This paper goes beyond directly applying RNNs and proposes a new multimodal approach to temporally segmenting a video into scenes. The approach builds a new architecture that carefully combines CNN and RNN capabilities, obtaining better efficacy on the task than related techniques on a public video dataset.

Volume 80
Pages 17487-17513
DOI 10.1007/s11042-020-10450-2
Language English
Journal Multim. Tools Appl.
