Expert Systems with Applications | 2021
Audiovisual speaker indexing for Web-TV automations
Abstract
The current paper introduces a multimodal framework that provides Web-TV automations for live broadcasting and the management of big streaming data. The term indexing refers to the spatiotemporal localization of speakers participating in a discussion panel. Multiple modalities operating in parallel form the data-driven decision-making pipeline. The automated workflow includes the tasks of active speaker detection and localization, frame selection, and the creation of a semantically annotated database. For improved performance and robustness, an information fusion model is proposed, which exploits different audio and visual modalities. Audio-driven Voice Activity Detection follows the Enhanced Temporal Integration methodology applied to a standard audio feature set. The dominant audio source is localized by computing the time lag that maximizes the Generalized Cross-Correlation (GCC) function. The visual modalities include face and mouth detection and Visual Voice Activity Detection. A Long Short-Term Memory network is trained on mouth image sequences to determine voice activity. The outputs of the audio and visual Voice Activity Detection modules, together with the GCC result, are used to train an Adaptive Neuro-Fuzzy model, which is responsible for the final decision. Experimental results demonstrate the superiority of the information fusion approach over unimodal audio and visual models.
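To make the localization step concrete, the sketch below estimates the time difference of arrival (TDOA) between two microphone channels as the lag that maximizes the Generalized Cross-Correlation, as the abstract describes. The PHAT weighting, the two-channel setup, and the function name `gcc_phat` are illustrative assumptions; the abstract does not specify the weighting function or the array geometry.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals via
    Generalized Cross-Correlation with PHAT weighting (assumed)."""
    n = sig.shape[0] + ref.shape[0]            # FFT length covering the full correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                     # cross-power spectrum
    R /= np.abs(R) + 1e-12                     # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)                  # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:                    # optionally bound the search by geometry
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # argmax of the GCC -> lag in samples
    return shift / fs                          # TDOA in seconds

# Toy check: the second channel is a 5-sample-delayed copy of the first.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = np.roll(x, 5)
print(gcc_phat(y, x, fs))                      # ~ 5 / 16000 s
```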
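Similarly, a minimal visual Voice Activity Detection classifier in the spirit of the abstract could feed a short sequence of mouth-region crops to an LSTM that outputs a speaking probability. Every architectural detail here (sequence length, crop size, layer widths, and the use of Keras) is an assumption for illustration, not the paper's reported configuration.

```python
import numpy as np
import tensorflow as tf

def build_visual_vad(seq_len=20, mouth_dim=32 * 32):
    """Hypothetical visual VAD: an LSTM over flattened mouth crops
    (seq_len frames of assumed 32x32 grayscale), predicting P(speaking)."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(seq_len, mouth_dim)),  # temporal lip-motion model
        tf.keras.layers.Dense(1, activation="sigmoid"),              # speaking probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy usage: random tensors stand in for real mouth-image sequences.
model = build_visual_vad()
x = np.random.rand(8, 20, 32 * 32).astype("float32")
y = (np.random.rand(8) > 0.5).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x[:1], verbose=0))         # single speaking probability
```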