2021 2nd International Conference on Artificial Intelligence and Information Systems | 2021
Audio-Visual Saliency Network with Audio Attention Module
Abstract
Recently, inspired by the fact that people rely on both visual and audio modalities to observe the world, audio-visual multi-modal methods for computer vision tasks in deep learning have attracted increasing attention, and many works have made good progress. However, most audio-visual multi-modal networks simply extract features from the input audio and rarely consider whether audio noise influences the prediction results. Based on this observation, we propose a detachable CNN audio attention module and apply it to a classic audio-visual saliency prediction network. Experimental results demonstrate that our audio attention module has little impact on the prediction speed of the existing network while significantly improving prediction accuracy, which demonstrates the effectiveness of this module for enhancing audio feature extraction.
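The abstract does not specify the internal design of the audio attention module, so the following is only an illustrative sketch of what a detachable attention gate on audio features could look like, not the module proposed here. It assumes a squeeze-and-excitation style channel gate applied to an audio feature map of shape (channels, time, frequency). The class name `AudioChannelAttention`, the `reduction` parameter, and the `enabled` flag are hypothetical, and random weights stand in for learned parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AudioChannelAttention:
    """Hypothetical channel-attention gate for audio feature maps.

    Assumption: a squeeze-and-excitation style design, chosen only for
    illustration. The output shape equals the input shape, so the gate
    can be inserted into or removed from an existing audio branch.
    """

    def __init__(self, channels, reduction=4, seed=0):
        rng = np.random.default_rng(seed)
        hidden = max(channels // reduction, 1)
        # Random weights stand in for parameters that would be learned.
        self.w1 = rng.standard_normal((hidden, channels)) * 0.1
        self.w2 = rng.standard_normal((channels, hidden)) * 0.1
        self.enabled = True  # set to False to "detach" the module

    def __call__(self, feats):
        # feats: array of shape (channels, time, freq)
        if not self.enabled:
            return feats  # detached: pass features through unchanged
        squeezed = feats.mean(axis=(1, 2))          # global average pool -> (channels,)
        hidden = np.maximum(self.w1 @ squeezed, 0)  # bottleneck with ReLU
        weights = sigmoid(self.w2 @ hidden)         # per-channel gate in (0, 1)
        return feats * weights[:, None, None]       # reweight each channel


# Usage: gate a dummy audio feature map with 8 channels.
feats = np.random.default_rng(1).standard_normal((8, 5, 6))
attention = AudioChannelAttention(channels=8)
gated = attention(feats)
```

Because each gate value lies strictly between 0 and 1, the module can only attenuate channels, which matches the intuition of suppressing noisy audio features. Since the output has the same shape as the input, disabling the gate leaves the rest of the network unchanged.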