2021 7th International Conference on Computing and Artificial Intelligence

Online Audio-Visual Speech Separation with Generative Adversarial Training


Abstract


Audio-visual speech separation has been demonstrated to be effective in solving the cocktail party problem. However, most existing models do not support online processing, which limits their application in video communication and human-robot interaction. In addition, SI-SNR (scale-invariant signal-to-noise ratio), the most popular training loss in speech separation, introduces artifacts into the separated audio that can harm downstream applications such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to address these two problems. We build our generator (i.e., the audio-visual speech separator) from causal temporal convolutional network (TCN) blocks and propose a streaming inference strategy, which together allow the model to separate speech in an online manner. The discriminator participates in optimizing the generator, which reduces the negative effects of the SI-SNR loss. Experiments on simulated 2-speaker mixtures derived from the challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and the audio-visual model advr-AVSS at the same model size. We also measure the running time of our model on GPU and CPU, and the results confirm that it meets the requirements of online processing. The demo and code can be found at https://github.com/aispeech-lab/oavss.
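
For reference, the SI-SNR loss the abstract argues against using alone is computed by projecting the estimated signal onto the clean reference and comparing the energy of that target component with the residual. Below is a minimal PyTorch sketch; the function name si_snr and the eps constant are illustrative choices, not taken from the paper's repository:

    import torch

    def si_snr(est, ref, eps=1e-8):
        # Zero-mean both signals so the measure is invariant to scale/offset.
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference to get the target component.
        dot = torch.sum(est * ref, dim=-1, keepdim=True)
        energy = torch.sum(ref * ref, dim=-1, keepdim=True) + eps
        target = dot / energy * ref
        noise = est - target
        # Ratio of target energy to residual ("noise") energy, in dB.
        ratio = torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
        return 10 * torch.log10(ratio + eps)

    # Separation models are typically trained by minimizing the negative SI-SNR:
    # loss = -si_snr(separated, clean).mean()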
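
The abstract does not spell out the internals of its causal TCN blocks, so the sketch below shows only the standard way a TCN block is made causal: the dilated convolution is padded on the left only, so each output frame depends on past input alone. The depthwise-separable structure and residual connection follow Conv-TasNet conventions; the class name and layer sizes are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalTCNBlock(nn.Module):
        # Depthwise-separable 1-D conv block with left-only (causal) padding.
        def __init__(self, channels, hidden, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation  # pad the past only
            self.pointwise_in = nn.Conv1d(channels, hidden, 1)
            self.depthwise = nn.Conv1d(hidden, hidden, kernel_size,
                                       dilation=dilation, groups=hidden)
            self.act = nn.PReLU()
            self.pointwise_out = nn.Conv1d(hidden, channels, 1)

        def forward(self, x):  # x: (batch, channels, time)
            y = self.pointwise_in(x)
            y = F.pad(y, (self.pad, 0))  # left padding keeps the block causal
            y = self.act(self.depthwise(y))
            return x + self.pointwise_out(y)  # residual connection

Because no future frames are consumed, such blocks can be chained and fed frame by frame with a small input buffer, which is what makes streaming inference possible.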

DOI 10.1145/3467707.3467764
Language English
Journal 2021 7th International Conference on Computing and Artificial Intelligence
