IEEE/ACM Transactions on Audio, Speech, and Language Processing | 2021

Speech Enhancement via Attention Masking Network (SEAMNET): An End-to-End System for Joint Suppression of Noise and Reverberation

 
 

Abstract


This paper proposes the Speech Enhancement via Attention Masking Network (SEAMNET), a neural network-based end-to-end single-channel speech enhancement system designed for joint suppression of noise and reverberation. It formalizes an end-to-end network architecture, referred to as b-Net, which accomplishes noise suppression through attention masking in a learned embedding space. A key contribution of SEAMNET is that the b-Net architecture contains both an enhancement and an autoencoder path. This paper proposes a novel loss function which simultaneously trains both the enhancement and the autoencoder paths, so that disabling the masking mechanism during inference causes SEAMNET to reconstruct the input speech signal. This allows dynamic control of the level of suppression applied by SEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speech enhancement. This paper also proposes a perceptually-motivated waveform distance measure. In addition to the b-Net architecture, this paper proposes a novel method for designing target waveforms for network training, so that joint suppression of additive noise and reverberation can be performed by an end-to-end enhancement system, which has not been previously possible. Experimental results show the SEAMNET system to outperform a variety of state-of-the-art baselines systems, both in terms of objective speech quality measures and subjective listening tests. Finally, this paper draws parallels between SEAMNET and conventional statistical model-based enhancement approaches, offering interpretability of many network components.

Volume 29
Pages 515-526
DOI 10.1109/TASLP.2020.3043655
Language English
Journal IEEE/ACM Transactions on Audio, Speech, and Language Processing

Full Text