2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2019

Speaker-discriminative Embedding Learning via Affinity Matrix for Short Utterance Speaker Verification

Abstract


The text-independent short-utterance speaker verification (TI-SUSV) task remains more challenging than the full-length utterance SV task because feature statistics are estimated inaccurately from short utterances (SUs) or the resulting speaker embeddings are insufficiently distinguishable. Recently developed end-to-end SV (E2E-SV) systems, which directly learn a mapping from speech features to compact fixed-length speaker embeddings, achieve state-of-the-art results on several datasets. In this study, following the E2E-SV pipeline, we aim to further improve the accuracy of the TI-SUSV task. Our research is based on two intuitive ideas: a better speech feature representation for SUs and a better training loss function that yields more discriminative embeddings. Specifically, we first design a bidirectional gated recurrent unit network with a residual connection (Res-BGRU) to improve feature representation capability. Second, we propose a novel affinity loss in which the mini-batch data are manipulated to obtain more supervision information. In detail, a speaker identity affinity matrix formed from one-hot speaker identity vectors is taken as the supervisor of the speaker embedding affinity matrix to obtain better inter-speaker separability and intra-speaker compactness. Experimental results on the VoxCeleb1 dataset show that our system outperforms conventional i-vector and x-vector systems on TI-SUSV.
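
The affinity-matrix supervision described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state how the embedding affinity is computed or how the two matrices are compared, so the cosine similarity and the mean squared error used here are assumptions, and the function names (identity_affinity, embedding_affinity, affinity_loss) are hypothetical, not taken from the paper.

```python
import numpy as np

def identity_affinity(labels, num_speakers):
    """Target matrix from one-hot speaker vectors: 1 for same speaker, 0 otherwise."""
    one_hot = np.eye(num_speakers)[labels]      # shape (batch, num_speakers)
    return one_hot @ one_hot.T                  # shape (batch, batch)

def embedding_affinity(embeddings, eps=1e-12):
    """Pairwise cosine similarity of the embeddings (assumed similarity measure)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)  # L2-normalise each row
    return unit @ unit.T                        # shape (batch, batch)

def affinity_loss(embeddings, labels, num_speakers):
    """Mean squared gap between the two affinity matrices (assumed loss form)."""
    target = identity_affinity(labels, num_speakers)
    predicted = embedding_affinity(embeddings)
    return float(np.mean((predicted - target) ** 2))

# Toy mini-batch: two speakers, two utterances each.
labels = np.array([0, 0, 1, 1])

# Embeddings that cluster by speaker match the target matrix.
separated = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
# Embeddings that ignore speaker identity are penalised.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

loss_separated = affinity_loss(separated, labels, num_speakers=2)
loss_mixed = affinity_loss(mixed, labels, num_speakers=2)
print(loss_separated, loss_mixed)
```

In this toy batch every pair of utterances contributes a term to the loss, which is one way to read the claim that the mini-batch yields more supervision information than per-utterance labels alone: same-speaker pairs are pulled together (intra-speaker compactness) and different-speaker pairs are pushed apart (inter-speaker separability).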

Pages 314-319
DOI 10.1109/APSIPAASC47483.2019.9023024
Language English
Journal 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
