Archive | 2019

Combining Speaker Recognition and Metric Learning for Speaker-Dependent Representation Learning


Abstract


In this paper, we tackle automatic speaker verification under a text-independent setting. Speaker modelling is performed by a deep convolutional neural network on top of time-frequency speech representations. Convolutions performed over the time dimension allow the model to capture both short-term dependencies, since the learned filters operate over short windows, and long-term dependencies, since depth in a convolutional stack implies that outputs depend on large portions of the input. Additionally, various pooling strategies across the time dimension are compared so as to effectively map variable-length recordings into fixed-dimensional representations while simultaneously providing the neural network with an extra mechanism to model long-term dependencies. We finally propose a training scheme under which a well-known metric learning approach, namely triplet loss minimization, is performed alongside speaker recognition in a multi-class classification setting. Evaluation on well-known datasets and comparisons with state-of-the-art benchmarks show that the proposed setting is effective in yielding speaker-dependent representations and is thus well-suited for voice biometrics downstream tasks.
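The two core ideas of the abstract — temporal pooling of frame-level features into a fixed-dimensional utterance embedding, and a joint objective combining triplet loss with multi-class speaker classification — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names, the mean-plus-std pooling choice, and the `margin` and `alpha` hyperparameters are assumptions for the sake of the example.

```python
import numpy as np

def stats_pooling(frames):
    # frames: (T, D) variable-length sequence of frame-level features.
    # Mean and std over the time axis map any T to a fixed 2*D embedding
    # (one of several possible pooling strategies; an assumption here).
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-speaker embeddings together, push different-speaker
    # embeddings apart by at least `margin` (hypothetical value).
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

def cross_entropy(logits, label):
    # Multi-class speaker-recognition loss over classifier logits.
    logits = logits - logits.max()          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def joint_loss(anchor, positive, negative, logits, label, alpha=1.0):
    # Joint objective: classification plus metric learning,
    # weighted by an assumed trade-off factor `alpha`.
    return cross_entropy(logits, label) + alpha * triplet_loss(anchor, positive, negative)
```

In such a scheme the classification head is typically discarded after training, and the pooled embeddings are compared directly (e.g. by cosine distance) for verification.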

Pages 4015-4019
DOI 10.21437/interspeech.2019-2974
Language English