Jesús Villalba
University of Zaragoza
Publications
Featured research published by Jesús Villalba.
International Carnahan Conference on Security Technology | 2011
Jesús Villalba; Eduardo Lleida
In this paper, we describe a system for detecting spoofing attacks on speaker verification systems. By spoofing we mean impersonating a legitimate user. We focus on detecting two types of low-technology spoofs. On the one hand, we try to expose whether the test segment is a far-field microphone recording of the victim that has been replayed into a telephone handset using a loudspeaker. On the other hand, we want to determine whether the recording has been created by cutting and pasting short recordings to forge the sentence requested by a text-dependent system. These kinds of attacks are of critical importance for security applications like access to bank accounts. To detect the first type of spoof, we extract several acoustic features from the speech signal; spoof and non-spoof segments are then classified using a support vector machine (SVM). Cut-and-paste attacks are detected by comparing the pitch and MFCC contours of the enrollment and test segments using dynamic time warping (DTW). We performed experiments on two databases created for this purpose, which include signals from landline and GSM telephone channels of 20 different speakers. We present performance results separately for each spoofing detection system and for the fusion of both, achieving error rates under 10% for all the conditions evaluated. We show the degradation of speaker verification performance in the presence of this kind of attack and how spoofing detection can be used to mitigate that degradation.
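As an illustration of the DTW contour comparison described in the abstract, here is a minimal numpy sketch. The toy 1-D "pitch contours" and their values are invented for illustration; the actual system compares pitch and MFCC contours extracted from speech.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D contours
    (e.g. pitch tracks), using absolute difference as local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy contours: a genuine repetition of the enrolled utterance warps
# onto the enrollment contour cheaply, while a cut-and-paste forgery
# with discontinuities accumulates a larger warping cost.
enroll = np.array([100.0, 110.0, 120.0, 115.0, 105.0])
genuine = np.array([100.0, 112.0, 121.0, 114.0, 104.0])
spliced = np.array([100.0, 180.0, 90.0, 160.0, 105.0])
d_genuine = dtw_distance(enroll, genuine)
d_spliced = dtw_distance(enroll, spliced)
```

Thresholding the DTW distance then separates genuine repetitions from spliced forgeries.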
BioID'11 Proceedings of the COST 2101 European conference on Biometrics and ID management | 2011
Jesús Villalba; Eduardo Lleida
In this paper, we describe a system for detecting spoofing attacks on speaker verification systems. By spoofing we mean an attempt to impersonate a legitimate user. We focus on detecting whether the test segment is a far-field microphone recording of the victim. This kind of attack is of critical importance in security applications like access to bank accounts. We present experiments on databases created for this purpose, including landline and GSM telephone channels. We report spoofing detection results with an EER between 0% and 9%, depending on the condition. We show the degradation of speaker verification performance in the presence of this kind of attack and how spoofing detection can be used to mitigate that degradation.
IberSPEECH | 2012
David Martinez; Eduardo Lleida; Alfonso Ortega; Antonio Miguel; Jesús Villalba
This paper presents a set of experiments on pathological voice detection over the Saarbrucken Voice Database (SVD), using the MultiFocal toolkit for discriminative calibration and fusion. The SVD is freely available online and contains a collection of voice recordings of different pathologies, both functional and organic. A generative Gaussian mixture model trained on mel-frequency cepstral coefficients, harmonics-to-noise ratio, normalized noise energy and glottal-to-noise excitation ratio is used as the classifier. Scores are calibrated to increase performance at the desired operating point. Finally, fusing the different recordings available for each speaker, in which the vowels /a/, /i/ and /u/ are pronounced with normal, low, high, and low-high-low intonations, greatly increases performance. Results are compared with the Massachusetts Eye and Ear Infirmary (MEEI) database, which makes it possible to see that the SVD is much more challenging.
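A minimal sketch of the generative classification idea above, reduced to a single diagonal-covariance Gaussian per class (the 1-component limit of a GMM) and scored with a log-likelihood ratio. The random features below stand in for the MFCC, HNR, NNE and GNE features the paper uses; all numbers are invented for illustration.

```python
import numpy as np

class DiagGaussian:
    """Single diagonal-covariance Gaussian per class: the 1-component
    limit of the generative GMM classifier described above."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6  # floor to avoid division by zero
        return self
    def loglik(self, X):
        z = (X - self.mu) ** 2 / self.var
        return -0.5 * np.sum(z + np.log(2 * np.pi * self.var), axis=1)

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(200, 4))       # stand-in features
pathological = rng.normal(1.5, 1.2, size=(200, 4))  # for the two classes
g_h = DiagGaussian().fit(healthy)
g_p = DiagGaussian().fit(pathological)

# Log-likelihood-ratio scores for held-out pathological samples:
# positive scores favour the pathological class.
test = rng.normal(1.5, 1.2, size=(50, 4))
scores = g_p.loglik(test) - g_h.loglik(test)
```

In the paper, such scores are further calibrated and fused across recordings per speaker.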
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013
Jesús Villalba; Eduardo Lleida
In this work, we address the problem of i-vectors that have been produced under different channel conditions. Traditionally, this problem has been handled by training the LDA covariance matrices on data pooled from all the conditions, or by averaging the covariance matrices of each condition in different ways. We present a PLDA variant, which we call multi-channel SPLDA, where the speaker space distribution is common to all i-vectors while the channel space distribution depends on the type of channel in which the segment was recorded. We test our approach on the telephone part of the NIST SRE10 extended condition, where we added additive noises to the test segments. We compare the results of an SPLDA model trained only on clean data, an SPLDA model trained on pooled noisy and clean data, and our MCSPLDA model.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014
Jesús Villalba; Eduardo Lleida
State-of-the-art speaker recognition relies on models that need a large amount of training data. These models are successful in tasks like NIST SRE because sufficient data is available. However, in real applications we usually do not have as much data and, in many cases, the speaker labels are unknown. We present a method to adapt a PLDA model from a domain with a large amount of labeled data to another with unlabeled data. We describe a generative model that produces both sets of data, where the unknown labels are modeled as latent variables, and we use variational Bayes to estimate the hidden variables. We performed experiments adapting a model trained on Switchboard to NIST SRE without labels. The adapted model is evaluated on NIST SRE10. Compared to the non-adapted model, EER improved by 42% and 49% when adapting with 200 and with all the NIST speakers, respectively.
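The variational PLDA adaptation itself is too involved for a snippet, but the metric it is judged by, the equal error rate (EER), can be sketched. This is a simple grid-based approximation over the pooled score set; the toy target and non-target scores are invented.

```python
import numpy as np

def eer(tgt, non):
    """Equal error rate: the operating point where the miss rate
    equals the false-alarm rate, approximated on the pooled score grid."""
    thresholds = np.sort(np.concatenate([tgt, non]))
    best_miss, best_fa = 0.0, 1.0  # extreme starting operating point
    for t in thresholds:
        miss = np.mean(tgt < t)    # targets rejected at threshold t
        fa = np.mean(non >= t)     # non-targets accepted at threshold t
        if abs(miss - fa) < abs(best_miss - best_fa):
            best_miss, best_fa = miss, fa
    return 0.5 * (best_miss + best_fa)

# Perfectly separable toy scores give an EER of zero.
tgt = np.array([2.0, 2.5, 3.0, 4.0])
non = np.array([-1.0, 0.0, 0.5, 1.0])
```

A relative EER improvement such as the 42% reported above is simply (EER_baseline - EER_adapted) / EER_baseline.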
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2011
Luis Javier Rodriguez-Fuentes; Mikel Penagarikano; Amparo Varona; Mireia Diez; Germán Bordel; David Martinez; Jesús Villalba; Antonio Miguel; Alfonso Ortega; Eduardo Lleida; Alberto Abad; Oscar Koller; Isabel Trancoso; Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo; Rahim Saeidi; Mehdi Soufifar; Tomi Kinnunen; Torbjørn Svendsen; Pasi Fränti
The best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resource for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
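Score fusion of this kind is typically a discriminatively trained linear combination of the subsystem scores. A minimal numpy sketch of logistic-regression fusion follows; the two synthetic "systems" and all hyperparameters are invented for illustration.

```python
import numpy as np

def train_linear_fusion(scores, labels, lr=0.5, n_iter=2000):
    """Learn weights w and bias b so that sigmoid(scores @ w + b)
    approximates P(target) -- plain logistic regression by gradient
    descent, the usual discriminative fusion back-end."""
    n, k = scores.shape
    w, b = np.zeros(k), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        w -= lr * (scores.T @ (p - labels)) / n
        b -= lr * np.mean(p - labels)
    return w, b

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 300).astype(float)
# Two noisy subsystem scores, each correlated with the true label.
sys1 = labels + rng.normal(0.0, 1.0, 300)
sys2 = labels + rng.normal(0.0, 1.0, 300)
S = np.stack([sys1, sys2], axis=1)

w, b = train_linear_fusion(S, labels)
fused = S @ w + b                       # fused log-odds scores
acc = np.mean((fused > 0) == (labels > 0.5))
```

Both informative subsystems receive positive weights, and the fused accuracy typically exceeds either system alone.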
IberSPEECH | 2012
Jesús Villalba; Eduardo Lleida; Alfonso Ortega; Antonio Miguel
In some situations, the quality of the signals involved in a speaker verification trial is not good enough to make a reliable decision. In this work, we use Bayesian networks to model the relations between the speaker verification score, a set of speech quality measures, and the trial's reliability. We use this model to detect and discard unreliable trials. We present results on the NIST SRE2010 dataset artificially degraded with different types and levels of additive noise and reverberation. We show that a speaker verification system that is well calibrated for clean speech produces an unacceptable actual DCF on the degraded dataset, and we show how this method can be used to reduce the actual DCF to values lower than 1. We compare results using different quality measures and Bayesian network configurations.
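To illustrate the Bayesian-network idea above in its simplest form, here is a toy naive-Bayes combination of two binary quality indicators into a reliability posterior. All conditional probability tables below are invented numbers, not the paper's model.

```python
# Toy naive-Bayes reliability model: two binary quality indicators
# (snr_ok: SNR above threshold, reverb_ok: low reverberation) are
# conditionally independent given the hidden "reliable" variable.
# All probabilities are illustrative, not from the paper.
p_reliable = 0.9                         # prior that a trial is reliable
p_snr_ok = {True: 0.95, False: 0.30}     # P(snr_ok | reliable?)
p_reverb_ok = {True: 0.90, False: 0.40}  # P(reverb_ok | reliable?)

def reliability_posterior(snr_ok, reverb_ok):
    """P(reliable | observed quality indicators) by Bayes' rule."""
    def lik(reliable):
        l = p_snr_ok[reliable] if snr_ok else 1 - p_snr_ok[reliable]
        l *= p_reverb_ok[reliable] if reverb_ok else 1 - p_reverb_ok[reliable]
        return l
    num = lik(True) * p_reliable
    den = num + lik(False) * (1 - p_reliable)
    return num / den

good = reliability_posterior(True, True)    # clean-looking trial
bad = reliability_posterior(False, False)   # degraded-looking trial
```

Trials whose posterior falls below a chosen threshold would be discarded as unreliable.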
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015
Jesús Villalba; Alfonso Ortega; Antonio Miguel; Eduardo Lleida
This paper describes the ViVoLab speaker diarization system for the Multi-Genre Broadcast (MGB) Challenge at ASRU2015. The challenge data consisted of BBC TV programmes of different genres. Diarization followed a longitudinal setup, i.e., the speakers of the current episode had to be linked to the speakers in previous episodes of the same show. We propose a system based on the i-vector paradigm. After an initial segmentation step, we compute one i-vector per speech segment. Then, a generative model based on Bayesian PLDA clusters the speakers. In this model, the speaker labels are latent variables that we optimize by variational Bayes iterations. The number of speakers in each episode is decided by maximizing the variational lower bound. The system includes several phases of segment merging and re-clustering, and we re-compute i-vectors after each merging step, which reduces the i-vector uncertainty. This approach attained a DER of around 30% on the development set.
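The clustering stage can be caricatured with a much simpler stand-in than the paper's variational-Bayes PLDA: greedy agglomeration of segment vectors by cosine similarity to cluster means. The 2-D toy "speaker" vectors and the threshold are invented; real i-vectors are high-dimensional.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_cluster(vectors, threshold=0.8):
    """Greedy clustering of segment vectors by cosine similarity to
    running cluster means -- a toy stand-in for the VB PLDA clustering
    described above. Returns lists of segment indices per cluster."""
    clusters = []  # each entry: [mean_vector, member_indices]
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(v, c[0])
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([v.copy(), [i]])
        else:
            best[1].append(i)
            best[0] = np.mean([vectors[j] for j in best[1]], axis=0)
    return [c[1] for c in clusters]

# Two well-separated toy "speakers", four segments each.
rng = np.random.default_rng(2)
spk_a = rng.normal([5.0, 0.0], 0.1, size=(4, 2))
spk_b = rng.normal([0.0, 5.0], 0.1, size=(4, 2))
segs = np.vstack([spk_a, spk_b])
groups = greedy_cluster(segs, threshold=0.5)
```

In the actual system, the number of clusters is instead chosen by maximizing the variational lower bound rather than by a similarity threshold.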
Speech Communication | 2016
Jesús Villalba; Alfonso Ortega; Antonio Miguel; Eduardo Lleida
Highlights:

- Quality measures were investigated to predict the reliability of speaker verification.
- Bayesian networks estimate the reliability from the score and quality measures.
- Best measures for additive noise: SNR and modulation index.
- Best measure for reverberation: UBM log-likelihood.
- Novel measures based on vector Taylor series performed well.

Despite the great advances made in the speaker recognition field, like joint factor analysis (JFA) and i-vectors, there are still situations where the quality of the speech signals involved in a speaker verification (SV) trial is not good enough to make reliable decisions. This fact motivated us to investigate speech quality measures that are related to SV performance. We analyzed measures like signal-to-noise ratio (SNR), modulation index, number of speech frames, jitter, shimmer, and the likelihood of the data given the universal background model (UBM), JFA and probabilistic linear discriminant analysis models. In addition, we introduce a novel and promising measure based on the vector Taylor series (VTS) paradigm, used to adapt a clean GMM to noisy speech. We used Bayesian networks to combine these measures and produce a probabilistic reliability measure, which we applied to detect badly classified trials. We trained our Bayesian network on NIST SRE08 distorted with noise and reverberation and evaluated it on a distorted version of SRE10. We found that, for noise, the best measures were SNR and the modulation index; for reverberation, the UBM likelihood. VTS-based measures performed well for both types of distortion.
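The simplest of the quality measures above, SNR, can be computed directly when the clean signal and the added noise are both known, as in the artificially degraded datasets used here. The toy sinusoid standing in for speech and the noise level are invented; in deployment, SNR must be estimated blindly from the degraded signal alone.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from a clean signal and the noise
    that was added to it (oracle SNR, as in artificial degradation)."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(3)
t = np.arange(16000) / 16000.0                 # 1 s at 16 kHz
speech = np.sin(2 * np.pi * 220.0 * t)         # toy "speech" signal
noise = 0.1 * rng.standard_normal(16000)       # additive noise
degraded = speech + noise
snr = snr_db(speech, noise)                    # roughly 17 dB here
```

Such a measure would then feed the Bayesian network as one input node alongside the verification score.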
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013
Diego Castán; Alfonso Ortega; Jesús Villalba; Antonio Miguel; Eduardo Lleida
This paper proposes a novel audio segmentation-by-classification system based on factor analysis (FA), with a channel compensation matrix for each class, which scores fixed-length segments with the class/non-class log-likelihood ratio. The scores are smoothed and the most probable class sequence is computed with the Viterbi algorithm. The system described here is designed to segment and classify audio files from broadcast programmes into five classes: speech (SP), speech with noise (SN), speech with music (SM), music (MU) and others (OT). This task was proposed in the Albayzin 2010 evaluation campaign. The system is compared with the winning system of that evaluation, achieving a lower error rate for SP and SN. Since these classes represent three quarters of the data, the FA segmentation system achieves a reduction in the average segmentation error rate.
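The Viterbi smoothing step above can be sketched directly: given per-segment class log-scores and a transition matrix that penalizes class switches, decode the most probable class sequence. The two-class toy scores and transition probabilities below are invented; the paper uses five classes.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most probable class sequence given per-segment log-scores
    (log_obs: T x K) and log transition/initial probabilities."""
    T, K = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans   # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(K)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Noisy per-segment scores for 2 classes; segment 2 spuriously
# favours class 1 but is smoothed away by the sticky transitions.
scores = np.log(np.array([
    [0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],  # class-0 run
    [0.2, 0.8], [0.1, 0.9], [0.2, 0.8],              # class-1 run
]))
log_trans = np.log(np.array([[0.95, 0.05], [0.05, 0.95]]))
path = viterbi(scores, log_trans, np.log([0.5, 0.5]))
```

The high self-transition probability is what turns frame-level scores into coherent segments.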