Javier Franco-Pedroso

Publication

Featured research published by Javier Franco-Pedroso.

international conference on biometrics | 2013

The 2013 speaker recognition evaluation in mobile environment

Elie Khoury; B. Vesnicer; Javier Franco-Pedroso; Ricardo Paranhos Velloso Violato; Z. Boulkenafet; L. M. Mazaira Fernandez; Mireia Diez; J. Kosmala; Houssemeddine Khemiri; T. Cipr; Rahim Saeidi; Manuel Günther; J. Zganec-Gros; R. Zazo Candil; Flávio Olmos Simões; M. Bengherabi; A. Alvarez Marquina; Mikel Penagarikano; Alberto Abad; M. Boulayemen; Petr Schwarz; D.A. van Leeuwen; J. Gonzalez-Dominguez; M. Uliani Neto; E. Boutellaa; P. Gómez Vilda; Amparo Varona; Dijana Petrovska-Delacrétaz; Pavel Matejka; Joaquin Gonzalez-Rodriguez

This paper evaluates the performance of the twelve primary systems submitted to the evaluation on speaker verification in the context of a mobile environment using the MOBIO database. The mobile environment provides a challenging and realistic test-bed for current state-of-the-art speaker verification techniques. Results in terms of equal error rate (EER), half total error rate (HTER) and detection error trade-off (DET) confirm that the best performing systems are based on total variability modeling, and are the fusion of several sub-systems. Nevertheless, the good old UBM-GMM based systems are still competitive. The results also show that the use of additional data for training as well as gender-dependent features can be helpful.
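The two scalar metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common protocol of fixing the decision threshold at the EER point of a development set and then reporting HTER on a separate evaluation set; the abstract does not spell this out. The function names and the toy scores are hypothetical, not taken from the evaluation.

```python
def error_rates(targets, nontargets, threshold):
    """Return (FAR, FRR) when a trial is accepted if its score >= threshold."""
    far = sum(s >= threshold for s in nontargets) / len(nontargets)
    frr = sum(s < threshold for s in targets) / len(targets)
    return far, frr


def equal_error_rate(targets, nontargets):
    """Approximate the EER by sweeping every observed score as a threshold
    and keeping the one where FAR and FRR are closest."""
    best = None
    for t in sorted(set(targets) | set(nontargets)):
        far, frr = error_rates(targets, nontargets, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]  # (EER estimate, threshold)


def half_total_error_rate(targets, nontargets, threshold):
    """HTER is the mean of FAR and FRR at a fixed threshold."""
    far, frr = error_rates(targets, nontargets, threshold)
    return (far + frr) / 2


# Toy scores (illustrative only): higher means "more likely the same speaker".
dev_targets, dev_nontargets = [2.0, 1.5, 0.2, 1.1], [-1.0, 0.5, -0.3, -2.0]
eval_targets, eval_nontargets = [1.8, 0.4, 0.9, 2.2], [-0.5, 0.7, -1.2, 0.1]

eer, threshold = equal_error_rate(dev_targets, dev_nontargets)
hter = half_total_error_rate(eval_targets, eval_nontargets, threshold)
print(eer, threshold, hter)
```

A DET curve plots the same FAR and FRR pairs across all thresholds, so it can be produced by collecting the output of `error_rates` during the sweep.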

international conference on acoustics, speech, and signal processing | 2011

Calibration and weight of the evidence by human listeners. The ATVS-UAM submission to NIST HUMAN-aided speaker recognition 2010

Daniel Ramos; Javier Franco-Pedroso; Joaquin Gonzalez-Rodriguez

This work analyzes the performance of speaker recognition when carried out by human lay listeners. In forensics, judges and jurors usually share the intuition that people are proficient at distinguishing others by their voices, and therefore opinions about speech evidence are easily elicited just by listening to it, or by means of panels of listeners. There is a danger, however, since little attention has been paid to scientifically measuring the performance of human listeners, or the strength with which they should express their opinions. In this work we perform such a rigorous analysis in the context of NIST Human-Aided Speaker Recognition 2010 (HASR). We recruited a panel of listeners who gave their opinions in the form of scores. We then calibrated those scores using a development set in order to generate calibrated likelihood ratios. Thus, both the discriminating power of human lay listeners and the strength with which they should express their opinions about the speech evidence can be assessed, giving a measure of the amount of information human listeners contribute to the speaker recognition process.

Speech Communication | 2016

Linguistically-constrained formant-based i-vectors for automatic speaker recognition

Javier Franco-Pedroso; Joaquin Gonzalez-Rodriguez

We present an approach to automatic speaker verification through linguistically-constrained i-vector systems based on formant frequencies. An analysis of discriminative and calibration properties is presented for every linguistic unit (phones and diphones). An analysis of the best-performing units for different speakers reveals remarkable speaker-dependent specificities. Different approaches for selection and fusion of different linguistic units are also analysed. The fusion of cepstral-based and formant-based systems obtains improved performance.

This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems providing well calibrated likelihood ratios per comparison of the occurrences of the same isolated linguistic units in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers, 219 male and 298 female), we report equal error rates of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ~6% in EER (from 6.54 to 6.13%) and ~15% in minDCF (from 0.0327 to 0.0279), compared to the cepstral system alone.
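The relative improvements quoted for the fused system can be checked with a short calculation. This is only a worked arithmetic sketch: the helper name `relative_improvement` is hypothetical, and the four input figures are the ones reported in the abstract.

```python
def relative_improvement(baseline, fused):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - fused) / baseline


# Figures from the abstract: cepstral system alone vs. fusion with formants.
eer_gain = relative_improvement(6.54, 6.13)        # EER in percent
dcf_gain = relative_improvement(0.0327, 0.0279)    # minDCF

print(f"EER: {eer_gain:.1f}%  minDCF: {dcf_gain:.1f}%")
```

Both results agree with the rounded ~6% and ~15% figures given in the text.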

PLOS ONE | 2016

Gaussian Mixture Models of Between-Source Variation for Likelihood Ratio Computation from Multivariate Data

Javier Franco-Pedroso; Daniel Ramos; Joaquin Gonzalez-Rodriguez

In forensic science, trace evidence found at a crime scene and on a suspect has to be evaluated from the measurements performed on it, usually in the form of multivariate data (for example, several chemical compounds or physical characteristics). In order to assess the strength of that evidence, the likelihood ratio framework is being increasingly adopted. Several methods have been derived in order to obtain likelihood ratios directly from univariate or multivariate data by modelling both the variation appearing between observations (or features) coming from the same source (within-source variation) and that appearing between observations coming from different sources (between-source variation). In the widely used multivariate kernel likelihood ratio, the within-source distribution is assumed to be normal and constant among different sources, and the between-source variation is modelled through a kernel density function (KDF). In order to better fit the observed distribution of the between-source variation, this paper presents a different approach in which a Gaussian mixture model (GMM) is used instead of a KDF. As will be shown, this approach provides better-calibrated likelihood ratios as measured by the log-likelihood ratio cost (Cllr) in experiments performed on freely available forensic datasets involving different types of trace evidence: inks, glass fragments and car paints.
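The log-likelihood ratio cost mentioned above can be sketched as follows. The code uses the commonly cited definition of Cllr, which is an assumption here since the abstract does not state the formula, and it does not reproduce the paper's GMM or KDF models. The function name and the toy likelihood ratios are hypothetical.

```python
import math


def cllr(same_source_lrs, different_source_lrs):
    """Log-likelihood ratio cost (assumed standard definition).

    Penalises same-source comparisons with small LRs and
    different-source comparisons with large LRs. Lower is better;
    a system that always outputs LR = 1 scores exactly 1.
    """
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_source_lrs)
    same_term /= len(same_source_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in different_source_lrs)
    diff_term /= len(different_source_lrs)
    return 0.5 * (same_term + diff_term)


# Toy likelihood ratios (illustrative only, not from the paper's datasets).
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
well_behaved = cllr([100.0, 50.0], [0.01, 0.02])
misleading = cllr([0.01], [100.0])
print(uninformative, well_behaved, misleading)
```

Values above 1 indicate likelihood ratios that mislead more than reporting no evidence at all, which is why the measure is sensitive to calibration and not only to discrimination.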

international conference on biometrics | 2013

Formant trajectories in linguistic units for text-independent speaker recognition

Javier Franco-Pedroso; Fernando Espinoza-Cuadros; Joaquin Gonzalez-Rodriguez

Inspired by successful work in forensic speaker identification, this work presents a higher-level system for text-independent speaker recognition by means of the temporal trajectories of formant frequencies in linguistic units. Feature extraction from unit-dependent trajectories provides a very flexible system that can be applied in different scenarios. At a fine-grained level, it is possible to provide a calibrated likelihood ratio per linguistic unit under analysis (extremely useful in applications such as forensics), and at a coarse-grained level, the individual contributions of different units can be combined to obtain a more discriminative single system with high potential for combination with short-term spectral systems. With development data extracted from the NIST SRE 2004 and 2005 datasets, this approach has been tested on the NIST SRE 2006 1side-1side task, English-only male trials, consisting of 9,720 trials from 219 speakers. Remarkable results have been obtained for some single units from extremely short segments of speech, and the combination of several units leads to a relative improvement of 17.2% in EER when fusing with an i-vector system.

Odyssey 2016 | 2016

Feature-based likelihood ratios for speaker recognition from linguistically-constrained formant-based i-vectors.

Javier Franco-Pedroso; Joaquin Gonzalez-Rodriguez

In this paper, a probabilistic model is introduced to obtain feature-based likelihood ratios from linguistically-constrained formant-based i-vectors in a NIST SRE task. Linguistically-constrained formant-based i-vectors summarize both the static and dynamic information of formant frequencies in the occurrences of a given linguistic unit in a speech recording. In this work, a two-covariance model is applied to these “higher-level” features in order to obtain likelihood ratios through a probabilistic framework. While the performance of the individual linguistically-constrained systems is not comparable to that of a state-of-the-art cepstral-based system, calibration loss is low enough to provide informative likelihood ratios that can be directly used, for instance, in forensic applications. Furthermore, this procedure avoids the need for further calibration steps, which usually require additional datasets. Finally, the fusion of several linguistically-constrained systems greatly improves the overall performance, achieving very remarkable results for a system solely based on formant features. Testing on the English-only trials of the core condition of the NIST 2006 SRE (and using only NIST SRE 2004 and 2005 data for background and development, respectively), we report equal error rates of 8.47% and 9.88% for male and female speakers respectively, using only formant frequencies as speaker discriminative information.

Entropy | 2018

Deconstructing Cross-Entropy for Probabilistic Binary Classifiers

Daniel Ramos; Javier Franco-Pedroso; Alicia Lozano-Diez; Joaquin Gonzalez-Rodriguez

In this work, we analyze the cross-entropy function, widely used in classifiers both as a performance measure and as an optimization objective. We contextualize cross-entropy in the light of Bayesian decision theory, the formal probabilistic framework for making decisions, and we thoroughly analyze its motivation, meaning and interpretation from an information-theoretical point of view. In this sense, this article presents several contributions: First, we explicitly analyze the contribution to cross-entropy of (i) prior knowledge; and (ii) the value of the features in the form of a likelihood ratio. Second, we introduce a decomposition of cross-entropy into two components: discrimination and calibration. This decomposition enables the measurement of different performance aspects of a classifier in a more precise way; and justifies previously reported strategies to obtain reliable probabilities by means of the calibration of the output of a discriminating classifier. Third, we give different information-theoretical interpretations of cross-entropy, which can be useful in different application scenarios, and which are related to the concept of reference probabilities. Fourth, we present an analysis tool, the Empirical Cross-Entropy (ECE) plot, a compact representation of cross-entropy and its aforementioned decomposition. We show the power of ECE plots, as compared to other classical performance representations, in two diverse experimental examples: a speaker verification system, and a forensic case where some glass findings are present.
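The quantity behind an ECE plot can be sketched for a single prior as follows. The formula is the commonly cited definition of empirical cross-entropy and is an assumption here, since the abstract does not give it. The sketch covers only the ECE value and the neutral reference of a system that always outputs LR = 1, not the discrimination and calibration decomposition. All names and toy values are hypothetical.

```python
import math


def empirical_cross_entropy(target_lrs, nontarget_lrs, prior):
    """ECE at a given prior probability of the target hypothesis.

    Each likelihood ratio is combined with the prior odds, and the
    logarithmic cost of the resulting posterior is averaged per class
    and weighted by the class prior.
    """
    odds = prior / (1 - prior)
    target_cost = sum(math.log2(1 + 1 / (lr * odds)) for lr in target_lrs)
    target_cost /= len(target_lrs)
    nontarget_cost = sum(math.log2(1 + lr * odds) for lr in nontarget_lrs)
    nontarget_cost /= len(nontarget_lrs)
    return prior * target_cost + (1 - prior) * nontarget_cost


def prior_entropy(prior):
    """Entropy of the prior alone, in bits: the cost of ignoring the evidence."""
    return -prior * math.log2(prior) - (1 - prior) * math.log2(1 - prior)


# Toy likelihood ratios (illustrative only).
neutral = empirical_cross_entropy([1.0, 1.0], [1.0], prior=0.2)
informative = empirical_cross_entropy([20.0, 8.0], [0.1, 0.05], prior=0.5)
print(neutral, prior_entropy(0.2), informative)
```

Evaluating `empirical_cross_entropy` over a grid of priors and plotting it against the neutral reference gives the basic shape of the plot: useful likelihood ratios keep the curve below the reference.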

international conference on acoustics, speech, and signal processing | 2012

A linguistically-motivated speaker recognition front-end through session variability compensated cepstral trajectories in phone units

Joaquin Gonzalez-Rodriguez; Javier Gonzalez-Dominguez; Javier Franco-Pedroso; Daniel Ramos

In this paper, a new linguistically-motivated front-end is presented, showing major performance improvements from the use of session variability compensated cepstral trajectories in phone units. Extending our recent work on temporal contours in linguistic units (TCLU), we have combined the potential of those unit-dependent trajectories with the ability of feature-domain factor analysis techniques to compensate session variability effects, which has resulted in consistent and discriminant phone-dependent trajectories across different recording sessions. Evaluating on the NIST SRE04 English-only 1s1s task, we report EERs as low as 5.40% from the trajectories in a single phone, with each of 29 different phones producing EERs below 10%, and additionally showing excellent calibration performance per unit. The combination of different units shows significant complementarity, with EERs of 1.63% (100×DCF=0.732) from a simple sum fusion of the 23 best phones, or 0.68% (100×DCF=0.304) when fusing them through logistic regression.

IEEE Journal of Selected Topics in Signal Processing | 2010

Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009

Javier Gonzalez-Dominguez; Ignacio Lopez-Moreno; Javier Franco-Pedroso; Daniel Ramos; Doroteo Torre Toledano; Joaquin Gonzalez-Rodriguez

Eurasip Journal on Audio, Speech, and Music Processing | 2015

Albayzín-2014 evaluation: audio segmentation and classification in broadcast news domains

Diego Castán; David Tavarez; Paula Lopez-Otero; Javier Franco-Pedroso; Héctor Delgado; Eva Navas; Laura Docio-Fernandez; Daniel Ramos; Javier Serrano; Alfonso Ortega; Eduardo Lleida
