Luis Javier Rodriguez-Fuentes
University of the Basque Country
Publication
Featured research published by Luis Javier Rodriguez-Fuentes.
international conference on biometrics | 2013
Elie Khoury; B. Vesnicer; Javier Franco-Pedroso; Ricardo Paranhos Velloso Violato; Z. Boulkcnafet; L. M. Mazaira Fernandez; Mireia Diez; J. Kosmala; Houssemeddine Khemiri; T. Cipr; Rahim Saeidi; Manuel Günther; J. Zganec-Gros; R. Zazo Candil; Flávio Olmos Simões; M. Bengherabi; A. Alvarez Marquina; Mikel Penagarikano; Alberto Abad; M. Boulayemen; Petr Schwarz; D.A. van Leeuwen; J. Gonzalez-Dominguez; M. Uliani Neto; E. Boutellaa; P. Gómez Vilda; Amparo Varona; Dijana Petrovska-Delacrétaz; Pavel Matejka; Joaquin Gonzalez-Rodriguez
This paper evaluates the performance of the twelve primary systems submitted to the evaluation on speaker verification in the context of a mobile environment using the MOBIO database. The mobile environment provides a challenging and realistic test-bed for current state-of-the-art speaker verification techniques. Results in terms of equal error rate (EER), half total error rate (HTER) and detection error trade-off (DET) confirm that the best performing systems are based on total variability modeling, and are the fusion of several sub-systems. Nevertheless, the good old UBM-GMM based systems are still competitive. The results also show that the use of additional data for training as well as gender-dependent features can be helpful.
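For reference, the metrics named above can be computed directly from trial scores. The following sketch is our own illustrative code, not part of the MOBIO evaluation protocol; function names and the threshold policy are assumptions:

```python
import numpy as np

def far_frr(scores_tgt, scores_non, threshold):
    """False acceptance and false rejection rates at a given threshold."""
    far = np.mean(scores_non >= threshold)   # impostors accepted
    frr = np.mean(scores_tgt < threshold)    # targets rejected
    return far, frr

def eer(scores_tgt, scores_non):
    """Equal error rate: sweep thresholds and take the point where
    FAR and FRR are closest (their average approximates the EER)."""
    thresholds = np.sort(np.concatenate([scores_tgt, scores_non]))
    best = min((abs(far - frr), (far + frr) / 2)
               for far, frr in (far_frr(scores_tgt, scores_non, t)
                                for t in thresholds))
    return best[1]

def hter(scores_tgt, scores_non, threshold):
    """Half total error rate: average of FAR and FRR at a fixed threshold
    (typically set on a development set)."""
    far, frr = far_frr(scores_tgt, scores_non, threshold)
    return (far + frr) / 2
```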
international conference on acoustics, speech, and signal processing | 2014
Luis Javier Rodriguez-Fuentes; Amparo Varona; Mikel Penagarikano; Germán Bordel; Mireia Diez
In recent years, the task of Query-by-Example Spoken Term Detection (QbE-STD), which aims to find occurrences of a spoken query in a set of audio documents, has gained the interest of the research community for its versatility in settings where untranscribed, multilingual and acoustically unconstrained spoken resources, or spoken resources in low-resource languages, must be searched. This paper describes and reports experimental results for a QbE-STD system that achieved the best performance in the recent Spoken Web Search (SWS) evaluation, held as part of MediaEval 2013. Though not optimized for speed, the system operates faster than real-time. The system exploits high-performance phone decoders to extract frame-level phone posteriors (a common representation in QbE-STD tasks). Then, given a query and an audio document, a distance matrix is computed between their phone posterior representations, followed by a newly introduced distance normalization technique and an iterative Dynamic Time Warping (DTW) matching procedure with some heuristic prunings. Results show that remarkable performance improvements can be achieved by using multiple examples per query and, especially, through the late (score-level) fusion of different subsystems, each based on a different set of phone posteriors.
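A rough sketch of the matching core described above (distance matrix plus subsequence DTW), assuming frame-level posterior matrices. The per-column min-max scaling is only a stand-in for the paper's distance normalization technique, which it does not reproduce:

```python
import numpy as np

def distance_matrix(query_post, doc_post, eps=1e-10):
    """Frame-wise distance between two phone-posterior sequences
    (rows are frames, columns are phones). -log of the posterior dot
    product is a common choice in QbE-STD."""
    d = -np.log(np.clip(query_post @ doc_post.T, eps, None))
    return (d - d.min(axis=0)) / (d.max(axis=0) - d.min(axis=0) + eps)

def subsequence_dtw(dist):
    """Subsequence DTW: the query may start and end anywhere in the
    document. Returns the best end frame and a length-normalized cost."""
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                      # free starting point
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # vertical
                                         acc[i, j - 1],      # horizontal
                                         acc[i - 1, j - 1])  # diagonal
    end = int(np.argmin(acc[-1, :]))
    return end, float(acc[-1, end]) / n
```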
spoken language technology workshop | 2012
Mireia Diez; Amparo Varona; Mikel Penagarikano; Luis Javier Rodriguez-Fuentes; Germán Bordel
This paper presents an alternative feature set to the traditional MFCC-SDC used in acoustic approaches to Spoken Language Recognition: the log-likelihood ratios of phone posterior probabilities, hereafter Phone Log-Likelihood Ratios (PLLR), produced by a phone recognizer. In this work, an iVector system trained on this set of features (plus dynamic coefficients) is evaluated and compared to (1) an acoustic iVector system (trained on the MFCC-SDC feature set) and (2) a phonotactic (Phone-lattice-SVM) system, using two different benchmarks: the NIST 2007 and 2009 LRE datasets. iVector systems trained on PLLR features proved to be competitive, reaching or even outperforming the MFCC-SDC-based iVector and the phonotactic systems. The fusion of the proposed approach with the acoustic and phonotactic systems provided even more significant improvements, outperforming state-of-the-art systems on both benchmarks.
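Under uniform phone priors, the log-likelihood ratio for each phone reduces to the log-odds (logit) of its posterior, which is how PLLR features are usually described. A minimal sketch, with clipping added for numerical safety:

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """Phone Log-Likelihood Ratios: the logit of each phone posterior,
    computed frame by frame. `posteriors` has shape (frames, phones)
    and each row sums to one."""
    p = np.clip(posteriors, eps, 1.0 - eps)
    return np.log(p) - np.log(1.0 - p)
```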
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Mikel Penagarikano; Amparo Varona; Luis Javier Rodriguez-Fuentes; Germán Bordel
Most common approaches to phonotactic language recognition deal with several independent phone decodings. These decodings are processed and scored in a fully uncoupled way, their time alignment (and the information that may be extracted from it) being completely lost. Recently, we have presented two new approaches to phonotactic language recognition which take into account time alignment information, by considering time-synchronous cross-decoder phone co-occurrences. Experiments on the 2007 NIST LRE database demonstrated that using phone co-occurrence statistics could improve the performance of baseline phonotactic recognizers. In this paper, approaches based on time-synchronous cross-decoder phone co-occurrences are further developed and evaluated with regard to a baseline SVM-based phonotactic system, by using: 1) counts of n-grams (up to 4-grams) of phone co-occurrences; and 2) the degree of co-occurrence of phone n-grams (up to 4-grams). To evaluate these approaches, a choice of open software (Brno University of Technology phone decoders, LIBLINEAR and FoCal) was used, and experiments were carried out on the 2007 NIST LRE database. The two approaches presented in this paper outperformed the baseline phonotactic system, yielding around 7% relative improvement in terms of CLLR. The fusion of the baseline system with the two proposed approaches yielded 1.83% EER and CLLR=0.270 (meaning 18% relative improvement), the same performance (on the same task) as state-of-the-art phonotactic systems which apply more complex models and techniques, thus supporting the use of cross-decoder dependencies for language recognition.
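A minimal sketch of how time-synchronous co-occurrence n-grams might be counted. Collapsing runs of identical frame pairs into single units is our reading of the approach; the paper gives the precise unit definition:

```python
from collections import Counter

def cooccurrence_ngrams(labels_a, labels_b, n=3):
    """Count n-grams of time-synchronous cross-decoder phone co-occurrences.
    `labels_a` and `labels_b` are frame-level phone labels produced by two
    decoders over the same audio. Each frame yields a pair
    (phone_from_A, phone_from_B); runs of identical pairs are collapsed
    into single units, and n-grams of those units are counted."""
    pairs = list(zip(labels_a, labels_b))
    units = [p for i, p in enumerate(pairs) if i == 0 or p != pairs[i - 1]]
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))
```

In an SVM-based system such as the one described above, these counts would then be normalized into a fixed-length vector per utterance before training.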
international conference on acoustics, speech, and signal processing | 2015
Xavier Anguera; Luis Javier Rodriguez-Fuentes; Andi Buzo; Florian Metze; Igor Szöke; Mikel Penagarikano
In this paper, we present the task and describe the main findings of the 2014 “Query-by-Example Speech Search Task” (QUESST) evaluation. The purpose of QUESST was to perform language independent search of spoken queries on spoken documents, while targeting languages or acoustic conditions for which very few speech resources are available. This evaluation investigated for the first time the performance of query-by-example search against morphological and morpho-syntactic variability, requiring participants to match variants of a spoken query in several languages of different morphological complexity. Another novelty is the use of the normalized cross entropy cost (Cnxe) as the primary performance metric, keeping Term-Weighted Value (TWV) as a secondary metric for comparison with previous evaluations. After analyzing the most competitive submissions (by five teams), we find that, although low-level “pattern matching” approaches provide the best performance for “exact” matches, “symbolic” approaches working on higher-level representations seem to perform better in more complex settings, such as matching morphological variants. Finally, optimizing the output scores for Cnxe seems to generate systems that are more robust to differences in the operating point and that also perform well in terms of TWV, whereas the opposite might not always be true.
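For orientation, the sketch below implements the standard normalized cross-entropy cost on calibrated log-likelihood-ratio scores; QUESST's exact trial weighting may differ, so treat this as illustrative:

```python
import numpy as np

def cnxe(llrs_tgt, llrs_non, p_target=0.5):
    """Normalized cross-entropy cost of detection log-likelihood ratios.
    A perfectly calibrated, perfectly discriminating system approaches 0;
    a non-informative system (all-zero LLRs) scores 1."""
    logit_prior = np.log(p_target / (1 - p_target))
    post_tgt = 1 / (1 + np.exp(-(llrs_tgt + logit_prior)))   # P(target|score)
    post_non = 1 / (1 + np.exp(-(llrs_non + logit_prior)))
    xe = -(p_target * np.mean(np.log2(post_tgt))
           + (1 - p_target) * np.mean(np.log2(1 - post_non)))
    prior_entropy = -(p_target * np.log2(p_target)
                      + (1 - p_target) * np.log2(1 - p_target))
    return xe / prior_entropy
```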
international conference on acoustics, speech, and signal processing | 2011
Mikel Penagarikano; Amparo Varona; Luis Javier Rodriguez-Fuentes; Germán Bordel
Due to computational bounds, most SVM-based phonotactic language recognition systems consider only low-order n-grams (up to n = 3), thus limiting the potential performance of this approach. The huge number of n-grams for n ≥ 4 makes it computationally unfeasible even to select the most frequent ones. In this paper, we demonstrate the feasibility and usefulness of using high-order n-grams for n = 4, 5, 6, 7 in SVM-based phonotactic language recognition, thanks to a dynamic n-gram selection algorithm. The most frequent n-grams are selected, but memory issues are avoided, since counts are periodically updated and only those units with the highest counts are retained for subsequent processing. Systems were built by means of open software (Brno University of Technology phone decoders, HTK, LIBLINEAR and FoCal) and experiments were carried out on the NIST LRE2007 database. Applying the proposed approach, a 1.36% EER was achieved when using up to 4-grams, 1.32% EER when using up to 5-grams (11.2% improvement with regard to using up to 3-grams) and 1.34% EER when using up to 6-grams or 7-grams.
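The selection algorithm can be pictured as a running counter that is periodically pruned so memory stays bounded. The schedule and limits below are placeholders, not the paper's settings, and pruned units lose their partial counts (the heuristic the paper accepts in exchange for feasibility):

```python
from collections import Counter

def select_top_ngrams(utterances, n, top_k=200000, prune_every=1000):
    """Dynamic n-gram selection: accumulate counts over phone sequences,
    but every `prune_every` utterances keep only the `top_k` most
    frequent units. `utterances` is an iterable of phone-label lists."""
    counts = Counter()
    for idx, phones in enumerate(utterances, 1):
        counts.update(tuple(phones[i:i + n])
                      for i in range(len(phones) - n + 1))
        if idx % prune_every == 0:
            counts = Counter(dict(counts.most_common(top_k)))
    return Counter(dict(counts.most_common(top_k)))
```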
ieee automatic speech recognition and understanding workshop | 2011
Luis Javier Rodriguez-Fuentes; Mikel Penagarikano; Amparo Varona; Mireia Diez; Germán Bordel; David Martinez; Jesús Villalba; Antonio Miguel; Alfonso Ortega; Eduardo Lleida; Alberto Abad; Oscar Koller; Isabel Trancoso; Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo; Rahim Saeidi; Mehdi Soufifar; Tomi Kinnunen; Torbjørn Svendsen; Pasi Fränti
Best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resort for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine the best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such a good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
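Score-level fusion in this line of work is typically linear logistic regression (the method behind the FoCal toolkit used in several of the systems listed here). A compact sketch, assuming 0/1 target labels and a scores matrix of shape (systems, trials):

```python
import numpy as np
from scipy.optimize import minimize

def train_fusion(scores, labels):
    """Learn weights w and bias b so that w @ scores + b approximates a
    calibrated log-likelihood ratio, by minimizing the logistic loss."""
    n_sys = scores.shape[0]

    def nll(params):
        w, b = params[:n_sys], params[n_sys]
        llr = w @ scores + b
        sign = 2 * labels - 1                      # {0,1} -> {-1,+1}
        return np.mean(np.log1p(np.exp(-llr * sign)))

    res = minimize(nll, np.zeros(n_sys + 1), method="L-BFGS-B")
    return res.x[:n_sys], res.x[n_sys]

def fuse(scores, w, b):
    """Apply the learned fusion to a new (n_systems, n_trials) matrix."""
    return w @ scores + b
```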
international conference on acoustics, speech, and signal processing | 2010
Maider Zamalloa; Luis Javier Rodriguez-Fuentes; Germán Bordel; Mikel Penagarikano; Juan Pedro Uribe
Ambient Intelligence aims to create smart spaces providing services in a transparent and non-intrusive fashion, so context awareness and user adaptation are key issues. Speech can be exploited for user adaptation in such scenarios by continuously tracking speaker identity. However, most speaker tracking approaches require processing the full audio recording before determining speaker turns, which makes them unsuitable for online processing and low-latency decision-making. In this work a low-latency speaker tracking system is presented, which deals with continuous audio streams and outputs decisions at one-second intervals, by scoring fixed-length audio segments with a set of target speaker models. A smoothing technique is explored, based on the scores of past segments, which increases the robustness of tracking decisions to local variability. Experimental results are reported on the AMI Corpus of meeting conversations, revealing the effectiveness of the proposed approach when compared to an offline speaker tracking approach developed for reference.
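A schematic of the segment-by-segment decision loop with past-only score smoothing; the window length, thresholding and scoring function are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def track_speakers(stream_segments, speaker_models, score_fn,
                   window=5, threshold=0.0):
    """Low-latency tracking sketch: score each one-second segment against
    all target models, smooth with a moving average over the last few
    segments, and emit a decision per segment (None = no known speaker)."""
    history = []
    decisions = []
    for segment in stream_segments:
        raw = np.array([score_fn(segment, m) for m in speaker_models])
        history.append(raw)
        smoothed = np.mean(history[-window:], axis=0)  # past segments only
        best = int(np.argmax(smoothed))
        decisions.append(best if smoothed[best] > threshold else None)
    return decisions
```

Because only past scores enter the average, each decision is available as soon as its segment ends, which is what keeps the latency at one second.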
IEEE Signal Processing Letters | 2014
Mireia Diez; Amparo Varona; Mikel Penagarikano; Luis Javier Rodriguez-Fuentes; Germán Bordel
The so-called Phone Log-Likelihood Ratio (PLLR) features have been recently introduced as a novel and effective way of retrieving acoustic-phonetic information in spoken language and speaker recognition systems. In this letter, an in-depth insight into the PLLR feature space is provided and the multidimensional distribution of these features is analyzed in a language recognition system. The study reveals that PLLR features are confined into a subspace that strongly bounds PLLR distributions. To enhance the information retrieved by the system, PLLR features are projected into a hyper-plane that provides a more suitable representation of the subspace where the features lie. After applying the projection method, PCA is used to decorrelate the features. Gains attained on each step of the proposed approach are outlined and compared to simple PCA projection. Experiments carried out on NIST 2007, 2009 and 2011 LRE datasets demonstrate the effectiveness of the proposed method, which yields up to a 27% relative improvement with regard to the system based on the original features.
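Our reading of the two-step transform: project the PLLR vectors onto the hyperplane orthogonal to the all-ones direction (reflecting the constraint that phone posteriors sum to one), then decorrelate with PCA. A sketch under that assumption; the paper derives the exact projection:

```python
import numpy as np

def project_and_decorrelate(pllr, out_dim=None):
    """(1) Remove the component along the all-ones direction, then
    (2) PCA on the projected features. `pllr` has shape (frames, dims)."""
    d = pllr.shape[1]
    u = np.ones(d) / np.sqrt(d)
    projected = pllr - np.outer(pllr @ u, u)    # drop the 1-direction
    centered = projected - projected.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # largest variance first
    k = out_dim if out_dim is not None else d - 1
    return centered @ eigvecs[:, order[:k]]     # null dimension discarded
```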
IEEE Signal Processing Letters | 2014
Mireia Diez; Amparo Varona; Mikel Penagarikano; Luis Javier Rodriguez-Fuentes; Germán Bordel
In this letter, we apply Phone Log-Likelihood Ratio (PLLR) features to the task of speaker recognition. PLLRs, which are computed on the phone posterior probabilities provided by phone decoders, convey acoustic-phonetic information in a sequence of frame-level vectors, and therefore can be easily plugged into traditional acoustic systems, just by replacing the Mel-Frequency Cepstral Coefficients (MFCC) or an alternate representation. To study the performance of the proposed features, MFCC-based and PLLR-based systems are trained under an i-vector-PLDA approach. Results on the NIST 2010 and 2012 Speaker Recognition Evaluation databases show that, despite yielding lower performance than the acoustic system, the system based on PLLR features does provide significant gains when both systems are fused, which reveals a complementarity among features, and provides a suitable and effective way of using higher level phonetic information in speaker recognition systems.