Publication


Featured research published by Mikel Penagarikano.


International Conference on Biometrics | 2013

The 2013 speaker recognition evaluation in mobile environment

Elie Khoury; B. Vesnicer; Javier Franco-Pedroso; Ricardo Paranhos Velloso Violato; Z. Boulkenafet; L. M. Mazaira Fernandez; Mireia Diez; J. Kosmala; Houssemeddine Khemiri; T. Cipr; Rahim Saeidi; Manuel Günther; J. Zganec-Gros; R. Zazo Candil; Flávio Olmos Simões; M. Bengherabi; A. Alvarez Marquina; Mikel Penagarikano; Alberto Abad; M. Boulayemen; Petr Schwarz; D.A. van Leeuwen; J. Gonzalez-Dominguez; M. Uliani Neto; E. Boutellaa; P. Gómez Vilda; Amparo Varona; Dijana Petrovska-Delacrétaz; Pavel Matejka; Joaquin Gonzalez-Rodriguez

This paper evaluates the performance of the twelve primary systems submitted to the speaker verification evaluation in a mobile environment, using the MOBIO database. The mobile environment provides a challenging and realistic test-bed for current state-of-the-art speaker verification techniques. Results in terms of equal error rate (EER), half total error rate (HTER) and detection error trade-off (DET) confirm that the best performing systems are based on total variability modeling and are fusions of several sub-systems. Nevertheless, classical UBM-GMM based systems remain competitive. The results also show that the use of additional data for training, as well as gender-dependent features, can be helpful.
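
The metrics named above have compact definitions: the EER is the operating point where false acceptances and false rejections are equally frequent, and the HTER averages the two error rates at a fixed threshold. A minimal sketch, assuming hypothetical genuine/impostor score arrays rather than MOBIO data:

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """False acceptance and false rejection rates at a given threshold."""
    far = np.mean(impostor >= threshold)  # impostors wrongly accepted
    frr = np.mean(genuine < threshold)    # genuine trials wrongly rejected
    return far, frr

def eer(genuine, impostor):
    """Equal error rate: the point where FAR and FRR meet; approximated
    here by scanning every observed score as a candidate threshold."""
    thresholds = np.concatenate([genuine, impostor])
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*far_frr(genuine, impostor, t))))
    return float(np.mean(far_frr(genuine, impostor, best)))

def hter(genuine, impostor, threshold):
    """Half total error rate: the average of FAR and FRR at a fixed
    threshold, typically chosen on a development set."""
    far, frr = far_frr(genuine, impostor, threshold)
    return (far + frr) / 2.0

# Toy scores (hypothetical): higher means "same speaker"
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)
print(f"EER  ~ {eer(genuine, impostor):.3f}")
print(f"HTER ~ {hter(genuine, impostor, 1.0):.3f}")
```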


International Conference on Acoustics, Speech, and Signal Processing | 2014

High-performance Query-by-Example Spoken Term Detection on the SWS 2013 evaluation

Luis Javier Rodriguez-Fuentes; Amparo Varona; Mikel Penagarikano; Germán Bordel; Mireia Diez

In recent years, the task of Query-by-Example Spoken Term Detection (QbE-STD), which aims to find occurrences of a spoken query in a set of audio documents, has gained the interest of the research community for its versatility in settings where untranscribed, multilingual and acoustically unconstrained spoken resources, or spoken resources in low-resource languages, must be searched. This paper describes and reports experimental results for a QbE-STD system that achieved the best performance in the recent Spoken Web Search (SWS) evaluation, held as part of MediaEval 2013. Though not optimized for speed, the system operates faster than real time. The system exploits high-performance phone decoders to extract frame-level phone posteriors (a common representation in QbE-STD tasks). Then, given a query and an audio document, a distance matrix is computed between their phone posterior representations, followed by a newly introduced distance normalization technique and an iterative Dynamic Time Warping (DTW) matching procedure with some heuristic prunings. Results show that remarkable performance improvements can be achieved by using multiple examples per query and, especially, through the late (score-level) fusion of different subsystems, each based on a different set of phone posteriors.
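
As a rough illustration of the matching step described above, here is a minimal subsequence-DTW sketch over a distance matrix of phone posteriors. The paper's distance normalization and pruning heuristics are omitted, the distance choice is one common option rather than the authors' exact one, and the posteriorgrams are synthetic:

```python
import numpy as np

def posterior_distance(query, doc):
    """Pairwise distance between frame-level phone posterior vectors;
    -log of the inner product is one common choice for posterior features."""
    sim = query @ doc.T                      # (n_query, n_doc) similarities
    return -np.log(np.maximum(sim, 1e-10))   # guard against log(0)

def dtw_search(dist):
    """Subsequence DTW: the query may start and end anywhere in the
    document; allowed steps are (1,0), (0,1), (1,1)."""
    n_q, n_d = dist.shape
    acc = np.full((n_q + 1, n_d + 1), np.inf)
    acc[0, :] = 0.0  # free start point anywhere in the document
    for i in range(1, n_q + 1):
        for j in range(1, n_d + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n_q, 1:].min() / n_q  # best end point, length-normalized

# Synthetic posteriorgrams: rows are frames, columns are phone units
rng = np.random.default_rng(0)
query = rng.dirichlet(np.ones(40), size=50)  # 50 frames, 40 phone units
doc = rng.dirichlet(np.ones(40), size=300)   # 300-frame audio document
cost = dtw_search(posterior_distance(query, doc))
print(f"detection score (lower is better): {cost:.3f}")
```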


Spoken Language Technology Workshop | 2012

On the use of phone log-likelihood ratios as features in spoken language recognition

Mireia Diez; Amparo Varona; Mikel Penagarikano; Luis Javier Rodriguez-Fuentes; Germán Bordel

This paper presents an alternative feature set to the traditional MFCC-SDC used in acoustic approaches to Spoken Language Recognition: the log-likelihood ratios of phone posterior probabilities, hereafter Phone Log-Likelihood Ratios (PLLR), produced by a phone recognizer. In this work, an iVector system trained on this set of features (plus dynamic coefficients) is evaluated and compared to (1) an acoustic iVector system (trained on the MFCC-SDC feature set) and (2) a phonotactic (Phone-lattice-SVM) system, using two different benchmarks: the NIST 2007 and 2009 LRE datasets. iVector systems trained on PLLR features proved to be competitive, reaching or even outperforming the MFCC-SDC-based iVector and the phonotactic systems. The fusion of the proposed approach with the acoustic and phonotactic systems provided even more significant improvements, outperforming state-of-the-art systems on both benchmarks.
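
Per frame, a PLLR is the log-likelihood ratio of each phone posterior, which reduces to the log-odds log(p / (1 - p)). A minimal sketch under that definition, with a synthetic posteriorgram and a simple frame difference standing in for the paper's exact dynamic coefficients:

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """Phone Log-Likelihood Ratios: per frame, the log-odds of each
    phone posterior, log(p / (1 - p))."""
    p = np.clip(posteriors, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

# Synthetic posteriorgram: 100 frames x 30 phone units
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(30), size=100)
features = pllr(post)
# Dynamic coefficients, here a symmetric frame difference:
deltas = np.gradient(features, axis=0)
print(features.shape, deltas.shape)  # (100, 30) (100, 30)
```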


IEEE Odyssey: The Speaker and Language Recognition Workshop | 2006

Feature Selection Based on Genetic Algorithms for Speaker Recognition

Maider Zamalloa; Germán Bordel; Luis Javier Rodríguez; Mikel Penagarikano

The Mel-frequency cepstral coefficients (MFCC) and their derivatives are commonly used as acoustic features for speaker recognition. This raises the issue of whether some of those features are redundant or dependent on other features, since probably not all of them are equally relevant for speaker recognition. Reduced feature sets allow more robust estimates of the model parameters. Also, fewer computational resources are required, which is crucial for real-time speaker recognition applications on low-resource devices. In this paper, we use feature weighting as an intermediate step towards feature selection. Genetic algorithms are used to find the optimal set of weights for a 38-dimensional feature set, consisting of 12 MFCC, their first and second derivatives, energy and its first derivative. To evaluate each set of weights, speaker recognition errors are counted over a validation dataset. Speaker models are based on empirical distributions of acoustic labels, obtained through vector quantization. On average, weighting acoustic features yields between 15% and 25% error reduction in speaker recognition tests. Finally, features are sorted according to their weights, and the K features with the greatest average ranks are retained and evaluated. We conclude that combining feature weighting and feature selection allows costs to be reduced without degrading performance.
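
A toy sketch of the GA-driven weighting loop described above. The fitness function here is a stand-in (the paper counts speaker recognition errors on a validation set), and all parameters are illustrative rather than the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 38  # 12 MFCC + 12 deltas + 12 delta-deltas + energy + delta energy

def fitness(weights):
    """Stand-in fitness: rewards proximity to a made-up target weighting
    so the sketch runs end to end; the real fitness would be the negated
    speaker recognition error on validation data."""
    target = np.linspace(1.0, 0.0, DIM)
    return -np.sum((weights - target) ** 2)

def evolve(pop_size=30, generations=50, mutation=0.1):
    pop = rng.random((pop_size, DIM))
    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        mates = np.roll(parents, 1, axis=0)
        cuts = rng.integers(1, DIM, size=len(parents))            # one-point crossover
        children = np.array([np.concatenate([a[:c], b[c:]])
                             for a, b, c in zip(parents, mates, cuts)])
        children += rng.normal(0.0, mutation, children.shape)     # Gaussian mutation
        pop = np.clip(np.vstack([parents, children]), 0.0, 1.0)
    return pop[np.argmax([fitness(w) for w in pop])]

best = evolve()
top_k = np.sort(np.argsort(best)[::-1][:20])  # keep the 20 highest-weighted features
print("selected feature indices:", top_k)
```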


IEEE Automatic Speech Recognition and Understanding Workshop | 2005

Sautrela: a highly modular open source speech recognition framework

Mikel Penagarikano; Germán Bordel

This paper describes the Sautrela system (www.sautrela.org), a highly modular and pluggable open source framework for general-purpose signal processing, focused on speech recognition. The aim of Sautrela is to unify in a single framework almost all the tasks related to pattern recognition, such as signal processing, model training and decoding. The framework has been developed using Java technology, which ensures its portability to a large variety of computer platforms.
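
Sautrela itself is written in Java; the following Python sketch only illustrates the pluggable-pipeline idea the abstract describes, with hypothetical stages standing in for real signal-processing components:

```python
from typing import Callable, Iterable

# A stage consumes a stream of items and yields a transformed stream;
# chaining arbitrary stages is the essence of a pluggable pipeline.
Stage = Callable[[Iterable], Iterable]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages into a single stream processor."""
    def run(stream: Iterable) -> Iterable:
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

def frame(samples, size=4):
    """Hypothetical framing stage: group samples into fixed-size frames."""
    chunk = []
    for s in samples:
        chunk.append(s)
        if len(chunk) == size:
            yield chunk
            chunk = []

def energy(frames):
    """Hypothetical feature stage: per-frame energy."""
    for f in frames:
        yield sum(x * x for x in f)

front_end = pipeline(frame, energy)
print(list(front_end(range(16))))  # [14, 126, 366, 734]
```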


IEEE Transactions on Audio, Speech, and Language Processing | 2011

Improved Modeling of Cross-Decoder Phone Co-Occurrences in SVM-Based Phonotactic Language Recognition

Mikel Penagarikano; Amparo Varona; Luis Javier Rodriguez-Fuentes; Germán Bordel

Most common approaches to phonotactic language recognition deal with several independent phone decodings. These decodings are processed and scored in a fully uncoupled way, so their time alignment (and the information that may be extracted from it) is completely lost. Recently, we presented two new approaches to phonotactic language recognition which take time alignment information into account by considering time-synchronous cross-decoder phone co-occurrences. Experiments on the 2007 NIST LRE database demonstrated that using phone co-occurrence statistics could improve the performance of baseline phonotactic recognizers. In this paper, approaches based on time-synchronous cross-decoder phone co-occurrences are further developed and evaluated against a baseline SVM-based phonotactic system, by using: 1) counts of n-grams (up to 4-grams) of phone co-occurrences; and 2) the degree of co-occurrence of phone n-grams (up to 4-grams). To evaluate these approaches, a choice of open software (Brno University of Technology phone decoders, LIBLINEAR and FoCal) was used, and experiments were carried out on the 2007 NIST LRE database. The two approaches presented in this paper outperformed the baseline phonotactic system, yielding around 7% relative improvement in terms of CLLR. The fusion of the baseline system with the two proposed approaches yielded 1.83% EER and CLLR = 0.270 (an 18% relative improvement), the same performance (on the same task) as state-of-the-art phonotactic systems which apply more complex models and techniques, thus supporting the use of cross-decoder dependencies for language recognition.
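
A minimal sketch of how time-synchronous co-occurrence units and their n-gram counts could be built, assuming frame-level phone labels from two hypothetical decoders. The helper names are illustrative, and the paper's exact unit definition and SVM back end are not reproduced:

```python
from collections import Counter
from itertools import groupby

def cooccurrence_units(decodings):
    """Time-synchronous cross-decoder units: at each frame, the tuple of
    phone labels hypothesized by the different decoders; consecutive
    repeats are collapsed into a single unit."""
    frames = list(zip(*decodings))                 # one label tuple per frame
    return [unit for unit, _ in groupby(frames)]   # collapse repeats

def ngram_counts(units, max_n=4):
    """Counts of n-grams (up to max_n) over the co-occurrence sequence,
    the kind of statistics fed to an SVM."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts

# Toy frame-level outputs of two decoders (hypothetical label sets):
dec_a = ["sil", "sil", "a", "a", "t", "t", "sil"]
dec_b = ["sp", "sp", "ah", "ah", "d", "d", "sp"]
units = cooccurrence_units([dec_a, dec_b])
print(units)
print(ngram_counts(units, max_n=2).most_common(3))
```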


International Conference on Acoustics, Speech, and Signal Processing | 2015

QUESST2014: Evaluating Query-by-Example Speech Search in a zero-resource setting with real-life queries

Xavier Anguera; Luis Javier Rodriguez-Fuentes; Andi Buzo; Florian Metze; Igor Szöke; Mikel Penagarikano

In this paper, we present the task and describe the main findings of the 2014 “Query-by-Example Speech Search Task” (QUESST) evaluation. The purpose of QUESST was to perform language independent search of spoken queries on spoken documents, while targeting languages or acoustic conditions for which very few speech resources are available. This evaluation investigated for the first time the performance of query-by-example search against morphological and morpho-syntactic variability, requiring participants to match variants of a spoken query in several languages of different morphological complexity. Another novelty is the use of the normalized cross entropy cost (Cnxe) as the primary performance metric, keeping Term-Weighted Value (TWV) as a secondary metric for comparison with previous evaluations. After analyzing the most competitive submissions (by five teams), we find that, although low-level “pattern matching” approaches provide the best performance for “exact” matches, “symbolic” approaches working on higher-level representations seem to perform better in more complex settings, such as matching morphological variants. Finally, optimizing the output scores for Cnxe seems to generate systems that are more robust to differences in the operating point and that also perform well in terms of TWV, whereas the opposite might not always be true.
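
For reference, a sketch of a normalized cross entropy cost on hypothetical calibrated log-likelihood-ratio scores: the empirical cross-entropy of the posteriors induced by the scores is divided by the entropy of a prior-only system, so a value of 1 means the scores carry no information beyond the prior. Details of the official QUESST scoring (priors, trial weighting) are not reproduced:

```python
import numpy as np

def cnxe(llr_target, llr_nontarget, prior=0.5):
    """Normalized cross entropy cost of log-likelihood-ratio scores:
    empirical cross-entropy over target and non-target trials, divided
    by the binary entropy of the prior."""
    logit_p = np.log(prior / (1.0 - prior))
    cxe = (prior * np.mean(np.log2(1.0 + np.exp(-(llr_target + logit_p))))
           + (1.0 - prior) * np.mean(np.log2(1.0 + np.exp(llr_nontarget + logit_p))))
    prior_entropy = -prior * np.log2(prior) - (1.0 - prior) * np.log2(1.0 - prior)
    return cxe / prior_entropy

# Hypothetical calibrated scores: targets mostly positive, non-targets negative
rng = np.random.default_rng(0)
print(f"Cnxe ~ {cnxe(rng.normal(4, 2, 500), rng.normal(-4, 2, 5000)):.3f}")
```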


Iberian Conference on Pattern Recognition and Image Analysis | 2007

A Simple But Effective Approach to Speaker Tracking in Broadcast News

Luis Javier Rodríguez; Mikel Penagarikano; Germán Bordel

The automatic transcription of broadcast news and meetings involves the segmentation, identification and tracking of speaker turns during each session, which is known as speaker diarization. This paper presents a simple but effective approach to a slightly different task, called speaker tracking, which also involves audio segmentation and speaker identification, but with a subset of known speakers, which makes it possible to estimate speaker models and to perform identification on a segment-by-segment basis. The proposed algorithm segments the audio signal in a fully unsupervised way, by locating the most likely change points from a purely acoustic point of view. Then the available speaker data are used to estimate single-Gaussian acoustic models. Finally, the speaker models are used to classify the audio segments by choosing the most likely speaker or, alternatively, the “Other” category, if none of the speakers is likely enough. Despite its simplicity, the proposed approach yielded the best performance in the speaker tracking challenge organized in November 2006 by the Spanish Network on Speech Technology.
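
A minimal sketch of the segment classification step, assuming single-Gaussian (diagonal-covariance) speaker models and a rejection threshold for the “Other” category; the change-point detection step is omitted, and the data, dimensions and threshold are all hypothetical:

```python
import numpy as np

class DiagGaussian:
    """Single-Gaussian speaker model with diagonal covariance,
    estimated from that speaker's enrollment frames."""
    def __init__(self, frames):
        self.mean = frames.mean(axis=0)
        self.var = frames.var(axis=0) + 1e-6
    def avg_loglik(self, frames):
        z = (frames - self.mean) ** 2 / self.var
        ll = -0.5 * (np.log(2 * np.pi * self.var) + z).sum(axis=1)
        return ll.mean()

def classify_segment(segment, models, other_threshold):
    """Pick the most likely known speaker, or 'Other' if no model is
    likely enough (the rejection step described in the abstract)."""
    scores = {name: m.avg_loglik(segment) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= other_threshold else "Other"

# Toy two-speaker setup with hypothetical 13-dimensional frames:
rng = np.random.default_rng(0)
models = {"spk1": DiagGaussian(rng.normal(0, 1, (500, 13))),
          "spk2": DiagGaussian(rng.normal(3, 1, (500, 13)))}
segment = rng.normal(3, 1, (80, 13))  # drawn near spk2's distribution
print(classify_segment(segment, models, other_threshold=-25.0))
```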


International Conference on Acoustics, Speech, and Signal Processing | 2011

A dynamic approach to the selection of high order n-grams in phonotactic language recognition

Mikel Penagarikano; Amparo Varona; Luis Javier Rodriguez-Fuentes; Germán Bordel

Due to computational bounds, most SVM-based phonotactic language recognition systems consider only low-order n-grams (up to n = 3), thus limiting the potential performance of this approach. The huge number of n-grams for n ≥ 4 makes even selecting the most frequent n-grams computationally unfeasible. In this paper, we demonstrate the feasibility and usefulness of using high-order n-grams (n = 4, 5, 6, 7) in SVM-based phonotactic language recognition, thanks to a dynamic n-gram selection algorithm. The most frequent n-grams are selected, but computational issues (namely, memory requirements) are avoided, since counts are periodically updated and only those units with the highest counts are retained for subsequent processing. Systems were built by means of open software (Brno University of Technology phone decoders, HTK, LIBLINEAR and FoCal) and experiments were carried out on the NIST LRE2007 database. Applying the proposed approach, a 1.36% EER was achieved when using up to 4-grams, 1.32% EER when using up to 5-grams (an 11.2% improvement with regard to using up to 3-grams) and 1.34% EER when using up to 6-grams or 7-grams.
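
A simplified sketch of dynamic n-gram selection with bounded memory: counts are pruned periodically so only the most frequent units survive. This naive version is only an illustration of the idea, not the paper's algorithm, and early pruning can permanently drop units that would have become frequent later:

```python
import random
from collections import Counter

def dynamic_topk_ngrams(phone_stream, n=4, k=1000, prune_every=50000):
    """Count n-grams over a phone stream with bounded memory: whenever
    prune_every new n-grams have been seen, discard all but the k most
    frequent units."""
    counts, window, seen = Counter(), [], 0
    for phone in phone_stream:
        window.append(phone)
        if len(window) > n:
            window.pop(0)
        if len(window) == n:
            counts[tuple(window)] += 1
            seen += 1
            if seen % prune_every == 0:
                counts = Counter(dict(counts.most_common(k)))
    return Counter(dict(counts.most_common(k)))

# Toy usage on a hypothetical decoded phone sequence:
random.seed(0)
stream = (random.choice("abcdefgh") for _ in range(200000))
print(dynamic_topk_ngrams(stream, n=4, k=100).most_common(3))
```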


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Multi-site heterogeneous system fusions for the Albayzin 2010 Language Recognition Evaluation

Luis Javier Rodriguez-Fuentes; Mikel Penagarikano; Amparo Varona; Mireia Diez; Germán Bordel; David Martinez; Jesús Villalba; Antonio Miguel; Alfonso Ortega; Eduardo Lleida; Alberto Abad; Oscar Koller; Isabel Trancoso; Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo; Rahim Saeidi; Mehdi Soufifar; Tomi Kinnunen; Torbjørn Svendsen; Pasi Fränti

Best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resource for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
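
Score-level fusion of heterogeneous systems is often realized as linear logistic regression, the approach implemented by the FoCal toolkit cited in other abstracts on this page. A minimal sketch on synthetic development scores (the data and system count are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical development data: each row holds the scores that three
# heterogeneous systems assign to one trial; label 1 = target language.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 2000)
scores = labels[:, None] * 2.0 + rng.normal(0.0, 1.0, (2000, 3))

# Learn one weight per system plus an offset on development trials,
# then apply the same affine map to evaluation scores.
fuser = LogisticRegression()
fuser.fit(scores, labels)
fused = fuser.decision_function(scores)  # fused log-odds per trial
print("per-system fusion weights:", np.round(fuser.coef_.ravel(), 2))
```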

Collaboration


Dive into Mikel Penagarikano's collaboration.

Top Co-Authors

Germán Bordel (University of the Basque Country)
Amparo Varona (University of the Basque Country)
Mireia Diez (University of the Basque Country)
Maider Zamalloa (University of the Basque Country)
Ekaitz Zulueta (University of the Basque Country)
Luis Javier Rodríguez (University of Castilla–La Mancha)
Aitzol Ezeiza (University of the Basque Country)