Publication


Featured research published by Alicia Lozano-Diez.


PLOS ONE | 2016

Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks

Ruben Zazo; Alicia Lozano-Diez; Javier Gonzalez-Dominguez; Doroteo Torre Toledano; Joaquin Gonzalez-Rodriguez

Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vectors and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (∼3s). In this contribution we present an open-source, end-to-end LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3s task) by up to 26%. This result is in line with previously published research that relied on proprietary LSTM implementations and huge computational resources, which made those results difficult to reproduce. Further, we extend those experiments to model unseen languages (out-of-set, OOS, modeling), which is crucial in real applications. Results show that an LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen OOS languages. Finally, we also analyze the effect of even more limited test data (from 2.25s down to 0.1s), showing that with as little as 0.5s an accuracy of over 50% can be achieved.
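A minimal sketch of the kind of end-to-end recurrent classifier described above: frame-level features go into a single LSTM and the final hidden state is mapped to one score per target language. This is an illustrative PyTorch example, not the authors' implementation; the feature dimensionality, hidden size and 8-language output layer are assumptions (the 8 outputs echo the 8-target-language, 3s task).

```python
import torch
import torch.nn as nn

class LSTMLanguageID(nn.Module):
    """End-to-end LSTM language classifier over frame-level features."""
    def __init__(self, n_feats=20, hidden=512, n_langs=8):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_langs)

    def forward(self, x):
        # x: (batch, frames, n_feats), e.g. acoustic features of a ~3 s utterance
        _, (h, _) = self.lstm(x)       # h holds the last hidden state per layer
        return self.out(h[-1])         # one logit per target language

model = LSTMLanguageID()
utterance = torch.randn(1, 300, 20)    # ~3 s at 100 frames/s (dummy features)
scores = model(utterance)              # shape (1, 8)
```

In the same spirit as the OOS modeling the abstract mentions, an additional output unit trained on pooled out-of-set languages could be appended to the classification layer.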


Odyssey 2016 | 2016

Analysis and Optimization of Bottleneck Features for Speaker Recognition

Alicia Lozano-Diez; Anna Silnova; Pavel Matejka; Ondrej Glembek; Oldrich Plchot; Jan Pesán; Lukas Burget; Joaquin Gonzalez-Rodriguez

Recently, Deep Neural Network (DNN) based bottleneck features have proved to be very effective in i-vector based speaker recognition. However, bottleneck feature extraction is usually fully optimized for the speech recognition task rather than for speaker recognition. In this paper, we explore whether DNNs that are suboptimal for speech recognition can provide better bottleneck features for speaker recognition. We experiment with different features, optimized for speech or speaker recognition, as input to the DNN. We also experiment with under-trained DNNs, where training was interrupted before full convergence of the speech recognition objective. Moreover, we analyze the effect of normalizing the features at the input and/or at the output of bottleneck feature extraction to see how it affects the final speaker recognition system performance. We evaluated the systems on the SRE'10, condition 5, female task. Results show that the best configuration of the DNN in terms of phone accuracy does not necessarily imply better performance of the final speaker recognition system. Finally, we compare the performance of bottleneck features and standard MFCC features in an i-vector/PLDA speaker recognition system. The best bottleneck features yield up to a 37% relative improvement in terms of EER.
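As a rough sketch of the pipeline described here, the network below is trained to classify phone targets (the ASR objective) but contains a narrow bottleneck layer whose activations are extracted as features for the downstream speaker recognition system. Layer sizes, the stacked-frame input dimensionality and the number of phone targets are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """DNN trained on phone targets (ASR objective) with a narrow bottleneck layer."""
    def __init__(self, n_in=440, hidden=1500, bn_dim=80, n_phones=3000):
        super().__init__()
        self.to_bn = nn.Sequential(
            nn.Linear(n_in, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bn_dim),              # the bottleneck layer
        )
        self.to_phones = nn.Sequential(
            nn.Sigmoid(), nn.Linear(bn_dim, n_phones),
        )

    def forward(self, x):                 # used during (possibly interrupted) training
        return self.to_phones(self.to_bn(x))

    def bottleneck_features(self, x):     # used at feature-extraction time
        with torch.no_grad():
            return self.to_bn(x)
```

The under-training explored in the paper then amounts to stopping the optimization of forward() early and extracting bottleneck_features() from the partially converged network.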


PLOS ONE | 2017

An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition

Alicia Lozano-Diez; Ruben Zazo; Doroteo Torre Toledano; Joaquin Gonzalez-Rodriguez

Language recognition systems based on bottleneck features have recently become the state of the art in this research field, as shown by their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN compresses information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame-level representation of the audio signal. This representation has proven useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies systematically analyzing how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill in this gap, analyzing language recognition results for different topologies of the DNN used to extract the bottleneck features, comparing them with each other and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
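The topology analysis can be pictured as a sweep over depth, hidden-layer width and bottleneck size, training one ASR network per configuration and evaluating its bottleneck features in the LID back-end. The sketch below only enumerates such configurations; the specific depths, widths and bottleneck dimensions are hypothetical, not the grid studied in the paper.

```python
import itertools
import torch.nn as nn

def build_bn_dnn(depth, width, bn_dim, n_in=440, n_phones=3000):
    """Assemble a phone-classification DNN whose last hidden layer is the bottleneck."""
    layers, d = [], n_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, bn_dim), nn.ReLU(), nn.Linear(bn_dim, n_phones)]
    return nn.Sequential(*layers)

# One model per topology; each would be trained for ASR, then its bottleneck
# activations fed to the language recognition system for comparison.
for depth, width, bn_dim in itertools.product([2, 4], [500, 1500], [40, 80]):
    model = build_bn_dnn(depth, width, bn_dim)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"depth={depth} width={width} bn={bn_dim}: {n_params / 1e6:.1f}M parameters")
```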


Odyssey 2016 | 2016

Evaluation of an LSTM-RNN System in Different NIST Language Recognition Frameworks

Ruben Zazo; Alicia Lozano-Diez; Joaquin Gonzalez-Rodriguez

Long Short-Term Memory recurrent neural networks (LSTM RNNs) provide outstanding performance in language identification (LID) due to their ability to model speech sequences. So far, published LSTM RNN solutions for LID have dealt with highly controlled scenarios, balanced datasets and limited channel variability. In this paper we evaluate an end-to-end LSTM LID system, comparing it against a classical i-vector system, in different environments based on data from the Language Recognition Evaluations (LRE) organized by NIST. In order to analyze its behavior, we train and test our system on a balanced and controlled subset of LRE09, on the development data of LRE15 and, finally, on the evaluation set of LRE15. Our results show that an end-to-end recurrent system clearly outperforms the reference i-vector system in a controlled environment, especially when dealing with short utterances. However, our deep learning approach is more sensitive to unbalanced datasets, channel variability and, especially, to the mismatch between development and test datasets.


PLOS ONE | 2018

Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT

Doroteo Torre Toledano; María Pilar Fernández-Gallego; Alicia Lozano-Diez

Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing for the last decades. These features were particularly well suited to the previous Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) state of the art in ASR: they produce highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality and for the limited computing resources of a decade ago. Currently most ASR systems use Deep Neural Networks (DNNs) instead of GMMs to model the acoustic features, which provides more flexibility in the definition of the features. In particular, acoustic features can be highly correlated and much larger in size, because DNNs are very powerful at processing high-dimensional inputs. Also, computing hardware has reached a level of evolution that makes computational cost in speech processing a less relevant issue. In this context we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling with DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, concatenating spectra computed with different time-frequency resolutions as well as post-processed and speaker-adapted features computed with different time-frequency resolutions. Our experiments show that using a multi-resolution speech representation tends to improve over the baseline single-resolution representation, which seems to confirm our main hypothesis. However, combining multi-resolution analysis with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
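A small illustration of the multi-resolution idea: compute several STFTs of the same signal with different window lengths (trading time resolution against frequency resolution) and stack them into one feature matrix for the DNN. This NumPy/SciPy sketch assumes a common 10 ms frame shift; the window lengths are illustrative, not the paper's exact analysis settings.

```python
import numpy as np
from scipy.signal import stft

def multi_resolution_spectra(signal, sr=16000, win_lengths_ms=(10, 25, 50)):
    """Concatenate log-magnitude spectra computed at several time-frequency resolutions."""
    hop = int(0.010 * sr)                      # common 10 ms frame shift
    feats = []
    for win_ms in win_lengths_ms:
        nperseg = int(win_ms * sr // 1000)     # window length in samples
        _, _, spec = stft(signal, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
        feats.append(np.log(np.abs(spec) + 1e-10))
    n_frames = min(f.shape[1] for f in feats)  # align frame counts across resolutions
    return np.vstack([f[:, :n_frames] for f in feats])

audio = np.random.randn(16000)                 # 1 s of dummy audio
features = multi_resolution_spectra(audio)     # (total frequency bins, n_frames)
```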


Odyssey 2018 The Speaker and Language Recognition Workshop | 2018

Analysis of BUT-PT Submission for NIST LRE 2017

Oldřich Plchot; Pavel Matějka; Ondřej Novotný; Sandro Cumani; Alicia Lozano-Diez; Josef Slavíček; Mireia Diez; Frantisek Grezl; Ondřej Glembek; Mounika Kamsali; Anna Silnova; Lukas Burget; Lucas Ondel; Santosh Kesiraju; Johan Rohdin

In this paper, we summarize our efforts in the NIST Language Recognition Evaluation (LRE) 2017, which resulted in systems providing very competitive, state-of-the-art performance. We provide both descriptions and analysis of the systems included in our submission. We explain our partitioning of the datasets provided by NIST for training and development, and we follow by describing the features, DNN models and classifiers used to produce the final systems. After covering the architecture of our submission, we concentrate on post-evaluation analysis. We compare different DNN bottleneck features, i-vector systems of different sizes and architectures, and different classifiers, and we present experimental results with data augmentation and with an improved architecture of the system based on DNN embeddings. We present the performance of the systems in the Fixed condition (where participants are required to use only predefined data sets) and, in addition to the official NIST LRE17 evaluation set, we also provide results on our internal development set, which can serve as a baseline for other researchers, since all training data are fixed and provided by NIST.


Entropy | 2018

Deconstructing Cross-Entropy for Probabilistic Binary Classifiers

Daniel Ramos; Javier Franco-Pedroso; Alicia Lozano-Diez; Joaquin Gonzalez-Rodriguez

In this work, we analyze the cross-entropy function, widely used in classifiers both as a performance measure and as an optimization objective. We contextualize cross-entropy in the light of Bayesian decision theory, the formal probabilistic framework for making decisions, and we thoroughly analyze its motivation, meaning and interpretation from an information-theoretical point of view. In this sense, this article presents several contributions. First, we explicitly analyze the contribution to cross-entropy of (i) prior knowledge and (ii) the value of the features in the form of a likelihood ratio. Second, we introduce a decomposition of cross-entropy into two components: discrimination and calibration. This decomposition enables the measurement of different performance aspects of a classifier in a more precise way, and it justifies previously reported strategies to obtain reliable probabilities by means of calibrating the output of a discriminating classifier. Third, we give different information-theoretical interpretations of cross-entropy, which can be useful in different application scenarios and are related to the concept of reference probabilities. Fourth, we present an analysis tool, the Empirical Cross-Entropy (ECE) plot, a compact representation of cross-entropy and its aforementioned decomposition. We show the power of ECE plots, as compared to other classical performance representations, in two diverse experimental examples: a speaker verification system, and a forensic case where some glass findings are present.
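As a concrete anchor for the first contribution, the snippet below computes the empirical cross-entropy of a binary classifier from its log-likelihood ratios and a prior, making the separate roles of prior knowledge and the likelihood ratio explicit. It is a sketch; the Gaussian toy scores are invented for illustration.

```python
import numpy as np

def empirical_cross_entropy(llr_tar, llr_non, prior):
    """Empirical cross-entropy (bits) of log-likelihood-ratio scores at a given target prior."""
    prior_log_odds = np.log(prior / (1 - prior))
    # Bayes' rule: posterior of the true class = sigmoid(LLR + prior log-odds)
    p_tar = 1 / (1 + np.exp(-(llr_tar + prior_log_odds)))   # targets: P(target | x)
    p_non = 1 / (1 + np.exp(+(llr_non + prior_log_odds)))   # non-targets: P(non-target | x)
    return (-prior * np.mean(np.log2(p_tar))
            - (1 - prior) * np.mean(np.log2(p_non)))

# Toy scores: targets positive, non-targets negative.
rng = np.random.default_rng(0)
tar = rng.normal(+2.0, 1.0, 1000)
non = rng.normal(-2.0, 1.0, 1000)
print(empirical_cross_entropy(tar, non, prior=0.5))  # well below 1 bit for a useful classifier
```

Sweeping this quantity over a range of priors, alongside its value after recalibrating the scores, is essentially what the ECE plot described above displays.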


International Conference on Advances in Speech and Language Technologies for Iberian Languages | 2016

Detection of Publicity Mentions in Broadcast Radio: Preliminary Results

María Pilar Fernández-Gallego; Álvaro Mesa-Castellanos; Alicia Lozano-Diez; Doroteo Torre Toledano

Advertising mentions are publicity content that is not prerecorded; they are usually spoken live by radio or TV broadcasters to publicize a product or a company. The main difficulty in detecting advertising mentions is that the audio is not repeated exactly every time, unlike conventional prerecorded advertising, where more efficient techniques such as audio fingerprinting can be used. This paper proposes the use of a keyword search system in Spanish for the detection of advertising mentions. To this end, it was necessary to train and evaluate a new large-vocabulary continuous speech recognizer (LVCSR) for Spanish using the Kaldi toolkit and the Fisher Spanish and Callhome Spanish databases. The best word error rate we obtained on conversational telephone speech is 41.10%. For the evaluation of mention detection, a specific Spanish database was created containing 300 hours of audio, 25 of which were tagged with different types of information, including the mentions appearing in the audio. The recognizer was applied to all advertising mentions to search for mention-specific keywords, achieving a detection rate of about 74%.
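The final keyword-search step can be pictured as scanning the ASR transcript of each audio segment for mention-specific keywords. The sketch below is a deliberately simplified, hypothetical version (exact string matching on recognized words); a real system would search lattices or confusion networks and handle recognition errors.

```python
def detect_mentions(transcripts, keywords):
    """Return the indices of ASR transcripts containing any mention-specific keyword."""
    keywords = [k.lower() for k in keywords]
    hits = []
    for i, text in enumerate(transcripts):
        words = set(text.lower().split())
        if any(k in words for k in keywords):
            hits.append(i)
    return hits

# Dummy recognizer output; the keyword is a hypothetical brand term.
transcripts = ["patrocinado por café ejemplo", "las noticias de la mañana"]
print(detect_mentions(transcripts, ["ejemplo"]))  # -> [0]
```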


Conference of the International Speech Communication Association (INTERSPEECH) | 2015

An end-to-end approach to language identification in short utterances using convolutional neural networks

Alicia Lozano-Diez; Rubén Zazo-Candil; Javier Gonzalez-Dominguez; Doroteo Torre Toledano; Joaquin Gonzalez-Rodriguez


IberSPEECH 2014: Proceedings of the Second International Conference on Advances in Speech and Language Technologies for Iberian Languages, Volume 8854 | 2014

On the Use of Convolutional Neural Networks in Pairwise Language Recognition

Alicia Lozano-Diez; Javier Gonzalez-Dominguez; Ruben Zazo; Daniel Ramos; Joaquin Gonzalez-Rodriguez

Collaboration


Top co-authors of Alicia Lozano-Diez:

Doroteo Torre Toledano (Autonomous University of Madrid)

Ruben Zazo (Autonomous University of Madrid)

Oldrich Plchot (Brno University of Technology)

Daniel Ramos (Autonomous University of Madrid)

Lukas Burget (Brno University of Technology)