Diego Castán
University of Zaragoza
Publications
Featured research published by Diego Castán.
conference of the international speech communication association | 2016
Mitchell McLaren; Luciana Ferrer; Diego Castán; Aaron Lawson
The Speakers in the Wild (SITW) speaker recognition database contains hand-annotated speech samples from open-source media for the purpose of benchmarking text-independent speaker recognition technology on single- and multi-speaker audio acquired across unconstrained or “wild” conditions. The database consists of recordings of 299 speakers, with an average of eight different sessions per person. Unlike existing databases for speaker recognition, this data was not collected under controlled conditions and thus contains real noise, reverberation, intra-speaker variability and compression artifacts. These factors are often convolved in the real world, as the SITW data shows, and they make SITW a challenging database for single- and multi-speaker recognition.
international conference on acoustics, speech, and signal processing | 2013
Julien van Hout; Murat Akbacak; Diego Castán; Eric Yeh; Michelle Hewlett Sanchez
Because of the popularity of online videos, there has been much interest in recent years in audio processing for improving online video search. In this paper, we explore using acoustic concepts and spoken concepts, extracted via audio segmentation/recognition and speech recognition respectively, for Multimedia Event Detection (MED). To extract spoken concepts, a segmenter trained on annotated data from user videos segments the audio into three classes: speech, music, and other sounds. The speech segments are passed to an Automatic Speech Recognition (ASR) engine, and words from the 1-best ASR output, as well as posterior-weighted word counts collected from ASR lattices, are used as features for an SVM-based classifier. Acoustic concepts are extracted using the 3-gram lattice counts of two Acoustic Concept Recognition (ACR) systems trained on 7 broad classes. MED results are reported on a subset of the NIST 2011 TRECVID data. We find that spoken concepts using lattices yield a 15% relative improvement in Average Pmiss (APM) over 1-best-based features. Further, the proposed spoken concepts give a 30% relative gain in APM over the ACR-based MED system using 7 classes. Lastly, we obtain an 8% relative APM improvement after score-level fusion of both concept types, showing the effective coupling of both approaches.
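As a rough illustration of the spoken-concept features described above, the following sketch builds posterior-weighted word-count vectors and trains an SVM event detector. The vocabulary, lattice counts, and labels are invented toy data, not taken from the paper's TRECVID setup.

```python
# Illustrative sketch: posterior-weighted word counts as features for an
# SVM-based event detector. The lattice counts below are invented toy data.
import numpy as np
from sklearn.svm import SVC

VOCAB = ["board", "trick", "wheel", "cake", "candle", "sing"]

def count_vector(lattice_counts):
    """Map {word: posterior-weighted count} to a fixed-length feature vector."""
    return np.array([lattice_counts.get(w, 0.0) for w in VOCAB])

# Toy training videos: posterior-weighted counts from (hypothetical) ASR lattices.
videos = [
    ({"board": 2.7, "trick": 1.3, "wheel": 0.8}, 1),  # skateboarding event
    ({"cake": 1.9, "candle": 1.1, "sing": 0.6}, 0),   # birthday party
    ({"board": 1.5, "wheel": 1.8}, 1),
    ({"cake": 2.2, "sing": 1.4}, 0),
]
X = np.stack([count_vector(c) for c, _ in videos])
y = np.array([label for _, label in videos])

clf = SVC(kernel="linear").fit(X, y)
test = count_vector({"trick": 0.9, "wheel": 1.1})
print("decision score:", clf.decision_function(test[None, :])[0])
```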
IberSPEECH | 2012
Diego Castán; Alfonso Ortega Giménez; Eduardo Lleida
This paper presents a study of a Factor Analysis (FA) segmentation and classification system. Our approach is inspired by language recognition systems, treating each input sequence as if it were a language. Following this idea, classic HMM/GMM-based segmentation systems are compared with FA on the output of a perfect segmentation system (oracle boundaries), showing that FA improves classification results over HMM/GMM. We also report the first experiments with an FA segmentation system under development, which suggest the need to improve channel compensation for some classes.
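For context, a minimal baseline of the kind the paper compares against might classify each oracle-bounded segment with per-class GMMs, picking the class with the highest average log-likelihood. The sketch below uses random stand-in features and arbitrary model sizes, purely for illustration.

```python
# Minimal GMM segment classifier of the kind used as the baseline above.
# Random features stand in for real acoustic frames (illustration only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["speech", "music"]

# Train one GMM per class on (fake) frame-level features.
gmms = {}
for i, name in enumerate(classes):
    frames = rng.normal(loc=i * 3.0, size=(500, 13))  # 13-dim MFCC-like frames
    gmms[name] = GaussianMixture(n_components=4, random_state=0).fit(frames)

def classify_segment(segment):
    """Assign the class whose GMM gives the highest mean log-likelihood."""
    scores = {name: g.score(segment) for name, g in gmms.items()}  # mean LL/frame
    return max(scores, key=scores.get)

test_segment = rng.normal(loc=3.0, size=(80, 13))  # one oracle-bounded segment
print(classify_segment(test_segment))  # expect "music" for loc=3.0
```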
international conference on acoustics, speech, and signal processing | 2013
Diego Castán; Alfonso Ortega; Jesús Villalba; Antonio Miguel; Eduardo Lleida
This paper proposes a novel audio segmentation-by-classification system based on Factor Analysis (FA) with a channel compensation matrix for each class, scoring fixed-length segments as the log-likelihood ratio between class and non-class. The scores are smoothed, and the most probable sequence is computed with a Viterbi algorithm. The system described here is designed to segment and classify audio files from broadcast programs into five classes: speech (SP), speech with noise (SN), speech with music (SM), music (MU), or others (OT). This task was proposed in the Albayzin 2010 evaluation campaign. The system is compared with the winning system of that evaluation, achieving lower error rates on SP and SN. These classes represent three quarters of the data, so the FA segmentation system achieves a reduction in the average segmentation error rate.
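A compressed sketch of the back-end decoding stage described above: per-segment class scores are smoothed with a moving average, then the most probable class sequence is decoded with a small Viterbi pass. The scores and the transition penalty are made-up values, not the paper's trained parameters.

```python
# Sketch of the decoding stage: smooth per-segment class scores, then find
# the most probable class sequence with Viterbi. All numbers are toy values.
import numpy as np

CLASSES = ["SP", "SN", "SM", "MU", "OT"]
SWITCH_PENALTY = 2.0  # log-domain cost for changing class between segments

def smooth(scores, win=3):
    """Moving-average smoothing of (num_segments, num_classes) scores."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"), 0, scores)

def viterbi(scores):
    """Most probable class sequence under a constant switching penalty."""
    n, k = scores.shape
    delta = scores[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        trans = delta[:, None] - SWITCH_PENALTY * (1 - np.eye(k))
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + scores[t]
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [CLASSES[i] for i in reversed(path)]

rng = np.random.default_rng(1)
llr = rng.normal(size=(10, len(CLASSES)))  # fake per-segment LLR scores
print(viterbi(smooth(llr)))
```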
conference of the international speech communication association | 2016
Mitchell McLaren; Diego Castán; Luciana Ferrer; Aaron Lawson
This article is concerned with the issue of calibration in the context of Deep Neural Network (DNN) based approaches to speaker recognition. DNNs have provided a new standard in technology when used in place of the traditional universal background model (UBM) for feature alignment, or to augment traditional features with those extracted from a bottleneck layer of the DNN. These techniques provide extremely good performance for constrained trial conditions that are well matched to development conditions. However, when applied to unseen conditions or a wide variety of conditions, some DNN-based techniques offer poor calibration performance. Through analysis on both PRISM and the recently released Speakers in the Wild (SITW) corpora, we illustrate that bottleneck features hinder calibration if used in the calculation of first-order Baum Welch statistics during i-vector extraction. We propose a hybrid alignment framework, which stems from our previous work in DNN senone alignment, that uses the bottleneck features only for the alignment of features during statistics calculation. This framework not only addresses the issue of calibration, but provides a more computationally efficient system based on bottleneck features with improved discriminative power.
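To make the hybrid-alignment idea concrete, the sketch below computes Baum-Welch statistics where frame posteriors come from one feature stream (the bottleneck features, used only for alignment) while the first-order statistics accumulate a second stream. The UBM and both feature matrices are simulated with arbitrary sizes; this is only an illustration of the statistics step, not the paper's full i-vector pipeline.

```python
# Hybrid-alignment sketch: posteriors from alignment features (bottleneck),
# first-order stats from a second feature stream. All inputs are simulated.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
align_feats = rng.normal(size=(300, 40))   # stand-in for bottleneck features
recog_feats = rng.normal(size=(300, 20))   # stand-in for MFCC-like features

# "UBM" trained on the alignment stream (toy-sized for the example).
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(align_feats)

# Frame-level component posteriors come from the alignment features only.
post = ubm.predict_proba(align_feats)          # (T, C)

# Zeroth- and first-order statistics over the *recognition* features.
N = post.sum(axis=0)                           # (C,) occupation counts
F = post.T @ recog_feats                       # (C, D) first-order stats

print("zeroth-order:", N.shape, "first-order:", F.shape)
```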
Odyssey 2016 | 2016
Mitchell McLaren; Diego Castán; Luciana Ferrer
We present the work done by our group for the 2015 language recognition evaluation (LRE) organized by the National Institute of Standards and Technology (NIST), along with an extended post-evaluation analysis. The focus of this evaluation was the development of language recognition systems for clusters of closely related languages using training data released by NIST. This training data contained a highly imbalanced sample from the languages of interest. The SRI team submitted several systems to LRE’15. Major components included (1) bottleneck features extracted from Deep Neural Networks (DNNs) trained to predict English senones, with multiple DNNs trained using a variety of acoustic features; (2) data-driven Discrete Cosine Transform (DCT) contextualization of features for traditional Universal Background Model (UBM) i-vector extraction and for input to a DNN for bottleneck feature extraction; (3) adaptive Gaussian backend scoring; (4) a newly developed multiresolution neural network backend; and (5) cluster-specific N-way fusion of scores. We compare results on our development dataset with those on the evaluation data and find significantly different conclusions about which techniques were useful for each dataset. This difference was due mostly to a large unexpected mismatch in acoustic and channel conditions between the two datasets. We provide a post-evaluation analysis revealing that the successful approaches for this evaluation included the use of bottleneck features, and a well-defined development dataset appropriate for mismatched conditions.
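The DCT contextualization in item (2) can be pictured as follows: for each feature dimension, a DCT is taken over a sliding window of frames and the first few coefficients are kept. The window length and coefficient count below are arbitrary example values, not the data-driven settings the paper describes.

```python
# Sketch of DCT contextualization: per dimension, DCT over a sliding window
# of frames, keeping the first K coefficients. Window/K are example values.
import numpy as np
from scipy.fft import dct

def dct_context(feats, win=11, keep=6):
    """feats: (T, D) frames -> (T - win + 1, D * keep) contextualized frames."""
    T, D = feats.shape
    out = []
    for t in range(T - win + 1):
        window = feats[t:t + win]                    # (win, D)
        coeffs = dct(window, axis=0, norm="ortho")   # DCT along time
        out.append(coeffs[:keep].T.ravel())          # first K coeffs per dim
    return np.stack(out)

frames = np.random.default_rng(0).normal(size=(100, 13))  # fake MFCC frames
print(dct_context(frames).shape)  # (90, 78)
```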
conference of the international speech communication association | 2018
Mahesh Kumar Nandwana; Mitchell McLaren; Diego Castán; Julien van Hout; Aaron Lawson
Deep neural network (DNN)-based speaker embeddings have resulted in new, state-of-the-art text-independent speaker recognition technology. However, very limited effort has been made to understand DNN speaker embeddings. In this study, we analyze the behavior of speaker recognition systems based on speaker embeddings with different front-end features, including the standard Mel-frequency cepstral coefficients (MFCC), power-normalized cepstral coefficients (PNCC), and perceptual linear prediction (PLP). Using a speaker recognition system based on DNN speaker embeddings and probabilistic linear discriminant analysis (PLDA), we compared different approaches to leveraging complementary information through score-, embedding-, and feature-level combination. We report results on the Speakers in the Wild (SITW) and NIST SRE 2016 datasets. We found that the first and second embedding layers are complementary in nature. By applying score- and embedding-level fusion, we demonstrate relative equal-error-rate improvements of 17% on NIST SRE 2016 and 10% on SITW over the baseline system.
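As a schematic of two of the combination strategies compared in the study, the snippet below contrasts score-level fusion (averaging per-system scores) with embedding-level fusion (concatenating embeddings before scoring). The embeddings are random placeholders and cosine scoring stands in for the PLDA back-end used in the paper.

```python
# Schematic of score- vs embedding-level fusion. Cosine scoring stands in
# for the PLDA back-end of the paper; embeddings are random placeholders.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Per-feature-stream embeddings for one enrollment/test trial (simulated).
enroll = {"mfcc": rng.normal(size=512), "pncc": rng.normal(size=512)}
test   = {"mfcc": rng.normal(size=512), "pncc": rng.normal(size=512)}

# Score-level fusion: score each system separately, then average.
score_fused = np.mean([cosine(enroll[k], test[k]) for k in enroll])

# Embedding-level fusion: concatenate streams, then score once.
emb_fused = cosine(np.concatenate([enroll["mfcc"], enroll["pncc"]]),
                   np.concatenate([test["mfcc"], test["pncc"]]))

print(f"score-level: {score_fused:.3f}  embedding-level: {emb_fused:.3f}")
```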
international conference on universal access in human-computer interaction | 2015
Paola García; Eduardo Lleida; Diego Castán; José Manuel Marcos; David Romero
We describe the design of a communicator for people of all ages with speech impairments, though it can also be used by anyone. The design is based on carefully defined user models and profiles from which we extracted technical goals and requirements. The current design identifies the factors to consider to enable successful communication between users. The system is prepared for use by children and elderly people with some kind of speech impairment. Moreover, the communicator spontaneously adapts to each user profile and is aware of the situation, summarized as location, time of day, and interlocutor. The vocabulary in use therefore relates to the particular situation and can be broadened by the user if needed. This “vocabulary” is not restricted to the word or syntactic domain but extends to pictograms and concepts. Several machine learning tools are employed for this purpose, such as word prediction, context-aware communication, and non-syntactic modeling. We present a prototype scenario that includes examples of usage by our target users.
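One of the machine-learning components mentioned, context-aware word prediction, can be illustrated with a context-conditioned bigram model. The vocabulary, context tags, and training triples below are invented for the example; the paper's system adapts its vocabulary to location, time of day, and interlocutor.

```python
# Toy context-aware bigram word predictor. Vocabulary and context tags are
# invented; the real system adapts vocabulary to location/time/interlocutor.
from collections import Counter, defaultdict

# (context, previous word, next word) training triples (made up).
history = [
    ("mealtime", "I", "want"), ("mealtime", "want", "water"),
    ("mealtime", "want", "bread"), ("school", "I", "want"),
    ("school", "want", "book"), ("mealtime", "want", "water"),
]

model = defaultdict(Counter)
for ctx, prev, nxt in history:
    model[(ctx, prev)][nxt] += 1

def predict(context, prev_word, k=2):
    """Top-k next-word suggestions for the given context."""
    return [w for w, _ in model[(context, prev_word)].most_common(k)]

print(predict("mealtime", "want"))  # ['water', 'bread']
```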
IberSPEECH 2014 Proceedings of the Second International Conference on Advances in Speech and Language Technologies for Iberian Languages - Volume 8854 | 2014
Diego Castán; Alfonso Ortega; Antonio Miguel; Eduardo Lleida
The classification of acoustic events is useful for describing the scene and can contribute to improving the robustness of different speech technologies. However, the events are usually overlapped with speech or other sounds. This work proposes an approach based on Factor Analysis to compensate for the variability of acoustic events due to overlap with speech. The system is evaluated on the CLEAR evaluation database, composed of recordings in meeting rooms where the acoustic events were spontaneously generated in five different locations. The experiments are divided into two sets. First, isolated acoustic events are used as a development set to analyze and evaluate the parameters of the Factor Analysis system. Second, the system is compared to a baseline based on Gaussian Mixture Models with Hidden Markov Models. The Factor Analysis approach reduces the total error rate thanks to the variability compensation of overlapped segments.
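The compensation idea can be sketched as projecting away a low-rank variability subspace: given supervectors of overlapped events, a nuisance subspace U learned from within-class variation is removed before classification. Note the sketch estimates U with a simple PCA rather than the paper's Factor Analysis procedure, and the data and subspace rank are illustrative.

```python
# Sketch of subspace-based variability compensation: learn a low-rank
# within-class variability subspace U and project it out of supervectors.
# PCA stands in for the paper's FA estimation; data/rank are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, rank = 64, 2

# Simulated supervectors: class mean + low-rank "overlap" nuisance + noise.
U_true = rng.normal(size=(D, rank))
class_mean = rng.normal(size=D)
X = class_mean + (rng.normal(size=(50, rank)) @ U_true.T) \
    + 0.05 * rng.normal(size=(50, D))

# Estimate the variability subspace via PCA on within-class deviations.
centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
U = Vt[:rank].T                                  # (D, rank) nuisance basis

# Compensation: remove the component of each deviation lying in span(U).
X_comp = X - centered @ U @ U.T

print("residual within-class std:", X_comp.std(axis=0).mean().round(3))
```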
Eurasip Journal on Audio, Speech, and Music Processing | 2014
Diego Castán; Alfonso Ortega; Antonio Miguel; Eduardo Lleida