Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Paula Lopez-Otero is active.

Publications


Featured research published by Paula Lopez-Otero.


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Multi-site heterogeneous system fusions for the Albayzin 2010 Language Recognition Evaluation

Luis Javier Rodriguez-Fuentes; Mikel Penagarikano; Amparo Varona; Mireia Diez; Germán Bordel; David Martinez; Jesús Villalba; Antonio Miguel; Alfonso Ortega; Eduardo Lleida; Alberto Abad; Oscar Koller; Isabel Trancoso; Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo; Rahim Saeidi; Mehdi Soufifar; Tomi Kinnunen; Torbjørn Svendsen; Pasi Fränti

Best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resource for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
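
The score-level fusion described above is typically realized as a learned linear combination of per-system scores. Below is a minimal sketch using logistic regression as the fuser; the number of systems, the separation values and the trial scores are synthetic placeholders, not data or tooling from the evaluation itself.

```python
# Minimal sketch of score-level fusion for language recognition.
# Assumptions: per-system scores are available as a (trials x systems)
# array; a logistic-regression fuser stands in for dedicated calibration
# tools, and all values below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)               # 1 = target-language trial
shift = np.array([1.0, 0.8, 1.2])              # per-system class separation
scores = rng.normal(size=(200, 3)) + labels[:, None] * shift

fuser = LogisticRegression()
fuser.fit(scores, labels)                      # learn fusion weights w, b
fused = fuser.decision_function(scores)        # fused log-odds per trial
print(fuser.coef_, fused[:3])
```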


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Phonetic unit selection for cross-lingual query-by-example spoken term detection

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Cross-lingual query-by-example spoken term detection (QbE STD) has caught the attention of speech researchers, as it makes it possible to develop systems for low-resource languages, for which the scarcity of labelled data makes training automatic speech recognition systems prohibitive. The use of phonetic posteriorgrams for speech representation, combined with dynamic time warping search, is a widely used approach for this task, but little attention has been paid to the suitability of a set of phonetic units for representing speech spoken in a different language. This paper proposes a technique for estimating the relevance of phonetic units, aiming at selecting the most suitable ones for a given target language. Experiments on a Spanish database using phoneme posteriorgrams in Czech, English, Hungarian and Russian proved the validity of the proposed method, as QbE STD performance was enhanced by reducing the set of phonetic units.
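
The posteriorgram-plus-DTW pipeline mentioned above can be sketched as follows. Dropping posteriorgram columns mimics the unit-selection idea, although the paper's actual relevance estimator is not reproduced here, and the posteriorgrams are random placeholders rather than phoneme recognizer output.

```python
# Minimal sketch of posteriorgram-based DTW matching for QbE STD.
# Assumptions: query/utterance phoneme posteriorgrams are available as
# (frames x units) arrays; `keep` is a hypothetical selected-unit set.
import numpy as np

def dtw_cost(query, utt):
    """Plain DTW over -log inner-product local distances."""
    d = -np.log(np.maximum(query @ utt.T, 1e-10))   # (Tq, Tu) local costs
    acc = np.full(d.shape, np.inf)
    acc[0, 0] = d[0, 0]
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            if i == 0 and j == 0:
                continue
            best_prev = min(acc[i - 1, j] if i else np.inf,
                            acc[i, j - 1] if j else np.inf,
                            acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = d[i, j] + best_prev
    return acc[-1, -1] / (d.shape[0] + d.shape[1])  # length-normalised cost

rng = np.random.default_rng(1)
keep = [0, 2, 3, 5]                                 # hypothetical selected units
query = rng.dirichlet(np.ones(8), size=20)[:, keep]
utt = rng.dirichlet(np.ones(8), size=50)[:, keep]
# Renormalise after discarding units so rows remain distributions.
query /= query.sum(axis=1, keepdims=True)
utt /= utt.sum(axis=1, keepdims=True)
print(dtw_cost(query, utt))
```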


International Conference on Acoustics, Speech, and Signal Processing | 2010

Novel strategies for reducing the false alarm rate in a speaker segmentation system

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we extend our earlier formulation for false alarm reduction in a typical state-of-the-art speaker segmentation system. Specifically, we present two novel strategies for reducing the false alarm rate with a minimal impact on the true speaker change detection rate. One of the new strategies rejects, given a discard probability, those changes that are suspected of being false alarms because of their low ΔBIC value; the other assumes that the occurrence of changes constitutes a Poisson process, so changes are discarded with a probability that follows a Poisson cumulative distribution function. Our experiments show the improvements obtained with each false alarm reduction approach using the Spanish Parliament Sessions defined for the 2006 TC-STAR Automatic Speech Recognition evaluation campaign.
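
One plausible reading of the Poisson-based strategy is sketched below: a candidate change is discarded with a probability given by the Poisson CDF of the number of changes already detected in a local window, so that implausibly crowded detections are more likely to be rejected. The rate, window length and detection times are invented placeholders, and this is an illustrative interpretation rather than the paper's exact formulation.

```python
# Minimal sketch of Poisson-CDF-based false alarm rejection.
# Assumptions: `lam` is a hypothetical expected change rate (changes/s),
# `change_times` are synthetic candidate detections; this illustrates
# the idea only, not the paper's exact rule.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
change_times = np.sort(rng.uniform(0, 120, 30))   # candidate changes (s)
lam, window = 0.05, 20.0                          # rate, window length (s)

kept = []
for t in change_times:
    k = np.sum((change_times >= t - window) & (change_times < t))
    p_discard = poisson.cdf(k, lam * window)      # more crowded -> likelier FA
    if rng.uniform() >= p_discard:
        kept.append(t)
print(len(change_times), "->", len(kept), "changes kept")
```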


International Convention on Information and Communication Technology, Electronics and Microelectronics | 2014

A study of acoustic features for the classification of depressed speech

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Soft biometrics comprises the biological traits that are not sufficient for person authentication but can help to narrow the search space. Evidence of mental health state can be considered a soft biometric, as it provides valuable information about the identity of an individual. Different approaches have been used for the automatic classification of speech as "depressed" or "non-depressed", but the differences in algorithms, features, databases and performance measures make it difficult to draw conclusions about which features and techniques are more suitable for this task. In this work, the performance of different acoustic features for the classification of depression in speech was studied in the framework of the audiovisual emotion challenge (AVEC 2013). To do so, the audio data were segmented and projected into a total variability subspace, and the projected data were used to estimate the depression level by cosine distance scoring and majority voting.
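
The final scoring step described above can be sketched as follows, assuming segment-level total-variability projections are already extracted (the i-vector extractor itself is not shown); the class-mean vectors stand in for trained "depressed"/"non-depressed" models and all values are synthetic.

```python
# Minimal sketch of cosine-distance scoring with majority voting over
# segments. Assumptions: segment projections and class means are
# precomputed; all arrays below are synthetic placeholders.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
mean_dep = rng.normal(size=100)        # hypothetical "depressed" model
mean_non = rng.normal(size=100)        # hypothetical "non-depressed" model
segments = rng.normal(size=(9, 100))   # projected segments of one recording

votes = [int(cosine(s, mean_dep) > cosine(s, mean_non)) for s in segments]
label = "depressed" if sum(votes) > len(votes) / 2 else "non-depressed"
print(votes, "->", label)
```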


2nd International Workshop on Biometrics and Forensics | 2014

A study of acoustic features for depression detection

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Clinical depression can be considered a soft biometric trait that can help to characterize an individual. This mood disorder can be involved in forensic psychological assessment, due to its relevance in different legal issues. The automatic detection of depressed speech has been the object of research in recent years, resulting in different algorithmic approaches and acoustic features. Due to the use of different algorithms, databases and performance measures, it is difficult to decide which ones are more suitable for this task. In this work, the performance of different acoustic features for depression detection was explored in a common framework. To do so, a depression estimation approach was used in which the audio data are segmented and projected into a total variability subspace, and the projected data are used to estimate the depression level by support vector regression. The data and evaluation metrics were those used in the audiovisual emotion challenge (AVEC 2013).
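
The regression step can be sketched as follows, assuming the total-variability projections are precomputed; the dimensions and depression scores are synthetic placeholders (the 0 to 63 range mirrors BDI-style scores but is not AVEC data), and the kernel choice is an assumption.

```python
# Minimal sketch of support vector regression over projected segments.
# Assumptions: features and labels below are synthetic placeholders;
# the linear kernel is a guess, not the paper's configuration.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X_train = rng.normal(size=(80, 100))          # segment projections
y_train = rng.uniform(0, 63, size=80)         # BDI-style depression scores
X_test = rng.normal(size=(10, 100))

reg = SVR(kernel="linear", C=1.0)
reg.fit(X_train, y_train)
print(reg.predict(X_test)[:3])                # estimated depression levels
```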


Computer Speech & Language | 2017

Reversible speaker de-identification using pre-trained transformation functions

Carmen Magariños; Paula Lopez-Otero; Laura Docio-Fernandez; Eduardo Rodriguez-Banga; Daniel Erro; Carmen García-Mateo

Speaker de-identification approaches must accomplish three main goals: universality, naturalness and reversibility. The main drawback of the traditional approach to speaker de-identification using voice conversion techniques is its lack of universality, since a parallel corpus between the input and target speakers is necessary to train the conversion parameters. It is possible to use a synthetic target to overcome this issue, but this harms the naturalness of the resulting de-identified speech. Hence, this paper proposes a technique in which a pool of pre-trained transformations between a set of speakers is used as follows: given a new user to de-identify, the most similar speaker in this set is chosen as the source speaker, and the speaker most dissimilar to the source speaker is chosen as the target speaker. Speaker similarity is measured using the i-vector paradigm, which is usually employed as an objective measure of speaker de-identification performance, leading to a system with high de-identification accuracy. The transformation method is based on frequency warping and amplitude scaling, in order to obtain natural-sounding speech while masking the identity of the speaker. In addition, compared to other voice conversion approaches, the proposed method is easily reversible. Experiments were conducted on the Albayzin database, and performance was evaluated in terms of objective and subjective measures. The results showed high success in de-identifying speech, as well as great naturalness of the transformed voices. In addition, when the transformation parameters are made available to a trusted holder, it is possible to invert the de-identification procedure, hence recovering the original speaker identity. The computational cost of the proposed approach is small, making it possible to produce de-identified speech in real time with a high level of naturalness.
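
The source/target selection step described in the abstract can be sketched directly, assuming i-vectors for the pool speakers and the new user are precomputed; the pool of pre-trained transformations itself is not modelled here and all vectors are random placeholders.

```python
# Minimal sketch of i-vector-based source/target speaker selection.
# Assumptions: i-vectors are precomputed; cosine similarity stands in
# for the paper's similarity measure; all data are synthetic.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(5)
pool = rng.normal(size=(10, 400))        # i-vectors of pool speakers
user = rng.normal(size=400)              # i-vector of speaker to de-identify

sims_user = [cosine(user, p) for p in pool]
source = int(np.argmax(sims_user))       # most similar pool speaker -> source

sims_src = [cosine(pool[source], p) for p in pool]
target = int(np.argmin(sims_src))        # most dissimilar to source -> target
print("source:", source, "target:", target)
```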


2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE) | 2016

Piecewise linear definition of transformation functions for speaker de-identification

Carmen Magariños; Paula Lopez-Otero; Laura Docio-Fernandez; Eduardo Rodríguez Banga; Carmen García-Mateo; Daniel Erro

The main drawback of speaker de-identification approaches using voice conversion techniques is the need for parallel corpora to train transformation functions between the source and target speakers. In this paper, a voice conversion approach that does not require training any parameters is proposed: it consists of manually defining frequency warping (FW) based transformations by using piecewise linear approximations. An analysis of the de-identification capabilities of the proposed approach, using FW alone or combined with FW modification and spectral amplitude scaling (AS), was performed. Experimental results show that, with the manually defined transformations using FW alone, it is not possible to obtain de-identified natural-sounding speech. Nevertheless, when modifying the FW, both de-identification accuracy and naturalness increase to a great extent. A slight improvement in de-identification was also obtained when applying spectral amplitude scaling.
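
A piecewise linear frequency warping of this kind is easy to define and, because it is monotonic, can be inverted exactly by swapping its breakpoints, which is also what makes the approach of the previous paper reversible. Below is a minimal sketch; the breakpoints, the toy spectral envelope and the single-gain amplitude scaling are invented placeholders, not values from the paper.

```python
# Minimal sketch of a reversible piecewise-linear frequency warping.
# Assumptions: breakpoints and envelope are toy placeholders; amplitude
# scaling is reduced to a single gain for illustration.
import numpy as np

src_hz = np.array([0, 1000, 3000, 8000.0])    # warp breakpoints (input axis)
tgt_hz = np.array([0, 1400, 2600, 8000.0])    # warped positions (output axis)

freqs = np.linspace(0, 8000, 257)             # analysis frequency grid
envelope = np.exp(-freqs / 2000.0)            # toy spectral envelope

def warp(env, a, b):
    """Resample the envelope on a piecewise-linearly warped axis a -> b."""
    return np.interp(np.interp(freqs, b, a), freqs, env)

gain = 0.8                                    # stand-in amplitude scaling
deid = gain * warp(envelope, src_hz, tgt_hz)          # de-identify
restored = warp(deid / gain, tgt_hz, src_hz)          # invert: swap axes
print(np.max(np.abs(restored - envelope)))            # small residual error
```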


Pattern Recognition Letters | 2015

Assessing speaker independence on a speech-based depression level estimation system

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Highlights:
- Depression can be considered a soft biometric trait related to psychological state.
- Speaker dependence of an iVector-based depression level estimation system is assessed.
- System performance is much better when the test speaker is in the training set.
- Experimental frameworks must be carefully designed to avoid biasing the experiments.
- We introduce a new metric for assessing depression classification systems.

Soft biometrics refers to traits that provide valuable information about an individual without being sufficient for their authentication, as they lack uniqueness and distinctiveness. This definition includes features related to the psychological state of individuals, such as emotions or mental health disorders like depression. Depression has recently been attracting the attention of speech researchers, with the audio/visual emotion challenges (AVEC) 2013 and 2014 organized to encourage researchers to develop approaches to accurately estimate speaker depression level. The evaluation frameworks provided for these evaluations do not take speaker independence into account in the experiment design, despite this being an important factor in developing a robust speech-based system. We assess the influence of prior knowledge of the speakers in a depression estimation experiment, using an iVector-based state-of-the-art approach to depression level estimation to perform a speaker-dependent experiment and a speaker-independent experiment. We conclude that having previous information about the depression level of a given speaker dramatically improves system performance. Hence, we suggest that experimental frameworks must be carefully designed in order to serve as a genuinely useful resource for the development of robust depression estimation systems.
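
The speaker-independence condition the paper argues for amounts to keeping test speakers out of the training set. A minimal sketch of the contrast, using scikit-learn's KFold versus GroupKFold with synthetic recordings and speaker ids (all placeholders):

```python
# Minimal sketch contrasting speaker-dependent and speaker-independent
# partitions. Assumptions: one speaker id per recording; data synthetic.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(6)
speakers = np.repeat(np.arange(5), 4)      # 5 speakers, 4 recordings each
X = rng.normal(size=(20, 3))

dep = KFold(n_splits=4, shuffle=True, random_state=0).split(X)
indep = GroupKFold(n_splits=4).split(X, groups=speakers)

for name, splitter in [("dependent", dep), ("independent", indep)]:
    tr, te = next(iter(splitter))
    overlap = set(speakers[tr]) & set(speakers[te])
    print(name, "shared speakers:", sorted(int(s) for s in overlap))
```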


Speech Communication | 2016

Finding relevant features for zero-resource query-by-example search on speech

Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo

Zero-resource query-by-example search on speech strategies have attracted the interest of the research community, as they do not require training (and therefore large amounts of training data) or any knowledge of the language to be processed or of any other language. These systems usually rely on Mel-frequency cepstral coefficients (MFCCs) for speech representation and on dynamic time warping (DTW), or any of its variants, for performing the search. Nevertheless, which features to use in this task is still an open research problem, and the use of large feature sets combined with feature selection approaches had not yet been addressed in the query-by-example search on speech scenario. In this paper, we present two methods for selecting the most relevant features from a large set of acoustic features; they estimate the relevance of each feature using the costs of the best alignment path (obtained when performing DTW) and of its neighbouring region. To prove the validity of these methods, experiments were carried out in four different search on speech scenarios used in international benchmarks, namely the Albayzin 2014 search on speech evaluation, MediaEval spoken web search SWS 2013, and MediaEval query-by-example search on speech QUESST 2014 and QUESST 2015. Experimental results showed a dramatic improvement when reducing the feature set using the proposed techniques, especially in the case of the relevance-based approaches. A comparison between the proposed methods and other representations, such as MFCCs, phonetic posteriorgrams and dimensionality reduction based on principal component analysis, showed that the zero-resource approaches presented in this paper are promising, as they outperformed more established approaches in all the experimental scenarios. The feature relevance estimation approaches, apart from improving search on speech results, also revealed features other than MFCCs that appear to add value in query-by-example tasks.
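
A simplified reading of the relevance idea is sketched below: a feature is taken as relevant when its cost along the best alignment path is low compared to its cost in the surrounding region. The diagonal stand-in path, the synthetic features and the off-path offset are all placeholders; the paper's actual estimators are not reproduced.

```python
# Minimal sketch of per-feature relevance from a DTW alignment.
# Assumptions: the best path is computed elsewhere (a diagonal stands
# in for it here); features are synthetic, with 4 informative dims.
import numpy as np

rng = np.random.default_rng(7)
T, F = 40, 12
query, utt = rng.normal(size=(T, F)), rng.normal(size=(T, F))
utt[:, :4] = query[:, :4] + 0.1 * rng.normal(size=(T, 4))  # informative dims

path = [(i, i) for i in range(T)]                # stand-in best alignment path
on_path = np.array([(query[i] - utt[j]) ** 2 for i, j in path]).mean(axis=0)

# Neighbouring region: the same frames, shifted a few steps off the path.
off = np.array([(query[i] - utt[min(j + 3, T - 1)]) ** 2
                for i, j in path]).mean(axis=0)

relevance = off - on_path                        # high -> feature aids alignment
top = np.argsort(relevance)[::-1][:4]
print("most relevant features:", np.sort(top))   # expect the informative dims
```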


Content-Based Multimedia Indexing | 2017

Towards large scale multimedia indexing: A case study on person discovery in broadcast news

Nam Le; Hervé Bredin; Gabriel Sargent; Miquel India; Paula Lopez-Otero; Claude Barras; Camille Guinaudeau; Guillaume Gravier; Gabriel Barbosa da Fonseca; Izabela Lyon Freire; Zenilton Kleber Gonçalves do Patrocínio; Silvio Jamil Ferzoli Guimarães; Gerard Martí; Josep Ramon Morros; Javier Hernando; Laura Docio-Fernandez; Carmen García-Mateo; Sylvain Meignier; Jean-Marc Odobez

The rapid growth of multimedia databases and the human interest in their peers make indices representing the location and identity of people in audio-visual documents essential for searching archives. Person discovery in the absence of prior identity knowledge requires accurate association of audio-visual cues and detected names. To this end, we present three different strategies to approach this problem: clustering-based naming, verification-based naming, and graph-based naming. Each of these strategies utilizes different recent advances in unsupervised face/speech representation, verification, and optimization. To provide a better understanding of the approaches, this paper also presents a quantitative and qualitative comparative study of them using the associated corpus of the Person Discovery challenge at MediaEval 2016. From the results of our experiments, we can observe the pros and cons of each approach, thus paving the way for promising future research directions.
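
Of the three strategies, clustering-based naming is the simplest to illustrate: cluster person embeddings, then propagate detected names (for example, from on-screen text) to the clusters that contain them. The sketch below is a toy stand-in for the challenge pipelines, not any team's system; the embeddings, shot indices and names are invented.

```python
# Minimal sketch of clustering-based naming for person discovery.
# Assumptions: audio-visual person embeddings per shot and OCR-detected
# names for a few shots; all data below are synthetic placeholders.
import numpy as np
from collections import Counter
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(8)
emb = np.vstack([rng.normal(0, 0.1, (5, 16)),           # person A shots
                 rng.normal(3, 0.1, (5, 16))])          # person B shots
ocr_names = {0: "Anne Smith", 7: "Bob Jones"}           # shots with name overlays

labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)
cluster_names = {}
for shot, name in ocr_names.items():
    cluster_names.setdefault(labels[shot], Counter())[name] += 1

for c, counts in cluster_names.items():                 # tag each cluster with
    print("cluster", c, "->", counts.most_common(1)[0][0])  # its top name
```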

Collaboration


Dive into Paula Lopez-Otero's collaborations.

Top Co-Authors

Daniel Erro

University of the Basque Country


Doroteo Torre Toledano

Autonomous University of Madrid


Javier Tejedor

Autonomous University of Madrid
