Eva Navas
University of the Basque Country
Publications
Featured research published by Eva Navas.
IEEE Journal of Selected Topics in Signal Processing | 2014
Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez
This article explores the potential of the harmonics plus noise model of speech in the development of a high-quality vocoder applicable in statistical frameworks, particularly in modern speech synthesizers. It presents an extensive explanation of all the different alternatives considered during the design of the HNM-based vocoder, together with the corresponding objective and subjective experiments, and a careful description of its implementation details. Three aspects of the analysis have been investigated: refinement of the pitch estimation using quasi-harmonic analysis, study and comparison of several spectral envelope analysis procedures, and strategies to analyze and model the maximum voiced frequency. The performance of the resulting vocoder is shown to be similar to that of state-of-the-art vocoders in synthesis tasks.
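As a rough illustration of the harmonic-plus-noise decomposition described above, the sketch below synthesizes one voiced frame as a sum of harmonics of the pitch up to a maximum voiced frequency, plus high-pass noise above it. All function and parameter names are illustrative, and the noise shaping is a crude stand-in for the paper's actual analysis/synthesis procedure.

```python
import numpy as np

def hnm_synthesize_frame(f0, amps, phases, max_voiced_freq, fs, n):
    """One HNM frame: harmonics of f0 up to the maximum voiced
    frequency plus high-pass noise above it (illustrative sketch)."""
    t = np.arange(n) / fs
    frame = np.zeros(n)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        fk = k * f0
        if fk > max_voiced_freq:
            break
        frame += a * np.cos(2 * np.pi * fk * t + p)
    # stochastic component: white noise high-passed above the maximum
    # voiced frequency by masking FFT bins (a crude stand-in)
    spec = np.fft.rfft(np.random.randn(n))
    spec[np.fft.rfftfreq(n, 1 / fs) < max_voiced_freq] = 0
    return frame + 0.01 * np.fft.irfft(spec, n)
```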
IEEE Transactions on Multimedia | 2010
Iker Luengo; Eva Navas; Inmaculada Hernáez
The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.
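The late-fusion idea mentioned in the abstract can be sketched at the score level: each information source produces per-class scores, and a weighted combination decides the emotion. This is a hypothetical minimal version; the paper's actual fusion scheme and weighting may differ.

```python
import numpy as np

def late_fusion(scores_a, scores_b, w=0.5):
    """Weighted late fusion of two classifiers' per-class scores;
    returns the index of the winning class (hypothetical sketch)."""
    fused = w * np.asarray(scores_a, float) + (1 - w) * np.asarray(scores_b, float)
    return int(np.argmax(fused))
```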
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Daniel Erro; Eva Navas; Inma Hernaez
Voice conversion methods based on frequency warping followed by amplitude scaling have recently been proposed. These methods modify the frequency axis of the source spectrum in such a manner that some significant parts of it, usually the formants, are moved towards their image in the target speaker's spectrum. Amplitude scaling is then applied to compensate for the differences between warped source spectra and target spectra. This article presents a fully parametric formulation of a frequency warping plus amplitude scaling method in which bilinear frequency warping functions are used. Introducing this constraint allows the conversion error to be described in the cepstral domain and minimized with respect to the parameters of the transformation through an iterative algorithm, even when multiple overlapping conversion classes are considered. The paper explores the advantages and limitations of this approach when applied to a cepstral representation of speech. We show that it achieves significant improvements in quality with respect to traditional methods based on Gaussian mixture models, with no loss in average conversion accuracy. Despite its relative simplicity, it achieves performance scores similar to those of state-of-the-art statistical methods involving dynamic features and global variance.
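The bilinear frequency warping functions referenced above correspond to first-order all-pass transforms; the standard phase-response mapping is shown below. The function name is illustrative, but the formula itself is the well-known all-pass warping curve (alpha = 0 gives the identity mapping, and the endpoints 0 and pi stay fixed).

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """First-order all-pass (bilinear) frequency warping of normalized
    frequency omega in [0, pi]; alpha in (-1, 1), alpha = 0 is identity."""
    omega = np.asarray(omega, float)
    return omega + 2 * np.arctan(alpha * np.sin(omega) / (1 - alpha * np.cos(omega)))
```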
international conference on acoustics, speech, and signal processing | 2007
Iker Luengo; Ibon Saratxaga; Eva Navas; Inmaculada Hernáez; Jon Sanchez; Iñaki Sainz
A novel algorithm based on classical cepstrum calculation followed by dynamic programming is presented in this paper. The algorithm has been evaluated with a 60-minute database containing 60 speakers and different recording conditions and environments. A second reference database has also been used. In addition, the performance of four popular PDA algorithms has been evaluated with the same databases. The results demonstrate the good performance of the described algorithm in noisy conditions. Furthermore, the paper represents a first effort to evaluate widely used PDA algorithms over an extensive and realistic database.
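The classical cepstrum step of such a pitch detection algorithm can be sketched as follows; the dynamic-programming smoothing across frames, which is the paper's key contribution, is omitted here. Names and defaults are illustrative.

```python
import numpy as np

def cepstral_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Cepstrum-based pitch estimate for one voiced frame: the quefrency
    of the strongest real-cepstrum peak in the plausible period range."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / peak
```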
IEEE Transactions on Information Forensics and Security | 2015
Jon Sanchez; Ibon Saratxaga; Inma Hernaez; Eva Navas; Daniel Erro; Tuomo Raitio
In the field of speaker verification (SV), it is nowadays feasible and relatively easy to create a synthetic voice that deceives a speech-driven biometric access system. This paper presents a synthetic speech detector that can be connected at the front-end or at the back-end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical Text-to-Speech (TTS) systems. The system described is a Gaussian Mixture Model (GMM) based binary classifier that uses natural and copy-synthesized signals obtained from the Wall Street Journal database to train the system models. Three different state-of-the-art vocoders are chosen and modeled using two sets of acoustic parameters: 1) relative phase shift and 2) canonical Mel Frequency Cepstral Coefficient (MFCC) parameters, as a baseline. The vocoder dependency of the system and multi-vocoder modeling features are thoroughly studied. Additional phase-aware vocoders are also tested. Several experiments are carried out, showing that the phase-based parameters perform better and are able to cope with new unknown attacks. The final evaluations, testing synthetic TTS signals obtained from the Blizzard Challenge, validate our proposal.
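A GMM-based binary classifier of this kind typically scores an utterance by the average log-likelihood ratio between a "synthetic" model and a "natural" model. The sketch below assumes diagonal-covariance GMMs with pre-trained parameters; it illustrates only the scoring step, not the training or feature extraction described in the paper.

```python
import numpy as np

def gmm_avg_loglike(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (n, d) under a
    diagonal-covariance GMM given as parallel parameter lists."""
    X = np.asarray(X, float)
    comps = []
    for w, mu, var in zip(weights, means, variances):
        comps.append(np.log(w)
                     - 0.5 * np.sum(np.log(2 * np.pi * var))
                     - 0.5 * np.sum((X - mu) ** 2 / var, axis=1))
    # log-sum-exp over mixture components, then average over frames
    return np.logaddexp.reduce(np.stack(comps), axis=0).mean()

def detect_synthetic(X, gmm_natural, gmm_synthetic, threshold=0.0):
    """Flag an utterance as synthetic when the average log-likelihood
    ratio favors the synthetic model (scoring sketch only)."""
    llr = gmm_avg_loglike(X, *gmm_synthetic) - gmm_avg_loglike(X, *gmm_natural)
    return bool(llr > threshold)
```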
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Daniel Erro; Eva Navas; Inmaculada Hernáez; Ibon Saratxaga
Voice conversion has traditionally focused on the spectrum. Current systems lack a solid prosody conversion method suitable for different speaking styles. Recently, the unit selection technique has been applied to transform emotional intonation contours. This paper goes one step further: it explores strategies for training and configuring the selection cost function in an emotion conversion application. The proposed system, which uses accent groups as basic intonation units and also performs conversion on phoneme durations and intensity, is evaluated by means of a carefully designed subjective test involving the big six emotions. Although the expressiveness of the converted sentences is still far from that of natural emotional speech, satisfactory results are obtained when different configurations are used for different emotions.
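A selection cost function of the kind discussed above is usually minimized with a Viterbi-style dynamic programming search over candidate units: each position pays a target cost and each transition a concatenation cost. The sketch below is a generic, hypothetical version; the paper's actual costs and unit inventory differ.

```python
def select_units(candidates, target_cost, concat_cost):
    """Viterbi search over per-position candidate units, minimizing the
    summed target + concatenation cost (generic unit-selection sketch)."""
    n = len(candidates)
    best = [dict() for _ in range(n)]   # best cumulative cost per candidate
    back = [dict() for _ in range(n)]   # backpointers for path recovery
    for c in candidates[0]:
        best[0][c] = target_cost(0, c)
    for i in range(1, n):
        for c in candidates[i]:
            costs = {p: best[i - 1][p] + concat_cost(p, c) for p in candidates[i - 1]}
            p = min(costs, key=costs.get)
            best[i][c] = costs[p] + target_cost(i, c)
            back[i][c] = p
    c = min(best[-1], key=best[-1].get)
    path = [c]
    for i in range(n - 1, 0, -1):
        c = back[i][c]
        path.append(c)
    return path[::-1]
```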
international conference on acoustics, speech, and signal processing | 2011
Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez
Currently, the statistical framework based on Hidden Markov Models (HMMs) plays a relevant role in speech synthesis, while voice conversion systems based on Gaussian Mixture Models (GMMs) are almost standard. In both cases, statistical modeling is applied to learn distributions of acoustic vectors extracted from speech signals, each vector containing a suitable parametric representation of one speech frame. The overall performance of the systems is often limited by the accuracy of the underlying speech parameterization and reconstruction method. The method presented in this paper allows accurate MFCC extraction and high-quality reconstruction of speech signals assuming a Harmonics plus Noise Model (HNM). Its suitability for high-quality HMM-based speech synthesis is shown through subjective tests.
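The canonical MFCC pipeline that such a parameterization builds on (power spectrum, mel filterbank, log, DCT) can be sketched for a single frame as below. This is not the paper's HNM-based extraction, just the textbook baseline it refines; all sizes are illustrative defaults.

```python
import numpy as np

def mfcc_frame(frame, fs, n_mels=24, n_ceps=13):
    """Minimal MFCC sketch for one frame: power spectrum -> triangular
    mel filterbank -> log -> type-II DCT (illustrative defaults)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # mel-spaced triangular filters between 0 Hz and Nyquist
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, len(power)))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(fbank @ power + 1e-12)
    # type-II DCT for decorrelation
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels))
    return dct @ logmel
```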
text speech and dialogue | 2004
Eva Navas; Inmaculada Hernáez; Amaia Castelruiz; Iker Luengo
This paper presents a database designed to extract prosodic models corresponding to emotional speech, to be used in speech synthesis for standard Basque. A database of acted speech has been recorded for the six basic emotions (anger, disgust, fear, joy, sadness and surprise), using a corpus containing both neutral texts and texts semantically related to emotion. Subjective evaluation of the database shows that the emotions are accurately identified, so it can be used to study prosodic models of emotion in Basque.
Speech Communication | 2016
Ibon Saratxaga; Jon Sanchez; Zhizheng Wu; Inma Hernaez; Eva Navas
Highlights: phase-information-based synthetic speech detectors (RPS, MGD) are analyzed; training using real attack samples and copy-synthesized material is evaluated; the detectors are evaluated against unknown attacks, including channel effects; the detectors work well against voice conversion and adapted synthetic speech impostors.
Taking advantage of the fact that most speech processing techniques neglect phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD), and the other based on the Relative Phase Shift (RPS). A classical magnitude-based MFCC system is also used as a baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from natural ones, aiming to alleviate the issue of obtaining real data to train the systems. The recently published ASVspoof 2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task, even against unknown attacks.
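As background for the MGD features mentioned above, plain group delay, tau(w) = -d(arg X(w))/dw, can be computed without phase unwrapping via the standard n*x(n) identity, as sketched below. The Modified Group Delay used in the paper adds cepstral smoothing and compression exponents, which are omitted here.

```python
import numpy as np

def group_delay(frame):
    """Group delay of one frame computed without phase unwrapping:
    tau(w) = (Xr*Yr + Xi*Yi) / |X|^2, where Y is the DFT of n*x(n)."""
    x = np.asarray(frame, float)
    X = np.fft.rfft(x)
    Y = np.fft.rfft(np.arange(len(x)) * x)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
```

For a pure delayed impulse the group delay is flat at the delay value, which makes a handy sanity check.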
international symposium on universal communication | 2008
K. Arrieta; Igor Leturia; Urtza Iturraspe; A.D. de Ilarraza; Kepa Sarasola; Inmaculada Hernáez; Eva Navas
AnHitz is a project promoted by the Basque Government to develop language technologies for the Basque language. The participants in AnHitz are research groups with very different backgrounds: text processing, speech processing and multimedia. The project aims to further develop existing language, speech and visual technologies for Basque: so far it has produced a set of 7 different language resources, 9 NLP tools and 5 applications. Moreover, in the last year of the project we are integrating, for the first time, such resources and tools (both existing and generated in the project) into a content management application for Basque with a natural language communication interface. This application consists of a Question Answering and a Cross-Lingual Information Retrieval system in the area of Science and Technology. The interaction between the system and the user will be in Basque (results of the CLIR module that are not in Basque will be translated through Machine Translation) using Speech Synthesis, Automatic Speech Recognition and a Visual Interface. The various resources, technologies and tools that we are developing are already at a very advanced stage, and the implementation of the content management application that integrates them all is in progress and due to be completed by October 2008.