Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Daniel Erro is active.

Publication


Featured research published by Daniel Erro.


IEEE Journal of Selected Topics in Signal Processing | 2014

Harmonics Plus Noise Model Based Vocoder for Statistical Parametric Speech Synthesis

Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez

This article explores the potential of the harmonics plus noise model of speech in the development of a high-quality vocoder applicable in statistical frameworks, particularly in modern speech synthesizers. It presents an extensive explanation of all the different alternatives considered during the design of the HNM-based vocoder, together with the corresponding objective and subjective experiments, and a careful description of its implementation details. Three aspects of the analysis have been investigated: refinement of the pitch estimation using quasi-harmonic analysis, study and comparison of several spectral envelope analysis procedures, and strategies to analyze and model the maximum voiced frequency. The performance of the resulting vocoder is shown to be similar to that of state-of-the-art vocoders in synthesis tasks.
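
For orientation, a generic form of the harmonics plus noise decomposition underlying such vocoders (the notation here is illustrative, not necessarily the paper's): the signal is modeled as a sum of harmonically related sinusoids below a time-varying maximum voiced frequency, plus a noise component:

$$ s(t) = \sum_{k=1}^{K(t)} A_k(t)\cos\big(2\pi k f_0(t)\,t + \phi_k(t)\big) + n(t), \qquad k f_0(t) \le F_{mv}(t), $$

where $f_0$ is the fundamental frequency, $A_k$ and $\phi_k$ are the harmonic amplitudes and phases, $F_{mv}$ is the maximum voiced frequency, and $n(t)$ carries the remaining (noise) energy.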


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling

Daniel Erro; Eva Navas; Inma Hernaez

Voice conversion methods based on frequency warping followed by amplitude scaling have been recently proposed. These methods modify the frequency axis of the source spectrum in such a manner that some significant parts of it, usually the formants, are moved towards their image in the target speaker's spectrum. Amplitude scaling is then applied to compensate for the differences between warped source spectra and target spectra. This article presents a fully parametric formulation of a frequency warping plus amplitude scaling method in which bilinear frequency warping functions are used. Introducing this constraint allows the conversion error to be described in the cepstral domain and minimized with respect to the parameters of the transformation through an iterative algorithm, even when multiple overlapping conversion classes are considered. The paper explores the advantages and limitations of this approach when applied to a cepstral representation of speech. We show that it achieves significant improvements in quality with respect to traditional methods based on Gaussian mixture models, with no loss in average conversion accuracy. Despite its relative simplicity, it achieves similar performance scores to state-of-the-art statistical methods involving dynamic features and global variance.
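
As a point of reference, the standard first-order all-pass (bilinear) frequency warping with parameter $\alpha$ maps a frequency $\omega$ to

$$ \tilde{\omega} = \omega + 2\arctan\frac{\alpha \sin\omega}{1 - \alpha\cos\omega}, \qquad |\alpha| < 1, $$

and, crucially for this method, acts on cepstral coefficient vectors as a fixed linear operator, so a frequency-warped and amplitude-scaled conversion can be written as $y \approx W(\alpha)\,x + s$, with $x$ and $y$ the source and converted cepstra and $s$ an additive cepstral correction. The exact parameterization and training criterion are the paper's; the formulas above are the generic textbook form.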


IEEE Transactions on Information Forensics and Security | 2015

Toward a Universal Synthetic Speech Spoofing Detection Using Phase Information

Jon Sanchez; Ibon Saratxaga; Inma Hernaez; Eva Navas; Daniel Erro; Tuomo Raitio

In the field of speaker verification (SV), it is nowadays feasible and relatively easy to create a synthetic voice to deceive a speech-driven biometric access system. This paper presents a synthetic speech detector that can be connected at the front-end or at the back-end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical Text-to-Speech (TTS) systems. The system described is a Gaussian Mixture Model (GMM) based binary classifier that uses natural and copy-synthesized signals obtained from the Wall Street Journal database to train the system models. Three different state-of-the-art vocoders are chosen and modeled using two sets of acoustic parameters: 1) relative phase shift and 2) canonical Mel Frequency Cepstral Coefficients (MFCC) parameters, as a baseline. The vocoder dependency of the system and multi-vocoder modeling features are thoroughly studied. Additional phase-aware vocoders are also tested. Several experiments are carried out, showing that the phase-based parameters perform better and are able to cope with new, unknown attacks. The final evaluations, testing synthetic TTS signals obtained from the Blizzard Challenge, validate our proposal.
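
A minimal sketch of the kind of two-class GMM scoring described above, using scikit-learn and random placeholder features (the feature extraction, component counts, and decision threshold are illustrative assumptions, not the paper's settings):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder per-frame feature matrices (n_frames x n_dims); in the paper these
# would be relative phase shift or MFCC parameters extracted from speech.
feats_natural = rng.normal(0.0, 1.0, size=(5000, 20))
feats_synthetic = rng.normal(0.5, 1.0, size=(5000, 20))

# One GMM per class: natural speech vs. (copy-)synthesized speech.
gmm_nat = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(feats_natural)
gmm_syn = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(feats_synthetic)

def llr(feats):
    # Average per-frame log-likelihood ratio; positive favours "natural".
    return gmm_nat.score(feats) - gmm_syn.score(feats)

feats_test = rng.normal(0.0, 1.0, size=(300, 20))
print("natural" if llr(feats_test) > 0.0 else "synthetic")  # threshold tuned on a dev set in practice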


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Emotion Conversion Based on Prosodic Unit Selection

Daniel Erro; Eva Navas; Inmaculada Hernáez; Ibon Saratxaga

Voice conversion has traditionally focused on the spectrum; current systems lack a solid prosody conversion method suitable for different speaking styles. Recently, the unit selection technique has been applied to transform emotional intonation contours. This paper goes one step further: it explores strategies for training and configuring the selection cost function in an emotion conversion application. The proposed system, which uses accent groups as basic intonation units and also converts phoneme durations and intensity, is evaluated by means of a carefully designed subjective test involving the big six emotions. Although the expressiveness of the converted sentences is still far from that of natural emotional speech, satisfactory results are obtained when different configurations are used for different emotions.
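
For context, a unit selection cost function of the usual form is what gets trained and configured here, with intonation units (accent groups) in place of the waveform units of concatenative synthesis; the formulation below is the generic one, not the paper's exact notation:

$$ \hat{u}_{1:N} = \arg\min_{u_{1:N}} \left[ \sum_{n=1}^{N} C^t(u_n, t_n) + \sum_{n=2}^{N} C^c(u_{n-1}, u_n) \right], $$

where $C^t$ measures how well a candidate contour $u_n$ matches the target specification $t_n$ and $C^c$ penalizes poor joins between consecutive units; the paper's training strategies amount to choosing the sub-costs and their weights, potentially per emotion.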


International Conference on Acoustics, Speech, and Signal Processing | 2011

HNM-based MFCC+F0 extractor applied to statistical speech synthesis

Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez

Currently, the statistical framework based on Hidden Markov Models (HMMs) plays a relevant role in speech synthesis, while voice conversion systems based on Gaussian Mixture Models (GMMs) are almost standard. In both cases, statistical modeling is applied to learn distributions of acoustic vectors extracted from speech signals, each vector containing a suitable parametric representation of one speech frame. The overall performance of the systems is often limited by the accuracy of the underlying speech parameterization and reconstruction method. The method presented in this paper allows accurate MFCC extraction and high-quality reconstruction of speech signals assuming a Harmonics plus Noise Model (HNM). Its suitability for high-quality HMM-based speech synthesis is shown through subjective tests.
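
A rough sketch of how cepstral coefficients might be obtained from a harmonic analysis rather than an FFT magnitude spectrum; the function name, mel grid, and coefficient counts are illustrative assumptions, and the paper's actual extractor is considerably more careful:

import numpy as np
from scipy.fftpack import dct

def mfcc_from_harmonics(harm_freqs_hz, harm_amps, fs=16000, n_mel=24, n_mfcc=13):
    # Map harmonic frequencies to the mel scale and resample their
    # log-amplitudes onto a uniform mel grid (a crude envelope estimate).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_grid = np.linspace(mel(0.0), mel(fs / 2.0), n_mel)
    log_env = np.interp(mel_grid, mel(np.asarray(harm_freqs_hz, float)), np.log(harm_amps))
    # Cepstrum = DCT of the log envelope; keep the first n_mfcc coefficients.
    return dct(log_env, type=2, norm="ortho")[:n_mfcc]

# Toy example: harmonics of a 100 Hz voice with a smoothly decaying envelope.
freqs = 100.0 * np.arange(1, 60)
amps = np.exp(-freqs / 2000.0)
print(mfcc_from_harmonics(freqs, amps))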


IberSPEECH | 2012

Improving the Quality of Standard GMM-Based Voice Conversion Systems by Considering Physically Motivated Linear Transformations

Tudor-Cătălin Zorilă; Daniel Erro; Inma Hernaez

This paper presents a new method to train traditional voice conversion functions based on Gaussian mixture models, linear transforms and cepstral parameterization. Instead of using statistical criteria, this method calculates a set of linear transforms that represent physically meaningful spectral modifications such as frequency warping and amplitude scaling. Our experiments indicate that the proposed training method leads to significant improvements in the average quality of the converted speech with respect to traditional statistical methods. This is achieved without modifying the input/output parameters or the shape of the conversion function.
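
Schematically (in generic notation, not necessarily the paper's), a classic GMM-based conversion function is $F(x) = \sum_i p_i(x)(A_i x + b_i)$ with unconstrained regression matrices $A_i$; the idea here is to keep the same functional form but constrain each class transform to a physically meaningful one:

$$ F(x) = \sum_{i=1}^{M} p_i(x)\,\big(W(\alpha_i)\,x + b_i\big), $$

where $p_i(x)$ are the GMM posterior probabilities, $W(\alpha_i)$ is the cepstral-domain matrix of a frequency warping with parameter $\alpha_i$, and $b_i$ implements amplitude scaling, so every learned parameter has a spectral interpretation.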


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Enhancing the intelligibility of statistically generated synthetic speech by means of noise-independent modifications

Daniel Erro; Tudor-Catalin Zorila; Yannis Stylianou

When speaking devices such as smartphones, tablet PCs, or GPS systems are used in noisy outdoor environments, the intelligibility of speech drops significantly. The drop is even more pronounced when synthetic speech is used. This article describes how a statistical parametric speech synthesis system trained on an ordinary synthesis database can be designed to generate highly intelligible speech, even at very low signal-to-noise ratios. By using a simple and flexible vocoder based on a full-band harmonic model, the proposed system applies deterministic, noise-independent modifications at several levels: speaking rate, average fundamental frequency level and range, energy contour over time, formant sharpness, and intensity of specific spectral bands. The degree of intelligibility achieved by the system has been evaluated by means of a large-scale subjective test, the results of which show that the suggested approach clearly outperforms a reference state-of-the-art TTS system and, in some conditions, even unmodified natural speech. In comparison with alternative systems evaluated in the same framework, the proposed one performs best in the scenarios with the lowest signal-to-noise ratio. Finally, the impact of the suggested modifications on naturalness, quality and similarity to the original natural voice is quantified by means of a subjective test.
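
As a toy illustration of just one of these modifications, the band-intensity one: the sketch below boosts a fixed spectral band of a signal frame by a fixed gain. The band edges and gain are arbitrary assumptions, and the actual system operates on vocoder parameters rather than on waveform FFTs:

import numpy as np

def boost_band(frame, fs, f_lo=1000.0, f_hi=4000.0, gain_db=6.0):
    # Raise the intensity of one spectral band by gain_db, leaving the rest intact.
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    spec[(freqs >= f_lo) & (freqs <= f_hi)] *= 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spec, n=len(frame))

fs = 16000
t = np.arange(fs) / fs
modified = boost_band(np.sin(2.0 * np.pi * 220.0 * t), fs)  # toy 220 Hz input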


Computer Speech & Language | 2017

Reversible speaker de-identification using pre-trained transformation functions

Carmen Magariños; Paula Lopez-Otero; Laura Docio-Fernandez; Eduardo Rodriguez-Banga; Daniel Erro; Carmen García-Mateo

Speaker de-identification approaches must accomplish three main goals: universality, naturalness and reversibility. The main drawback of the traditional approach to speaker de-identification using voice conversion techniques is its lack of universality, since a parallel corpus between the input and target speakers is necessary to train the conversion parameters. It is possible to use a synthetic target to overcome this issue, but this harms the naturalness of the resulting de-identified speech. Hence, this paper proposes a technique in which a pool of pre-trained transformations between a set of speakers is used as follows: given a new user to de-identify, the most similar speaker in the pool is chosen as the source speaker, and the speaker most dissimilar to that source speaker is chosen as the target speaker. Speaker similarity is measured using the i-vector paradigm, which is usually employed as an objective measure of speaker de-identification performance, leading to a system with high de-identification accuracy. The transformation method is based on frequency warping and amplitude scaling, in order to obtain natural-sounding speech while masking the identity of the speaker. In addition, compared to other voice conversion approaches, the proposed method is easily reversible. Experiments were conducted on the Albayzin database, and performance was evaluated in terms of objective and subjective measures. The results showed high de-identification success as well as great naturalness of the transformed voices. In addition, by making the transformation parameters available to a trusted holder, it is possible to invert the de-identification procedure, thus recovering the original speaker identity. The computational cost of the proposed approach is small, making it possible to produce de-identified speech in real time with a high level of naturalness.
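
A minimal sketch of the selection step described above, assuming cosine scoring of i-vectors (a common choice, though the paper only specifies the i-vector paradigm); i-vector extraction and the transformation pool itself are outside the sketch, and all names are hypothetical:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_source_and_target(user_ivec, pool_ivecs):
    # Source: the pool speaker most similar to the new user.
    source = max(pool_ivecs, key=lambda s: cosine(user_ivec, pool_ivecs[s]))
    # Target: the pool speaker most dissimilar to that source.
    target = min(pool_ivecs, key=lambda s: cosine(pool_ivecs[source], pool_ivecs[s]))
    return source, target  # the pre-trained source->target transform is then applied

rng = np.random.default_rng(1)
pool = {f"spk{i:02d}": rng.normal(size=400) for i in range(10)}  # toy 400-dim i-vectors
src, tgt = pick_source_and_target(rng.normal(size=400), pool)
print(src, tgt)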


Computer Speech & Language | 2015

Interpretable parametric voice conversion functions based on Gaussian mixture models and constrained transformations

Daniel Erro; Agustín Alonso; Luis Serrano; Eva Navas; Inma Hernaez

Highlights: new voice conversion functions based on bilinear frequency warping and constrained amplitude scaling; good overall conversion performance using more intuitive and informative parameters; useful as an analysis tool for visualizing spectral differences between voices or styles.

Voice conversion functions based on Gaussian mixture models and parametric speech signal representations are opaque in the sense that it is not straightforward to interpret the physical meaning of the conversion parameters. Following the line of recent works based on the frequency warping plus amplitude scaling paradigm, in this article we show that voice conversion functions can be designed according to physically meaningful constraints in such a manner that they become highly informative. The resulting voice conversion method can be used to visualize the differences between source and target voices or styles in terms of formant location in frequency, spectral tilt, and amplitude in a number of spectral bands.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Parametric representation for singing voice synthesis: A comparative evaluation

Onur Babacan; Thomas Drugman; Tuomo Raitio; Daniel Erro; Thierry Dutoit

Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis: the traditional pulse vocoder, the Deterministic plus Stochastic Model, the Harmonic plus Noise Model, and GlottHMM. The behavior of these techniques as a function of the singer type (baritone, countertenor and soprano) is studied. Second, the artifacts occurring in high-pitched voices are discussed and possible approaches to overcome them are suggested.

Collaboration


Dive into Daniel Erro's collaborations.

Top Co-Authors

Eva Navas
University of the Basque Country

Inma Hernaez
University of the Basque Country

Ibon Saratxaga
University of the Basque Country

Iñaki Sainz
University of the Basque Country

Jon Sanchez
University of the Basque Country

Agustín Alonso
University of the Basque Country

Igor Odriozola
University of the Basque Country

Inmaculada Hernáez
University of the Basque Country

David Tavarez
University of the Basque Country