Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Antonio Bonafonte is active.

Publications


Featured research published by Antonio Bonafonte.


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Voice Conversion Based on Weighted Frequency Warping

Daniel Erro; Asunción Moreno; Antonio Bonafonte

Any modification applied to speech signals has an impact on their perceptual quality. In particular, voice conversion to modify a source voice so that it is perceived as a specific target voice involves prosodic and spectral transformations that produce significant quality degradation. Choosing among the current voice conversion methods represents a trade-off between the similarity of the converted voice to the target voice and the quality of the resulting converted speech, both rated by listeners. This paper presents a new voice conversion method termed Weighted Frequency Warping that has a good balance between similarity and quality. This method uses a time-varying piecewise-linear frequency warping function and an energy correction filter, and it combines typical probabilistic techniques and frequency warping transformations. Compared to standard probabilistic systems, Weighted Frequency Warping results in a significant increase in quality scores, whereas the conversion scores remain almost unaltered. This paper carefully discusses the theoretical aspects of the method and the details of its implementation, and the results of an international evaluation of the new system are also included.
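
As an illustration of the central transformation, here is a minimal sketch of a piecewise-linear frequency warping applied to one magnitude spectrum. The breakpoint pairs and the interpolation-based resampling are illustrative assumptions, not the paper's exact formulation (which also combines GMM-weighted warping functions and an energy correction filter):

```python
import numpy as np

def piecewise_linear_warp(spectrum, src_freqs, tgt_freqs, sr=16000):
    """Warp a magnitude spectrum with a piecewise-linear frequency map.

    spectrum : magnitude spectrum, shape (n_bins,)
    src_freqs, tgt_freqs : matching breakpoints (Hz) defining the warping
                           function, e.g. formant positions of source/target.
    """
    n_bins = len(spectrum)
    freqs = np.linspace(0, sr / 2, n_bins)       # bin center frequencies
    # Map each source frequency f to its warped target frequency warp(f).
    warped = np.interp(freqs, src_freqs, tgt_freqs)
    # Resample the spectrum so that energy at f appears at warp(f).
    return np.interp(freqs, warped, spectrum)

# Example: push the region around 1 kHz slightly upward.
spec = np.abs(np.random.randn(257))
out = piecewise_linear_warp(spec, [0, 1000, 8000], [0, 1200, 8000])
```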


International Conference on Acoustics, Speech, and Signal Processing | 2006

Text-Independent Voice Conversion Based on Unit Selection

David Sündermann; Harald Höge; Antonio Bonafonte; Hermann Ney; Alan W. Black; Shrikanth Narayanan

So far, most voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of the source and target speakers. Since several applications (e.g., speech-to-speech translation or dubbing) require text-independent training, training techniques that use non-parallel data have been proposed over the last two years. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. A subjective experiment shows that this technique achieves the same performance as conventional text-dependent training.
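
A minimal sketch of the frame-pairing idea, assuming plain Euclidean distance between spectral feature vectors (a full unit-selection search would also include a concatenation cost between consecutive units):

```python
import numpy as np

def align_frames(src_feats, tgt_feats):
    """Pair each source frame with its closest target frame.

    src_feats : (n_src, dim) spectral features (e.g. MFCC or LSF vectors)
    tgt_feats : (n_tgt, dim)
    Returns one target index per source frame.
    """
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt).
    d = ((src_feats[:, None, :] - tgt_feats[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

src = np.random.randn(100, 13)
tgt = np.random.randn(120, 13)
pairs = list(zip(range(len(src)), align_frames(src, tgt)))
```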


IEEE Transactions on Audio, Speech, and Language Processing | 2010

INCA Algorithm for Training Voice Conversion Systems From Nonparallel Corpora

Daniel Erro; Asunción Moreno; Antonio Bonafonte

Most existing voice conversion systems, particularly those based on Gaussian mixture models, require a set of paired acoustic vectors from the source and target speakers to learn their corresponding transformation function. The alignment of phonetically equivalent source and target vectors is not problematic when the training corpus is parallel, which means that both speakers utter the same training sentences. However, in some practical situations, such as cross-lingual voice conversion, it is not possible to obtain such parallel utterances. With an aim towards increasing the versatility of current voice conversion systems, this paper proposes a new iterative alignment method that allows pairing phonetically equivalent acoustic vectors from nonparallel utterances from different speakers, even under cross-lingual conditions. This method is based on existing voice conversion techniques, and it does not require any phonetic or linguistic information. Subjective evaluation experiments show that the performance of the resulting voice conversion system is very similar to that of an equivalent system trained on a parallel corpus.
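
The iterative idea behind such an alignment can be sketched as follows. This toy version substitutes a least-squares linear map for the GMM-based conversion function and uses nearest-neighbour search for the pairing step, both simplifying assumptions:

```python
import numpy as np

def inca_style_alignment(src, tgt, n_iter=10):
    """Iteratively align nonparallel source/target acoustic vectors.

    src : (n_src, dim), tgt : (n_tgt, dim).
    Alternates (1) nearest-neighbour pairing of converted source
    vectors with target vectors and (2) re-estimation of the
    conversion function (here: a least-squares linear map).
    """
    W = np.eye(src.shape[1])                 # initial conversion: identity
    for _ in range(n_iter):
        conv = src @ W
        # Step 1: pair each converted source vector with nearest target.
        d = ((conv[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
        nn = d.argmin(axis=1)
        # Step 2: refit the linear map on the current pairing.
        W, *_ = np.linalg.lstsq(src, tgt[nn], rcond=None)
    return nn, W

src = np.random.randn(200, 12)
tgt = np.random.randn(250, 12)
pairs, W = inca_style_alignment(src, tgt)
```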


Signal Processing: Image Communication | 2002

Facial animation parameters extraction and expression recognition using Hidden Markov Models

Montse Pardàs; Antonio Bonafonte

The video analysis system described in this paper aims at facial expression recognition consistent with the MPEG-4 standardized facial animation parameters (FAPs). For this reason, two levels of analysis are necessary: low-level analysis to extract the MPEG-4-compliant parameters and high-level analysis to estimate the expression of the sequence using these low-level parameters. The low-level analysis is based on an improved active contour algorithm that uses high-level information derived from principal component analysis to locate the most significant contours of the face (eyebrows and mouth), and on motion estimation to track them. The high-level analysis takes as input the FAPs produced by the low-level analysis tool and, by means of a Hidden Markov Model classifier, detects the expression of the sequence.
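
The high-level stage can be sketched with one HMM per expression, each scored on a FAP sequence. The sketch below uses the hmmlearn package and an invented expression label set purely for illustration; the paper's model topology and feature details may differ:

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

EXPRESSIONS = ["joy", "surprise", "anger"]   # illustrative label set

def train_models(train_data, n_states=4):
    """train_data: {label: list of FAP sequences, each (T, n_faps)}."""
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)                  # stack sequences for fitting
        lengths = [len(s) for s in seqs]     # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)                    # Baum-Welch training
        models[label] = m
    return models

def classify(models, fap_seq):
    """Pick the expression whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda lbl: models[lbl].score(fap_seq))
```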


International Conference on Acoustics, Speech, and Signal Processing | 2005

A study on residual prediction techniques for voice conversion

David Sündermann; Antonio Bonafonte; Hermann Ney

Several well-studied voice conversion techniques use line spectral frequencies as features to represent the spectral envelopes of the processed speech frames. In order to return to the time domain, these features are converted to linear predictive coefficients that serve as coefficients of a filter applied to an unknown residual signal. We compare several residual prediction approaches that have already been proposed in the literature dealing with voice conversion. We also present a novel technique that outperforms the others in terms of voice conversion performance and sound quality.
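
The return path to the time domain described above can be sketched directly: the LSFs are converted to an LPC polynomial, which defines the all-pole synthesis filter applied to a (predicted) residual frame. This assumes the lsf2poly routine from the Python spectrum package rather than the authors' own code:

```python
import numpy as np
from scipy.signal import lfilter
from spectrum import lsf2poly   # pip install spectrum

def synthesize_frame(lsf, residual):
    """Filter a residual frame through the all-pole filter 1/A(z)
    whose coefficients come from the frame's line spectral frequencies."""
    a = lsf2poly(lsf)            # LSFs (radians, sorted) -> LPC polynomial A(z)
    return lfilter([1.0], a, residual)

# Example: a stable 10th-order envelope applied to a noise residual.
lsf = np.sort(np.random.uniform(0.1, np.pi - 0.1, 10))
frame = synthesize_frame(lsf, np.random.randn(160))
```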


International Conference on Acoustics, Speech, and Signal Processing | 2005

Comparative study of automatic phone segmentation methods for TTS

Jordi Adell; Antonio Bonafonte; Jon Ander Gómez; María José Castro

We present two novel approaches to phonetic speech segmentation. One is based on acoustic clustering plus dynamic time warping, and the other on a boundary-specific correction by means of a decision tree. The use of objective versus perceptual evaluations is discussed. The novel approaches clearly outperform the objective results of the baseline HMM-based system, achieving results comparable to the agreement between manual segmentations. We show how phonetic features can be successfully used for boundary detection together with HMMs. Finally, we point out the need for perceptual tests when evaluating segmentation systems.
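
The dynamic-time-warping ingredient can be illustrated with a generic DTW between two feature sequences; in a segmentation setting, boundaries known on one side (e.g. a reference with labelled phones) are projected through the optimal path onto the other. This is a textbook DTW, not the paper's clustering-based variant:

```python
import numpy as np

def dtw_path(X, Y):
    """Classic dynamic time warping between feature sequences
    X (n, d) and Y (m, d); returns the optimal alignment path."""
    n, m = len(X), len(Y)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack from (n, m) to (0, 0).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```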


International Conference on Acoustics, Speech, and Signal Processing | 2006

Prosody Generation for Speech-to-Speech Translation

Pablo Daniel Agüero; Jordi Adell; Antonio Bonafonte

This paper deals with speech synthesis in the framework of speech-to-speech translation. Our current focus is translating speeches or conversations between humans so that a third person can listen to them in their own language. In this framework the style is spoken rather than written, and the original speech carries a lot of non-linguistic information (such as speaker emotion). In this work we propose using prosodic features of the original speech to produce prosody in the target language. Relevant features are found with an unsupervised clustering algorithm that finds, in a bilingual speech corpus, intonation clusters in the source speech that are relevant in the target speech. Preliminary results already show a significant improvement in synthetic quality (from MOS = 3.40 to MOS = 3.65).
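
A toy version of the idea: cluster intonation features of the source-language speech and reuse the cluster identity as a prosodic label when generating the target. K-means and the three-feature description below stand in for the paper's unsupervised algorithm and are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def intonation_clusters(f0_contours, n_clusters=8):
    """Cluster per-phrase F0 features from the source speech.

    f0_contours : list of 1-D voiced F0 arrays (Hz, all > 0),
                  one per prosodic phrase.
    Features per phrase (illustrative): mean, range, and slope of log-F0.
    """
    feats = []
    for f0 in f0_contours:
        lf0 = np.log(f0)
        slope = np.polyfit(np.arange(len(lf0)), lf0, 1)[0]
        feats.append([lf0.mean(), np.ptp(lf0), slope])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.array(feats))
    return km   # km.labels_ become prosodic labels for target generation
```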


Speech Communication | 2000

The demiphone: an efficient contextual subword unit for continuous speech recognition

José B. Mariño; Albino Nogueiras; Pau Pachès-Leal; Antonio Bonafonte

In this paper, we introduce the demiphone as a context-dependent phonetic unit for continuous speech recognition. A phoneme is divided into two parts: a left demiphone that accounts for the left coarticulation and a right demiphone that copes with the right-hand-side context. This unit discards the dependence between the effects of the two side contexts, but it models the transition between phonemes as the triphone does. By concatenating a left demiphone and a right demiphone, a triphone can be built, although the left-context and right-context coarticulations are modeled independently. The main appeal of this unit stems from its reduced number (with respect to the number of triphones) and its capability to model left and right contexts not seen together in the training material. Thus, the demiphone combines in a simple way the advantages of smoothed parameter estimation with the ability to generalize. In the present work, the demiphone is motivated and experimentally supported. Furthermore, demiphones are compared with triphones smoothed and generalized by decision-tree state tying, accepted as the most powerful tool for coarticulation modeling at the present state of the art. The main conclusion of our work is that the demiphone simplifies the recognition system and yields better performance than the triphone, at least for small or moderate-size databases. This result may be explained by the ability of the demiphone to provide an excellent trade-off between detailed coarticulation modeling and proper parameter estimation.
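
The inventory argument is easy to see in code: counting triphone types versus demiphone types over a phone string (the unit notation and the example utterance are invented for illustration):

```python
def triphones(phones):
    """All context-dependent triphone types l-c+r in the sequence."""
    return {f"{l}-{c}+{r}" for l, c, r in zip(phones, phones[1:], phones[2:])}

def demiphones(phones):
    """Each phone split into halves: both units come from adjacent pairs."""
    units = set()
    for l, c in zip(phones, phones[1:]):
        units.add(f"{l}-{c}")   # left demiphone of c (left context l)
        units.add(f"{l}+{c}")   # right demiphone of l (right context c)
    return units

phones = "sil d e m i f o n sil".split()
print(len(triphones(phones)), "triphone types vs",
      len(demiphones(phones)), "demiphone types")
```

With n phone types there are at most 2n^2 demiphones against n^3 triphones, which is what makes robust parameter estimation easier.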


International Conference on Acoustics, Speech, and Signal Processing | 2002

Corpus based extraction of quantitative prosodic parameters of stress groups in Spanish

David Escudero; Valentín Cardeñoso; Antonio Bonafonte

We introduce a new corpus-based technique to model the prosodic information contained in spoken utterances. Taking stress groups and intonation groups as the structural building blocks and Bézier parametric functions to approximate F0 contours, we propose a statistical modeling of the relevant categories of stress groups. These models can be directly exploited in speech synthesis tasks to obtain more natural intonation patterns, especially for text-reading applications. Suggestions are also made as to the utility of these statistical models in classification and recognition tasks.
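
The curve-fitting step can be sketched as a linear least-squares problem: with fixed, uniformly spaced parameter values t, a Bézier curve is linear in its control points, so they can be solved for directly. Uniform parameterisation over the stress group is an assumption made here for simplicity:

```python
import numpy as np
from scipy.special import comb

def fit_bezier(f0, degree=3):
    """Least-squares fit of a Bezier curve to an F0 contour.

    f0 : 1-D array of F0 values over a stress group.
    With fixed parameters t, the curve is linear in the control points:
        f0(t) ~ sum_k  C(n, k) (1 - t)^(n - k) t^k * P_k
    so the P_k can be solved by ordinary least squares.
    """
    t = np.linspace(0.0, 1.0, len(f0))
    n = degree
    # Bernstein basis matrix, shape (len(f0), degree + 1).
    B = np.stack([comb(n, k) * (1 - t) ** (n - k) * t ** k
                  for k in range(n + 1)], axis=1)
    ctrl, *_ = np.linalg.lstsq(B, f0, rcond=None)
    return ctrl, B @ ctrl      # control points and fitted contour

f0 = 120 + 20 * np.sin(np.linspace(0, np.pi, 50))   # toy rise-fall contour
ctrl, fitted = fit_bezier(f0)
```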


International Conference on Spoken Language Processing | 1996

Duration modeling with expanded HMM applied to speech recognition

Antonio Bonafonte; Josep Vidal; Albino Nogueiras

The occupancy of the HMM states is modeled by means of a Markov chain. A linear estimator is introduced to compute the probabilities of the Markov chain. The resulting distribution function (DF) represents the observed data accurately, and representing the DF as a Markov chain allows the use of standard HMM recognizers. The increase in complexity is negligible in training and strongly limited during recognition. Experiments on acoustic-phonetic decoding show that the phone recognition rate increases from 60.6% to 61.1%. Furthermore, on a database-inquiry task, where phones are used as subword units, the correct word rate increases from 88.2% to 88.4%.
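
The expansion trick can be illustrated directly: one HMM state is replaced by a short left-to-right chain of substates that share its emission distribution, so a standard recognizer still sees an ordinary HMM while the total occupancy follows a richer (negative-binomial-like) duration distribution instead of a geometric one. The simple moment match below stands in for the paper's linear estimator:

```python
import numpy as np

def expand_state(mean_dur, n_sub):
    """Transition matrix of a left-to-right chain of n_sub tied substates
    replacing one HMM state. The shared self-loop probability is chosen
    so the expected total occupancy equals mean_dur frames (a simple
    moment match, not the paper's linear estimator)."""
    # Each substate stays with prob p; expected stay = 1 / (1 - p).
    # n_sub substates => expected total duration = n_sub / (1 - p).
    p = max(0.0, 1.0 - n_sub / mean_dur)
    A = np.zeros((n_sub, n_sub + 1))       # last column = exit transition
    for i in range(n_sub):
        A[i, i] = p                        # self-loop
        A[i, i + 1] = 1.0 - p              # advance to next substate / exit
    return A

A = expand_state(mean_dur=8.0, n_sub=3)    # e.g. an 8-frame phone state
```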

Collaboration


Dive into Antonio Bonafonte's collaborations.

Top Co-Authors

Asunción Moreno, Polytechnic University of Catalonia
José B. Mariño, Polytechnic University of Catalonia
Jordi Adell, Polytechnic University of Catalonia
Pablo Daniel Agüero, National University of Mar del Plata
Santiago Pascual, Polytechnic University of Catalonia
David Escudero, University of Valladolid
Hermann Ney, RWTH Aachen University
Albino Nogueiras, Polytechnic University of Catalonia