Jordi Adell
Polytechnic University of Catalonia
Publications
Featured research published by Jordi Adell.
international conference on acoustics, speech, and signal processing | 2005
Jordi Adell; Antonio Bonafonte; Jon Ander Gómez; María José Castro
We present two novel approaches to phonetic speech segmentation. One is based on acoustic clustering plus dynamic time warping; the other is based on a boundary-specific correction by means of a decision tree. The use of objective versus perceptual evaluation is discussed. The novel approaches clearly outperform the objective results of the baseline HMM system and achieve results comparable to the agreement between manual segmentations. We show how phonetic features can be successfully used for boundary detection together with HMMs. Finally, we point out the need for perceptual tests to evaluate segmentation systems.
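The dynamic-time-warping step mentioned above can be sketched as follows. This is a minimal, generic DTW over 1-D feature frames with an absolute-difference local cost, purely illustrative and not the authors' implementation:

```python
def dtw_distance(a, b):
    """Classic dynamic time warping with an absolute-difference local cost.

    Returns the minimum accumulated cost of aligning sequence `a`
    against sequence `b` under the usual three alignment moves.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # Accumulated-cost matrix with an extra border row/column of infinities.
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

In the segmentation setting, the aligned sequences would be frame-level acoustic features and cluster prototypes; the boundary positions then fall out of the optimal alignment path.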
international conference on acoustics, speech, and signal processing | 2006
Pablo Daniel Agüero; Jordi Adell; Antonio Bonafonte
This paper deals with speech synthesis in the framework of speech-to-speech translation. Our current focus is translating speeches or conversations between humans so that a third person can listen to them in their own language. In this framework the style is spoken rather than written, and the original speech carries a great deal of non-linguistic information (such as speaker emotion). In this work we propose using prosodic features of the original speech to produce prosody in the target language. Relevant features are found with an unsupervised clustering algorithm that searches a bilingual speech corpus for intonation clusters in the source speech that are relevant in the target speech. Preliminary results already show a significant improvement in synthetic quality (from MOS = 3.40 to MOS = 3.65).
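The unsupervised clustering of intonation contours could, in its simplest form, look like a k-means over fixed-length normalised pitch vectors. The routine below is a hedged sketch under that assumption (deterministic seeding, toy contours), not the paper's actual algorithm:

```python
def kmeans(points, k, iters=20):
    """Plain k-means on equal-length contour vectors (tuples of floats)."""
    # Deterministic init: take the first k contours as seeds.
    centroids = list(points[:k])
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        # Assign each contour to its nearest centroid (squared L2 distance).
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
            groups[nearest].append(p)
        # Recompute each centroid as the mean contour of its group.
        for c, g in enumerate(groups):
            if g:
                centroids[c] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centroids, groups
```

On toy data, rising contours such as `(0.0, 1.0, 2.0)` and falling contours such as `(2.0, 1.0, 0.0)` end up in separate clusters; the paper's method additionally ties source-side clusters to their relevance in the target speech.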
Speech Communication | 2012
Jordi Adell; David Escudero; Antonio Bonafonte
Until now, speech synthesis has mainly involved reading-style speech. Today, however, text-to-speech systems must provide a variety of styles, because users expect these interfaces to do more than just read information. If synthetic voices are to be integrated into future technology, they must simulate the way people talk rather than the way people read. Existing knowledge about how disfluencies occur has made it possible to propose a general framework for synthesising them. We propose a model based on the definition of disfluency and the concept of underlying fluent sentences. The model combines the parameters of standard prosodic models for fluent speech with local modifications of prosodic parameters near the interruption point. The constituents of the local models for filled pauses are derived from the analysis corpus, and the constituents' prosodic parameters are predicted via linear regression. We also discuss the implementation details of the model when used in a real speech synthesis system. Objective and perceptual evaluations showed that the proposed models outperform the baseline model. Perceptual evaluation of the full system showed that filled pauses can be synthesised without decreasing overall naturalness, and users stated that the speech produced is even more natural than speech produced without filled pauses.
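The linear-regression step can be illustrated with ordinary least squares in one dimension. The choice of predictor and target below (a context feature predicting a filled-pause duration) is an assumption for illustration, not the paper's actual feature set:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Centred sums of squares and cross-products.
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept
```

In practice the paper's local models would regress several prosodic parameters (duration, pitch, energy) on multiple context features, but the fitting principle is the same.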
text, speech and dialogue | 2007
Jordi Adell; Antonio Bonafonte; David Escudero
Speech synthesis techniques have already reached a high level of naturalness. However, they are usually evaluated on text-reading tasks. New applications will instead demand conversational speech, a style in which disfluencies are crucial. This paper presents a system that predicts filled pauses and synthesises them. Objective results show that they can be inserted with 96% precision and 58% recall. Perceptual results even showed that inserting them increases the naturalness of synthetic speech.
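The precision and recall figures follow the standard definitions over inserted filled pauses. A minimal sketch, with made-up counts rather than the paper's data:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision/recall over predicted insertions.

    precision: fraction of inserted pauses that were correct.
    recall:    fraction of reference pauses that were inserted.
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```

For example, 48 correct insertions out of 50 predicted gives the 96% precision reported above; recall depends on how many reference pauses were missed.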
international conference on acoustics, speech, and signal processing | 2006
Jordi Adell; Pablo Daniel Agüero; Antonio Bonafonte
Unit-selection techniques lead the state of the art in speech synthesis. Building new voices requires automatic segmentation of the speech databases; these databases may contain errors, and the segmentation process may introduce more. High-quality systems require a significant effort to find and correct these errors. Phonetic transcription is crucial and is one of the manually supervised tasks. The ability to automatically remove incorrectly transcribed units from the inventory helps make the process more automatic. Here we present a new technique based on speech recognition confidence measures that removes 90% of the incorrectly transcribed units from a database, at the cost of losing only 10% of the correctly transcribed units.
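The pruning step itself reduces to thresholding a per-unit confidence score. A minimal sketch, where the `Unit` record and the threshold value are illustrative assumptions (the paper's contribution is the confidence measure, not the filter):

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str
    confidence: float  # e.g. a posterior-based recognition confidence in [0, 1]

def prune_inventory(units, threshold=0.5):
    """Keep only units whose transcription confidence clears the threshold."""
    return [u for u in units if u.confidence >= threshold]
```

The threshold trades off the two rates reported above: raising it removes more mistranscribed units but also discards more correct ones.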
international conference on acoustics, speech, and signal processing | 2010
Jordi Adell; Antonio Bonafonte; David Escudero
This paper presents a new approach to the synthesis of filled pauses, tackled from the point of view of disfluent speech synthesis. Based on the synthetic disfluent speech model, we analyse the features that describe filled pauses and propose a model to predict them. The model was implemented and perceptually evaluated with successful results.
Journal on Multimodal User Interfaces | 2007
Olivier Martin; Irene Kotsia; Ioannis Pitas; Arman Savran; Jordi Adell; Ana Huerta; Raphaël Sebbe
This paper describes a natural and intuitive way to create expressive facial animations, using a novel approach based on the so-called ‘multimodal caricatural mirror’ (MCM). Taking an audio-visual video sequence of the user’s face as input, the MCM generates a facial animation in which the prosody and the facial expressions of emotion can either be reproduced or amplified. The user can thus simulate an emotion and see the animation it produces almost instantly, as with a regular mirror. In addition, the MCM can amplify the emotions of selected parts of the input video sequence while leaving other parts unchanged. It therefore constitutes a novel approach to the design of highly expressive facial animation, as the affective content of the animation can be modified by post-processing operations.
SSW | 2004
Jordi Adell; Antonio Bonafonte
conference of the international speech communication association | 2009
Santiago Planet; Ignasi Iriondo; Joan-Claudi Socoró; Carlos Monzo; Jordi Adell; Quatre Camins
Procesamiento Del Lenguaje Natural | 2005
Jordi Adell; Antonio Bonafonte; David Escudero