David Escudero
University of Valladolid
Publication
Featured research published by David Escudero.
international conference on acoustics, speech, and signal processing | 2002
David Escudero; Valentín Cardeñoso; Antonio Bonafonte
We introduce a new corpus-based technique to model the prosodic information contained in spoken utterances. Taking the stress group and the intonation group as the structural building blocks, and using Bézier parametric functions to approximate F0 contours, we propose a statistical model of the relevant categories of stress groups. These models can be used directly in speech synthesis to obtain more natural intonation patterns, especially for text-reading applications. We also suggest how these statistical models could be used in classification and recognition tasks.
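As an illustration of the curve-fitting step mentioned in this abstract, the following minimal sketch fits a cubic Bézier function to the F0 samples of one stress group by least squares. It assumes a one-dimensional Bézier function over normalised time and uniformly spaced samples; the function names (`bernstein`, `bezier_value`, `fit_bezier`) and the sample values are hypothetical and are not taken from the paper.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def bezier_value(control, t):
    """Evaluate a 1-D Bezier function with the given control values at t."""
    n = len(control) - 1
    return sum(c * bernstein(n, i, t) for i, c in enumerate(control))


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_bezier(f0, degree=3):
    """Least-squares control values for F0 samples (assumed uniformly spaced)."""
    ts = [k / (len(f0) - 1) for k in range(len(f0))]
    basis = [[bernstein(degree, i, t) for i in range(degree + 1)] for t in ts]
    size = degree + 1
    # Normal equations: (B^T B) c = B^T y
    ata = [[sum(row[i] * row[j] for row in basis) for j in range(size)]
           for i in range(size)]
    atb = [sum(row[i] * y for row, y in zip(basis, f0)) for i in range(size)]
    return solve_linear(ata, atb)


# Synthetic F0 values in Hz, generated from known control values (toy data).
true_control = [120.0, 180.0, 150.0, 110.0]
samples = [bezier_value(true_control, k / 9) for k in range(10)]
fitted = fit_bezier(samples)
```

Each stress group is then summarised by four control values, which is the kind of compact representation a statistical model of stress-group categories could be built on.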
Speech Communication | 2012
Jordi Adell; David Escudero; Antonio Bonafonte
Until now, speech synthesis has mainly involved reading-style speech. Today, however, text-to-speech systems must provide a variety of styles because users expect these interfaces to do more than just read information. If synthetic voices are to be integrated into future technology, they must simulate the way people talk instead of the way people read. Existing knowledge about how disfluencies occur has made it possible to propose a general framework for synthesising disfluencies. We propose a model based on the definition of disfluency and the concept of underlying fluent sentences. The model combines the parameters of standard prosodic models for fluent speech with local modifications of prosodic parameters near the interruption point. The constituents of the local models for filled pauses are derived from the analysis corpus, and their prosodic parameters are predicted via linear regression analysis. We also discuss the implementation details of the model when used in a real speech synthesis system. Objective and perceptual evaluations showed that the proposed models outperformed the baseline model. Perceptual evaluations of the system showed that it is possible to synthesise filled pauses without decreasing the overall naturalness of the system, and users stated that the resulting speech is even more natural than speech produced without filled pauses.
text, speech and dialogue | 2007
Jordi Adell; Antonio Bonafonte; David Escudero
Speech synthesis techniques have already reached a high level of naturalness. However, they are often evaluated on text-reading tasks. New applications will call for conversational speech instead, and disfluencies are crucial in such a style. This paper presents a system that predicts filled pauses and synthesises them. Objective results show that they can be inserted with 96% precision and 58% recall. Perceptual results even show that their insertion increases the naturalness of synthetic speech.
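For readers unfamiliar with the two figures reported above, this short sketch shows how precision and recall are computed for an insertion task. The positions are toy values chosen for illustration, not data from the paper, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted insertion points against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy word positions where a filled pause occurs (reference) or is predicted.
reference = {3, 7, 12, 20, 31}
predicted = {3, 12, 20, 25}
precision, recall = precision_recall(predicted, reference)
```

High precision with lower recall, as in the abstract, means that the pauses the system inserts are almost always well placed, while a number of the places where speakers actually produced a filled pause are left without one.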
Speech Communication | 2012
David Escudero; Lourdes Aguilar; Maria del Mar Vanrell; Pilar Prieto
A set of tools for analyzing the inconsistencies observed in a Cat_ToBI labeling experiment is presented. We formalize and use the metrics that are commonly used in inconsistency tests. The metrics are systematically applied to analyze the robustness of every symbol and every pair of transcribers. The results reveal agreement rates for this study that are comparable to those of previous ToBI inter-transcriber reliability tests. The inter-transcriber confusion rates are transformed into distance matrices so that multidimensional scaling can be used to visualize the confusion between the different ToBI symbols and the disagreement between the raters. Potentially different labeling criteria are identified, and subsets of symbols that are candidates for merging are proposed.
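The step from confusion rates to a distance matrix can be sketched as follows. The transformation used here (one minus the symmetrised confusion rate) is an assumption made for illustration; the abstract does not state the exact formula. The symbol names and counts are toy values, and `confusion_to_distance` is a hypothetical helper.

```python
def confusion_to_distance(counts):
    """Turn a confusion-count matrix into a symmetric distance matrix.

    counts[i][j] = times one transcriber used symbol i where another used j.
    Assumed distance: 1 - mean of the two conditional confusion rates.
    """
    n = len(counts)
    row_totals = [sum(row) for row in counts]
    rate = [[counts[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist[i][j] = 1.0 - (rate[i][j] + rate[j][i]) / 2.0
    return dist


# Toy confusion counts between two transcribers for three pitch-accent labels.
symbols = ["L*", "H*", "L+H*"]
counts = [
    [40, 5, 5],
    [4, 30, 16],
    [6, 14, 30],
]
distances = confusion_to_distance(counts)
```

A matrix like `distances` is symmetric with a zero diagonal, so it can be passed to any multidimensional scaling routine that accepts precomputed dissimilarities. Symbols that are often confused end up close together in the resulting plot.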
international conference on acoustics, speech, and signal processing | 2004
Valentín Cardeñoso; David Escudero
Data scarcity in corpus-based intonation modelling for text-to-speech (TTS) applications is addressed. Multiple model dictionaries are proposed to predict patterns not found in the training corpus. A grouping strategy is proposed to improve models of classes without a high enough number of training samples. An experimental study of this strategy shows that better pitch profiles can be predicted in this way.
international conference on acoustics, speech, and signal processing | 2010
Jordi Adell; Antonio Bonafonte; David Escudero
This paper presents a new approach to the synthesis of filled pauses. The problem is tackled from the point of view of disfluent speech synthesis. Based on the synthetic disfluent speech model, we analyse the features that describe filled pauses and propose a model to predict them. The model was implemented and perceptually evaluated with successful results.
WOCCI 2017: 6th International Workshop on Child Computer Interaction | 2017
Erika Godde; Gérard Bailly; David Escudero; Marie-Line Bosse; Estelle Gillet-Perret
We analyze readings of the same reference text by 116 children. We show that several factors strongly affect the subjective rating of fluency, notably the number of correct words, repetitions, errors, and syllables spelled per minute. We succeeded in predicting four subjective scores (each rated between 1 and 4 by human raters) from these objective measurements with rather high precision (R > .8 for 3 out of 4 scores). This opens the way for automatic multidimensional assessment of reading fluency using calibrated texts.
Computer Speech & Language | 2017
David Escudero; César González; Yurena Gutiérrez; Emma Rodero
Novel methodology for comparing the style of different speakers or groups of speakers. Sequences of automatic Sp_ToBI labels allow the characterization of speaking style. Distance metrics based on conditional entropy yield information about the characteristic style patterns. The characteristic patterns identified allow informants to discriminate the speaking style in perception tests. The methodology gives information on how speakers organize their discourse with a communicative intention. This paper presents a novel methodology to characterize the style of different speakers or groups of speakers. This methodology uses sequences of prosodic labels (automatic Sp_ToBI labels) to compare and differentiate these speaking styles. A set of metrics based on conditional entropy is used to compute the distance between two speakers or groups of speakers depending on their use of sequences of prosodic labels. Additionally, the most contrastive sequences of labels are identified as characteristic patterns of the speaking styles represented in a given corpus. When this methodology is applied to a corpus of radio news items, the most frequent prosodic patterns coincide with those previously characterized in studies of radio style. Finally, a perceptual test verifies that the participants attribute these characteristic patterns to the radio news style.
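The quantity underlying these metrics can be illustrated with a minimal sketch that estimates the conditional entropy of the next prosodic label given the previous one from a label sequence. This shows only the basic building block; the paper's actual distance metrics between speakers are not specified in the abstract and are not reproduced here. The label sequences are toy placeholders and `conditional_entropy` is a hypothetical helper name.

```python
from collections import Counter
from math import log2


def conditional_entropy(labels):
    """Estimate H(next | previous) in bits from the bigrams of a label sequence."""
    pairs = Counter(zip(labels, labels[1:]))
    total = sum(pairs.values())
    prev_counts = Counter()
    for (prev, _), n in pairs.items():
        prev_counts[prev] += n
    h = 0.0
    for (prev, _), n in pairs.items():
        # p(prev, next) * log2 p(next | prev)
        h -= (n / total) * log2(n / prev_counts[prev])
    return h


# Toy label sequences (placeholders, not real Sp_ToBI annotations).
predictable = ["A", "B", "A", "B", "A", "B"]
varied = ["A", "B", "A", "C", "A", "B", "A", "C"]
h_predictable = conditional_entropy(predictable)
h_varied = conditional_entropy(varied)
```

A speaker whose label sequences are highly predictable yields a value near zero, while more varied sequencing yields a higher value, which is what makes this kind of measure usable for contrasting speaking styles.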
Journal on Multimodal User Interfaces | 2015
Hector Olmedo; David Escudero; Valentín Cardeñoso
We seek a markup language for modeling scenes, behavior and interaction. The approach integrates components from multimodal interaction applications with 3D graphical environments, and it reuses existing markup languages for describing graphics and for graphical and spoken interaction, following the interactive movie metaphor. By defining this language, we hope to provide a common framework for developing applications that allow multimodal interaction in 3D scenes. To this end, we have defined the basis of an architecture that integrates the components of such multimodal interaction applications in 3D virtual environments.
IEE Proceedings - Vision, Image, and Signal Processing | 2003
Javier Ortega-Garcia; Julian Fierrez-Aguilar; D. Simon; J. Gonzalez; Marcos Faundez-Zanuy; V. Espinosa; A. Satue; Inma Hernaez; Juan J. Igarza; Carlos Vivaracho; David Escudero; Q.-I. Moro