Vincent Colotte
University of Lorraine
Publications
Featured research published by Vincent Colotte.
International Conference on Acoustics, Speech, and Signal Processing | 2000
Vincent Colotte; Yves Laprie
This paper presents a speech signal transformation which selectively slows down speech signals and enhances some important acoustic cues. This transformation can be used not only for hearing aids but also for second language acquisition, by facilitating oral comprehension. Selective slowing down relies on the TD-PSOLA synthesis method, and an automatic pitch marking algorithm was designed so that the method can be applied without manual intervention. The strategy used to control the slowing down exploits a spectral variation function which locates rapid spectral changes. The enhancement simply consists of amplifying stop bursts and unvoiced fricatives; these acoustic cues are detected automatically by examining energy criteria. This approach was evaluated in the context of second language acquisition, more precisely by measuring improvements in oral comprehension. The transformations were triggered properly, i.e. the modified signal regions were those expected to be modified, and experiments show that oral comprehension is improved.
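As a rough, hypothetical illustration of such a control strategy (not the authors' implementation), the Python sketch below computes a simple spectral variation function from short-time magnitude spectra and maps it to per-frame slowing factors; the frame sizes, the quantile threshold, and the maximum slowing factor are assumptions, and the actual time-scale modification (TD-PSOLA in the paper) is left to a separate back end.

```python
import numpy as np

def spectral_variation(signal, sr, frame_len=0.025, hop=0.010, n_fft=512):
    """Frame the signal, compute magnitude spectra, and return the
    Euclidean distance between consecutive frames: a simple spectral
    variation function that peaks at rapid spectral changes."""
    flen = int(frame_len * sr)
    fhop = int(hop * sr)
    window = np.hanning(flen)
    frames = [signal[i:i + flen] * window
              for i in range(0, len(signal) - flen, fhop)]
    spectra = np.abs(np.fft.rfft(np.array(frames), n=n_fft, axis=1))
    return np.linalg.norm(np.diff(spectra, axis=0), axis=1)

def slowing_factors(variation, max_factor=2.0, quantile=0.8):
    """Map spectral variation to per-frame time-scale factors: frames
    with rapid spectral change (top quantile) are slowed down by up to
    `max_factor`, steadier frames are left unchanged.  The quantile and
    the maximum factor are illustrative choices."""
    threshold = np.quantile(variation, quantile)
    factors = np.ones_like(variation)
    mask = variation >= threshold
    # Scale the slowing factor with how far the variation exceeds the threshold.
    factors[mask] = 1.0 + (max_factor - 1.0) * (
        (variation[mask] - threshold) / (variation.max() - threshold + 1e-9))
    return factors

# The resulting per-frame factors would then drive a TD-PSOLA (or any other
# time-scale modification) back end that stretches only the flagged regions.
```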
EURASIP Journal on Audio, Speech, and Music Processing | 2013
Slim Ouni; Vincent Colotte; Utpala Musti; Asterios Toutios; Brigitte Wrobel-Dautcourt; Marie-Odile Berger; Caroline Lavecchia
This paper presents a bimodal acoustic-visual synthesis technique that concurrently generates the acoustic speech signal and a 3D animation of the speaker’s outer face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. In the visual domain, we mainly focus on the dynamics of the face rather than on rendering. The proposed technique overcomes the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. The different synthesis steps are similar to typical concatenative speech synthesis but are generalized to the acoustic-visual domain. The bimodal synthesis was evaluated using perceptual and subjective evaluations. The overall outcome of the evaluation indicates that the proposed bimodal acoustic-visual synthesis technique provides intelligible speech in both acoustic and visual channels.
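A minimal sketch of the underlying idea, not the authors' system: each stored diphone carries both its acoustic segment and the corresponding trajectory of facial parameters, and the two channels are always selected and concatenated together, so they cannot drift apart. The field names and the absence of any join smoothing are simplifications for illustration.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BimodalDiphone:
    """A diphone unit carrying both channels of the bimodal signal."""
    label: str         # e.g. "a-b"
    audio: np.ndarray  # waveform samples for this diphone
    visual: np.ndarray # (n_frames, n_face_params) facial-parameter trajectory

def concatenate(units):
    """Concatenate selected diphones jointly in both channels, so the
    acoustic and visual streams stay synchronized by construction.
    (A real system would also smooth the joins; omitted here.)"""
    audio = np.concatenate([u.audio for u in units])
    visual = np.vstack([u.visual for u in units])
    return audio, visual
```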
International Conference on Statistical Language and Speech Processing | 2018
Amal Houidhek; Vincent Colotte; Zied Mnasri; Denis Jouvet
This paper investigates the use of deep neural networks (DNN) for Arabic speech synthesis. In parametric speech synthesis, whether HMM-based or DNN-based, each speech segment is described with a set of contextual features. These contextual features correspond to linguistic, phonetic and prosodic information that may affect the pronunciation of the segments. Gemination and vowel quantity (short vowel vs. long vowel) are two particular and important phenomena in the Arabic language. Hence, it is worth investigating whether those phenomena must be handled by using specific speech units, or whether their specification in the contextual features is enough. Consequently, four modelling approaches are evaluated by considering geminated consonants (respectively long vowels) either as fully-fledged phoneme units or as the same phoneme as their simple (respectively short) counterparts. Although no significant difference was observed in previous studies relying on HMM-based modelling, this paper examines these modelling variants in the framework of DNN-based speech synthesis. Listening tests are conducted to evaluate the four modelling approaches, and to assess the performance of DNN-based Arabic speech synthesis with respect to the previous HMM-based approach.
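To make the two modelling variants concrete, the hypothetical sketch below encodes gemination and vowel quantity either in the unit inventory itself (fully-fledged units) or only in the contextual features passed to the DNN; the unit-naming convention and feature names are illustrative assumptions, not the paper's actual label format.

```python
def make_units(phones, use_specific_units=True):
    """Build (unit_name, contextual_features) pairs for a phone sequence.

    Each phone is a (symbol, geminated, long_vowel) tuple.  With
    `use_specific_units`, gemination and vowel length are encoded in the
    unit inventory itself (e.g. 'b_gem'); otherwise the base phoneme is
    kept and the information only appears in the contextual features fed
    to the DNN.  Naming conventions here are illustrative.
    """
    units = []
    for symbol, geminated, long_vowel in phones:
        features = {"geminated": int(geminated), "long_vowel": int(long_vowel)}
        if use_specific_units:
            name = symbol + ("_gem" if geminated else "") + ("_long" if long_vowel else "")
        else:
            name = symbol
        units.append((name, features))
    return units

# Variant with fully-fledged units vs. variant relying only on contextual features:
phones = [("b", True, False), ("a", False, True)]
print(make_units(phones, use_specific_units=True))   # [('b_gem', ...), ('a_long', ...)]
print(make_units(phones, use_specific_units=False))  # [('b', ...), ('a', ...)]
```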
Conference of the International Speech Communication Association | 2016
Slim Ouni; Vincent Colotte; Sara Dahmani; Soumaya Azzi
Within the framework of developing expressive audiovisual speech synthesis, an acoustic and visual analysis of expressive acted speech is proposed in this paper. Our purpose is to identify the main characteristics of audiovisual expressions that need to be integrated during synthesis to provide believable emotions to the virtual 3D talking head. We conducted a case study of a semi-professional actor who uttered a set of sentences for 6 different emotions in addition to neutral speech. Audio and motion-capture data were recorded concurrently, and the acoustic and visual data were analyzed. The main finding is that, although some expressions were not well identified, others were well characterized in both the acoustic and visual spaces.
Proceedings of the 3rd Symposium on Facial Analysis and Animation | 2012
Utpala Musti; Caroline Lavecchia; Vincent Colotte; Slim Ouni; Brigitte Wrobel-Dautcourt; Marie-Odile Berger
In the vast majority of recent works, data-driven audiovisual speech synthesis, i.e., the generation of face animation together with the corresponding acoustic speech, is still treated as the synchronization of two independent sources: synthesized acoustic speech (or natural speech aligned with text) and the face animation. However, achieving perfect synchronization between these two streams is not straightforward and presents several challenges related to audiovisual intelligibility. In our work, we generate the acoustic and visual components of the signal simultaneously. The bimodal signal is considered as one signal with two channels, acoustic and visual, and this bimodality is kept during the whole synthesis process. The setup is similar to a typical concatenative (acoustic-only) speech synthesis setup, with the difference that here the units to be concatenated consist of visual information alongside acoustic information. The concatenation unit adopted in our work is the diphone. The advantage of choosing diphones is that the major part of coarticulation phenomena is captured locally, in the middle of the unit, while concatenation is made at the boundaries, which are acoustically and visually steadier. This choice is in accordance with current practices in concatenative speech synthesis.
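One plausible way to score a candidate concatenation on both channels (an illustrative sketch, not the authors' cost function) is a join cost that combines an acoustic distance and a visual distance between the boundary frames of the two units, with a weight balancing the two channels.

```python
import numpy as np

def join_cost(left_unit, right_unit, visual_weight=0.5):
    """Score a candidate concatenation of two bimodal diphone units.

    `left_unit` / `right_unit` are dicts with:
      'acoustic_end', 'acoustic_start': spectral feature vectors (e.g. MFCCs)
      'visual_end', 'visual_start':     facial-parameter vectors
    taken at the unit boundaries, where concatenation takes place.  The
    default equal weighting of the two channels is an assumption made for
    illustration, not a value from the paper.
    """
    acoustic = np.linalg.norm(left_unit["acoustic_end"] - right_unit["acoustic_start"])
    visual = np.linalg.norm(left_unit["visual_end"] - right_unit["visual_start"])
    return (1.0 - visual_weight) * acoustic + visual_weight * visual
```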
European Signal Processing Conference | 1998
Yves Laprie; Vincent Colotte
Language Resources and Evaluation | 2014
Camille Fauth; Anne Bonneau; Frank Zimmerer; Juergen Trouvain; Bistra Andreeva; Vincent Colotte; Dominique Fohr; Denis Jouvet; Jeanin Jügler; Yves Laprie; Odile Mella; Bernd Möbius
European Signal Processing Conference | 2002
Vincent Colotte; Yves Laprie
Archive | 2013
Jürgen Trouvain; Yves Laprie; Bernd Möbius; Bistra Andreeva; Anne Bonneau; Vincent Colotte; Camille Fauth; Dominique Fohr; Denis Jouvet; Odile Mella; Jeanin Jügler; Frank Zimmerer
Archive | 2011
Anne Bonneau; Vincent Colotte