Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Gérard Bailly is active.

Publication


Featured research published by Gérard Bailly.


Journal of Phonetics | 2002

Three-dimensional linear articulatory modeling of tongue, lips and face, based on MRI and video images.

Pierre Badin; Gérard Bailly; Lionel Revéret; Monica Baciu; Christoph Segebarth; Christophe Savariaux

In this study, previous articulatory midsagittal models of the tongue and lips are extended to full three-dimensional models. The geometry of these vocal organs is measured on one subject uttering a corpus of sustained articulations in French. The 3D data are obtained from magnetic resonance imaging of the tongue, and from front and profile video images of the subject's face marked with small beads. The degrees of freedom of the articulators, i.e., the uncorrelated linear components needed to represent the 3D coordinates of these articulators, are extracted from these data by linear component analysis. In addition to a common jaw height parameter, the tongue is controlled by four parameters, while the lips and face are also driven by four parameters. These parameters are for the most part extracted from the midsagittal contours and are clearly interpretable in phonetic/biomechanical terms. This implies that most 3D features such as the tongue groove or lateral channels can be controlled by articulatory parameters defined for the midsagittal model. Similarly, the 3D geometry of the lips is determined by parameters such as lip protrusion or aperture that can be measured from a profile view of the face.
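A minimal numpy sketch of the kind of linear decomposition described above, using synthetic data in place of the MRI and video measurements; the array shapes, parameter counts and variable names are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: extracting uncorrelated linear components from 3D
# flesh-point coordinates with principal component analysis (via SVD).
# The data here are synthetic; shapes and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_articulations, n_points = 40, 63                    # e.g. sustained articulations, flesh points
X = rng.normal(size=(n_articulations, n_points * 3))  # flattened (x, y, z) coordinates

# Center the data and extract orthogonal deformation modes.
mean_shape = X.mean(axis=0)
Xc = X - mean_shape
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

n_params = 4                                # e.g. four tongue parameters besides jaw height
scores = U[:, :n_params] * s[:n_params]     # control parameter values per articulation
basis = Vt[:n_params]                       # deformation modes of the 3D mesh

# Any articulation is then approximated as the mean shape plus a weighted
# sum of the deformation modes.
reconstructed = mean_shape + scores @ basis
explained = (s[:n_params] ** 2).sum() / (s ** 2).sum()
print(f"variance explained by {n_params} components: {explained:.1%}")
```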


International Journal of Speech Technology | 2003

Audiovisual Speech Synthesis

Gérard Bailly; Maxime Berar; Frédéric Elisei; Matthias Odisio

This paper presents the main approaches used to synthesize talking faces and provides greater detail on a handful of them. An attempt is made to distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and illustrated by brief descriptions of the most representative existing systems. The challenging issues of evaluation, data acquisition and modeling that may drive future models are also discussed and illustrated by our current work at ICP.


Speech Communication | 1997

Learning to speak. Sensori-motor control of speech movements

Gérard Bailly

This paper shows how an articulatory model, able to produce acoustic signals from articulatory motion, can learn to speak, i.e., coordinate its movements in such a way that it utters meaningful sequences of sounds belonging to a given language. This complex learning procedure is accomplished in four major steps: (a) a babbling phase, where the device builds up a model of the forward transform, i.e., the articulatory-to-audio-visual mapping; (b) an imitation stage, where it tries to reproduce a limited set of sound sequences by audio-visual-to-articulatory inversion; (c) a “shaping” stage, where phonemes are associated with the most efficient available sensori-motor representation; and finally, (d) a “rhythmic” phase, where it learns the appropriate coordination of the activations of these sensori-motor targets.
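A toy sketch of the babbling and imitation steps under stated assumptions: a small quadratic function stands in for the real articulatory synthesizer, a linear regression plays the role of the learned forward model, and imitation is a plain gradient search on the predicted acoustic error. None of this reproduces the paper's actual model.

```python
# Hypothetical sketch of the babbling/imitation loop: learn a forward
# articulatory-to-acoustic mapping from random "babbling", then invert it by
# local search to imitate a target sound.
import numpy as np

rng = np.random.default_rng(1)

def vocal_tract(articulation):
    """Toy stand-in for the articulatory-to-acoustic transform (assumption)."""
    a = np.asarray(articulation)
    return np.array([a[0] + 0.5 * a[1] ** 2, np.tanh(a[1] - a[2]), a[2] * a[0]])

# (a) Babbling: sample random articulations and record the resulting "sounds".
A = rng.uniform(-1, 1, size=(2000, 3))
S = np.array([vocal_tract(a) for a in A])

# Forward model: least-squares regression from articulation to acoustics.
W, *_ = np.linalg.lstsq(np.c_[A, np.ones(len(A))], S, rcond=None)

# (b) Imitation: invert the learned forward model for a heard target sound by
# gradient descent on the predicted acoustic error.
target = vocal_tract([0.3, -0.2, 0.6])
a = np.zeros(3)
for _ in range(500):
    pred = np.r_[a, 1.0] @ W
    grad = 2 * W[:3] @ (pred - target)       # d||pred - target||^2 / d a
    a -= 0.05 * grad
print("recovered articulation (approximate, via the linear model):", np.round(a, 2))
```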


Speech Communication | 2010

Can you 'read' tongue movements? Evaluation of the contribution of tongue display to speech understanding

Pierre Badin; Yuliya Tarabalka; Frédéric Elisei; Gérard Bailly

Lip reading relies on the visible articulators to ease speech understanding. However, the lips and face alone provide very incomplete phonetic information: the tongue, which is generally not visible, carries an important part of the articulatory information that is not accessible through lip reading. The question is thus whether direct and full vision of the tongue allows tongue reading. We have therefore generated a set of audiovisual VCV stimuli with an audiovisual talking head that can display all speech articulators, including the tongue, in an augmented speech mode. The talking head is a virtual clone of a human speaker, and the articulatory movements were captured on this speaker using ElectroMagnetic Articulography (EMA). These stimuli were played to subjects in audiovisual perception tests under various presentation conditions (audio signal alone, audiovisual signal with a profile cutaway display with or without the tongue, complete face) and at various signal-to-noise ratios. The results indicate: (1) the possibility of implicit learning of tongue reading; (2) better consonant identification with the cutaway presentation with the tongue than without it; (3) no significant difference between the cutaway presentation with the tongue and the more ecological rendering of the complete face; (4) a predominance of lip reading over tongue reading; but (5) a certain natural human capability for tongue reading when the audio signal is strongly degraded or absent. We conclude that these tongue reading capabilities could be used in speech therapy for children with speech disorders, in perception and production rehabilitation for hearing-impaired children, and in pronunciation training for second-language learners.
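A small sketch, with invented response records, of how per-condition consonant identification rates at each signal-to-noise ratio might be tabulated from such perception tests; the condition labels and data are illustrative only.

```python
# Hypothetical sketch: tabulating consonant identification rates per
# presentation condition and SNR from perception-test responses (invented data).
from collections import defaultdict

responses = [
    # (condition, snr_db, target_consonant, answered_consonant)
    ("audio_only",        -9, "k", "t"),
    ("cutaway_no_tongue", -9, "k", "k"),
    ("cutaway_tongue",    -9, "k", "k"),
    ("complete_face",     -9, "k", "p"),
    ("cutaway_tongue",     0, "d", "d"),
    ("audio_only",         0, "d", "d"),
]

totals, correct = defaultdict(int), defaultdict(int)
for condition, snr, target, answer in responses:
    totals[(condition, snr)] += 1
    correct[(condition, snr)] += int(answer == target)

for key in sorted(totals):
    rate = correct[key] / totals[key]
    print(f"{key[0]:>18s}  SNR {key[1]:+3d} dB  identification {rate:.0%}")
```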


Speech Communication | 2010

Gaze, conversational agents and face-to-face communication

Gérard Bailly; Stephan Raidt; Frédéric Elisei

In this paper, we describe two series of experiments that examine audiovisual face-to-face interaction between naive human viewers and either a human interlocutor or a virtual conversational agent. The main objective is to analyze the interplay between speech activity and mutual gaze patterns during mediated face-to-face interactions. We first quantify the impact of the deictic gaze patterns of our agent. We then refine our experimental knowledge of mutual gaze patterns during human face-to-face interaction using new technological devices such as non-invasive eye trackers and pinhole cameras, and quantify the impact of a selection of cognitive states and communicative functions on the recorded gaze patterns.
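An illustrative sketch, on invented per-frame annotations, of how gaze towards the interlocutor's face regions could be cross-tabulated with speech activity; the labels and the idea of frame-level annotation are assumptions, not the paper's protocol.

```python
# Illustrative sketch: relating gaze targets to speech activity from
# synchronized frame-level annotations (invented data).
import numpy as np

gaze = np.array(["eyes", "mouth", "eyes", "away", "mouth", "mouth", "eyes", "eyes"])
state = np.array(["listening", "listening", "speaking", "speaking",
                  "speaking", "listening", "speaking", "listening"])

for s in ("speaking", "listening"):
    frames = gaze[state == s]
    on_face = np.isin(frames, ["eyes", "mouth"]).mean()
    print(f"{s:>9s}: gaze on face {on_face:.0%}, on mouth {(frames == 'mouth').mean():.0%}")
```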


Speech Communication | 1994

Characterisation of rhythmic patterns for text-to-speech synthesis

Plínio Almeida Barbosa; Gérard Bailly

This article proposes an alternative rhythmic unit to the syllable: the inter-perceptual-center group (IPCG). This group is delimited by events that can be detected using only acoustic correlates (Pompino-Marschall, 1989). The rhythmic patterns of French are described using this characterisation: we show that the realisation of accents is gradual over the trailing accentual group and that this gradual lengthening is needed for perception. A model that distributes the IPCG duration among its segmental constituents, incorporating the automatic generation of pauses (occurrence and duration) according to speech rate, is then described.
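An illustrative sketch of one common way to distribute a target IPCG duration among its segments, a z-score-style elasticity scheme solved by bisection; this is not the paper's exact model, and the per-segment statistics are invented.

```python
# Illustrative sketch (not the paper's model): distributing a target IPCG
# duration among its segments with a shared z-score, solved by bisection.
import numpy as np

segments = ["b", "a", "l"]                               # hypothetical segment labels
mean_log = np.array([np.log(0.07), np.log(0.12), np.log(0.06)])  # mean log-durations (s)
std_log = np.array([0.15, 0.25, 0.12])                   # elasticity of each segment

def distribute(target_duration, z_lo=-3.0, z_hi=3.0, tol=1e-4):
    """Find the common z that makes the segment durations sum to the target."""
    for _ in range(60):
        z = 0.5 * (z_lo + z_hi)
        durations = np.exp(mean_log + z * std_log)
        if durations.sum() > target_duration:
            z_hi = z
        else:
            z_lo = z
        if abs(durations.sum() - target_duration) < tol:
            break
    return durations

print(np.round(distribute(0.30), 3))   # lengthened group, e.g. before a pause
print(np.round(distribute(0.22), 3))   # faster speech rate
```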


Speech Communication | 2005

SFC: A trainable prosodic model

Gérard Bailly; Bleicke Holm

This paper introduces a new model-constrained and data-driven system to generate prosody from metalinguistic information. This system considers the prosodic continuum as the superposition of multiple elementary overlapping multiparametric contours. These contours encode specific metalinguistic functions associated with various discourse units. We describe the phonological model underlying the system and its specific implementation in the trainable prosodic model described here. The way prosody is analyzed, decomposed and modelled is illustrated by experimental work. In particular, we describe the original training procedure that enables the system to identify the elementary contours and to separate out their contributions to the prosodic contours of the training data.
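A minimal sketch of the superposition principle described above: elementary contours, each attached to a discourse unit and its function, are simply summed over the syllables they span. The contour shapes and scopes are invented, and the training procedure that identifies such contours is not shown.

```python
# Minimal sketch of the superposition idea: the prosodic contour of an
# utterance is the sum of elementary contours, each spanning the syllables of
# the discourse unit that carries its function (invented shapes and scopes).
import numpy as np

n_syllables = 10
f0 = np.zeros(n_syllables)                   # per-syllable pitch contribution (semitones)

# (function, first syllable, last syllable, elementary contour over that scope)
contours = [
    ("declaration",  0, 9, np.linspace(2.0, -4.0, 10)),   # global declination
    ("emphasis",     3, 5, np.array([1.0, 3.0, 0.5])),    # local prominence
    ("continuation", 7, 9, np.array([0.0, 1.0, 2.5])),    # rise at a boundary
]

for function, start, end, shape in contours:
    f0[start:end + 1] += shape               # overlapping contributions superpose

print(np.round(f0, 2))
```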


Journal of the Acoustical Society of America | 2001

Linear degrees of freedom in speech production: Analysis of cineradio- and labio-film data and articulatory-acoustic modeling

Denis Beautemps; Pierre Badin; Gérard Bailly

The following contribution addresses several issues concerning the degrees of freedom of speech production in French oral vowels and stop and fricative consonants, based on an analysis of tongue and lip shapes extracted from cineradio- and labio-films. The midsagittal tongue shapes have been submitted to a linear decomposition in which some of the loading factors, such as jaw and larynx position, were selected, while four other components were derived from principal component analysis (PCA). For the lips, in addition to the more traditional protrusion and opening components, a supplementary component was extracted to explain the upward movement of both the upper and lower lips in [v] production. A linear articulatory model was developed; the six tongue degrees of freedom were used as the articulatory control parameters of the midsagittal tongue contours and explained 96% of the tongue data variance. These control parameters were also used to specify the frontal lip width dimension derived from the labio-film front views. Finally, this model was complemented by a conversion model from the midsagittal contours to the area function, based on a fitting of the midsagittal distances and the formant frequencies for both vowels and consonants.
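A hypothetical numpy sketch of a "guided" linear decomposition of this kind: a measured factor (here a jaw variable) is regressed out of the centred contours first, and PCA then extracts further components from the residue. The data are synthetic and the variable names are assumptions.

```python
# Hypothetical sketch: regress a chosen factor (jaw position) out of the
# contour data, then apply PCA to the residual to obtain the remaining,
# jaw-independent components (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_contour_points = 200, 30
jaw = rng.normal(size=(n_frames, 1))                        # measured jaw height
tongue = jaw @ rng.normal(size=(1, n_contour_points)) \
         + rng.normal(size=(n_frames, n_contour_points))    # synthetic contours

# 1. Regress the centred contours on the jaw factor and keep the residual.
Xc = tongue - tongue.mean(axis=0)
coef, *_ = np.linalg.lstsq(jaw, Xc, rcond=None)
fitted = jaw @ coef
residual = Xc - fitted

# 2. PCA on the residual yields the further components.
U, s, Vt = np.linalg.svd(residual, full_matrices=False)
n_extra = 4
explained = (np.sum(fitted ** 2) + np.sum(s[:n_extra] ** 2)) / np.sum(Xc ** 2)
print(f"jaw factor + {n_extra} PCA components explain {explained:.1%} of the variance")
```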


Speech Communication | 2001

Generating prosodic attitudes in French: data, model and evaluation

Yann Morlec; Gérard Bailly; Véronique Aubergé

A corpus of 322 syntactically balanced sentences uttered by one speaker with six different prosodic attitudes is analysed. The syntactic and phonotactic structures of the sentences are systematically varied in order to understand how two functions can be carried out in parallel in the prosodic continuum: (1) enunciative: demarcation of constituents; (2) illocutory: the speaker's attitude. The statistical analysis of the corpus demonstrates that global prototypical prosodic contours characterise each attitude. Such global encoding is consistent with gating experiments showing that attitudes can be discriminated very early in utterances. These results are discussed in relation to a morphological and superpositional model of intonation. This model proposes that the information specific to each linguistic level (structure, hierarchy of constituents, semantic and pragmatic attributes) is encoded via superposed multiparametric contours. An implementation of this model is described that automatically captures and generates these prototypical prosodic contours. This implementation consists of parallel recurrent neural networks, each responsible for the encoding of one linguistic level. The identification rates of attitudes for both training and test synthetic utterances are similar to those for natural stimuli. We conclude that the study of discourse-level linguistic attributes such as prosodic attitudes is a valuable paradigm for comparing intonation models.
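A schematic sketch of the architecture described above, with one tiny recurrent network per linguistic level and their outputs superposed into a single contour. The weights are random and the input codings are invented, whereas the real system is trained on the attitude corpus.

```python
# Schematic sketch: one small recurrent network per linguistic level, each
# reading its own input sequence; the prosodic contour is the sum of their
# outputs (untrained, random weights; invented input codings).
import numpy as np

rng = np.random.default_rng(3)

def rnn_contour(inputs, hidden=8, seed=0):
    """Forward pass of a tiny Elman network producing one value per syllable."""
    r = np.random.default_rng(seed)
    Wx = r.normal(scale=0.3, size=(inputs.shape[1], hidden))
    Wh = r.normal(scale=0.3, size=(hidden, hidden))
    Wo = r.normal(scale=0.3, size=(hidden, 1))
    h = np.zeros(hidden)
    out = []
    for x in inputs:                          # one step per syllable
        h = np.tanh(x @ Wx + h @ Wh)
        out.append(float(h @ Wo))
    return np.array(out)

n_syllables = 8
attitude_codes = rng.normal(size=(n_syllables, 3))     # e.g. attitude identity + position
constituent_codes = rng.normal(size=(n_syllables, 4))  # e.g. constituent-structure features

# Each level contributes its own contour; the contours superpose.
f0_contour = rnn_contour(attitude_codes, seed=1) + rnn_contour(constituent_codes, seed=2)
print(np.round(f0_contour, 2))
```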


Journal of the Acoustical Society of America | 2005

Analysis and synthesis of the three-dimensional movements of the head, face, and hand of a speaker using cued speech.

Guillaume Gibert; Gérard Bailly; Denis Beautemps; Frédéric Elisei; Rémi Brun

In this paper we present efforts to characterize the three-dimensional (3D) movements of the right hand and the face of a French female speaker during the audiovisual production of cued speech. The 3D trajectories of 50 hand and 63 facial flesh points during the production of 238 utterances were analyzed. These utterances were carefully designed to cover all possible diphones of the French language. Linear and nonlinear statistical models of the articulations and postures of the hand and the face were developed using separate and joint corpora. Automatic recognition of hand and face postures at targets was performed to verify a posteriori that the key hand movements and postures imposed by cued speech had been well realized by the subject. The recognition results were further exploited to study the phonetic structure of cued speech, notably the phasing relations between hand gestures and sound production. The hand and face gestural scores are studied in relation to the acoustic segmentation. Finally, a first implementation of a concatenative audiovisual text-to-cued-speech synthesis system is described that exploits these unique and extensive data on cued speech in action.
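A small sketch, on invented timestamps, of how the phasing between hand-posture targets and the acoustic segmentation could be measured as the lead or lag of each hand target relative to the corresponding acoustic onset; the values below are not from the paper.

```python
# Hypothetical sketch: phasing of hand targets relative to acoustic onsets,
# computed as per-segment lags (invented timestamps, in seconds).
import numpy as np

acoustic_onsets = np.array([0.42, 0.71, 1.05, 1.38, 1.70])  # onset of each cued segment
hand_targets    = np.array([0.30, 0.58, 0.94, 1.29, 1.55])  # hand reaches target posture

lags = hand_targets - acoustic_onsets        # negative = hand anticipates the sound
print(f"mean lead of the hand: {-lags.mean() * 1000:.0f} ms "
      f"(std {lags.std() * 1000:.0f} ms)")
```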

Collaboration


Dive into Gérard Bailly's collaborations.

Top Co-Authors

Frédéric Elisei (Centre national de la recherche scientifique)
Pierre Badin (Centre national de la recherche scientifique)
Thomas Hueber (Centre national de la recherche scientifique)
Atef Ben Youssef (Grenoble Institute of Technology)
Maxime Berar (Centre national de la recherche scientifique)
Pascal Perrier (Centre national de la recherche scientifique)
Oxana Govokhina (Grenoble Institute of Technology)