Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Thomas Hueber is active.

Publication


Featured research published by Thomas Hueber.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Speaker-adaptive acoustic-articulatory inversion using cascaded Gaussian mixture regression

Thomas Hueber; Laurent Girin; Xavier Alameda-Pineda; Gérard Bailly

This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a model using only the speaker's voice. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM). To address the speaker adaptation problem, we propose a new framework called cascaded Gaussian mixture regression (C-GMR) and derive two implementations. The first one, referred to as Split-C-GMR, is a straightforward chaining of two distinct GMRs: one mapping the acoustic features of the source speaker into the acoustic space of the reference speaker, and the other estimating the articulatory trajectories with the reference model. In the second implementation, referred to as Integrated-C-GMR, the two mapping steps are tied together in a single probabilistic model. For this latter model, we present the full derivation of the exact EM training algorithm, which explicitly exploits the missing-data methodology of machine learning. Other adaptation schemes based on maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR) and direct cross-speaker acoustic-to-articulatory GMR are also investigated. Experiments conducted on two speakers with different amounts of adaptation data demonstrate the benefit of the proposed C-GMR techniques.
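
For readers unfamiliar with Gaussian mixture regression, the estimator underlying both C-GMR variants can be sketched as the conditional expectation under a joint GMM. The notation below (x for acoustic features, y for articulatory features, K mixture components) is ours, not necessarily the paper's:

```latex
\hat{\mathbf{y}} = \mathbb{E}[\mathbf{y} \mid \mathbf{x}]
  = \sum_{k=1}^{K} \gamma_k(\mathbf{x})
    \left[ \boldsymbol{\mu}_k^{y}
         + \boldsymbol{\Sigma}_k^{yx} \left( \boldsymbol{\Sigma}_k^{xx} \right)^{-1}
           \left( \mathbf{x} - \boldsymbol{\mu}_k^{x} \right) \right],
\qquad
\gamma_k(\mathbf{x})
  = \frac{\pi_k \, \mathcal{N}\!\left( \mathbf{x}; \boldsymbol{\mu}_k^{x}, \boldsymbol{\Sigma}_k^{xx} \right)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left( \mathbf{x}; \boldsymbol{\mu}_j^{x}, \boldsymbol{\Sigma}_j^{xx} \right)}
```

In Split-C-GMR two such regressions are chained (source acoustics to reference acoustics, then reference acoustics to articulation), whereas Integrated-C-GMR ties the two steps into a single probabilistic model trained with the exact EM algorithm derived in the paper.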


PLOS Computational Biology | 2016

Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces

Florent Bocquelet; Thomas Hueber; Laurent Girin; Christophe Savariaux; Blaise Yvert

Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the positions of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained, as assessed by a perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between the new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.
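
As a rough illustration of the articulatory-to-acoustic mapping stage described above (a minimal sketch with assumed feature dimensions and layer sizes, not the authors' actual network or training code), a feed-forward DNN mapping EMA sensor coordinates to vocoder parameters could look like this:

```python
# Minimal sketch: feed-forward DNN mapping EMA articulatory features to
# vocoder acoustic parameters, one analysis frame at a time.
# Dimensions (18 EMA coordinates, 25 acoustic parameters) are illustrative assumptions.
import torch
import torch.nn as nn

class ArticulatoryToAcousticDNN(nn.Module):
    def __init__(self, n_ema=18, n_acoustic=25, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_ema, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_acoustic),  # spectral parameters sent to a vocoder
        )

    def forward(self, ema_frames):
        # ema_frames: (batch, n_ema) sensor coordinates for one frame
        return self.net(ema_frames)

# In a real-time loop, each incoming EMA frame would be mapped and vocoded immediately.
model = ArticulatoryToAcousticDNN()
acoustic_params = model(torch.randn(1, 18))
```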


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Biosignal-Based Spoken Communication: A Survey

Tanja Schultz; Michael Wand; Thomas Hueber; Dean J. Krusienski; Christian Herff; Jonathan S. Brumberg

Speech is a complex process involving a wide range of biosignals, including but not limited to acoustics. These biosignals—stemming from the articulators, the articulator muscle activities, the neural pathways, and the brain itself—can be used to circumvent limitations of conventional speech processing in particular, and to gain insights into the process of speech production in general. Research on biosignal-based speech processing is a wide and very active field at the intersection of various disciplines, ranging from engineering, computer science, electronics and machine learning to medicine, neuroscience, physiology, and psychology. Consequently, a variety of methods and approaches have been used to investigate the common goal of creating biosignal-based speech processing devices for communication applications in everyday situations and for speech rehabilitation, as well as gaining a deeper understanding of spoken communication. This paper gives an overview of the various modalities, research approaches, and objectives for biosignal-based spoken communication.


Instrumentation and Measurement Technology Conference | 2013

Vocal tract imaging system for post-laryngectomy voice replacement

Jun Cai; Thomas Hueber; Sotiris Manitsaris; Pierre Roussel; Lise Crevier-Buchman; Maureen Stone; Claire Pillot-Loiseau; Gérard Chollet; Gérard Dreyfus; Bruce Denby

The article describes a system that uses real-time measurements of the vocal tract to drive a voice-replacement system for post-laryngectomy patients. Based on a thermoformed acquisition helmet, a miniature ultrasound machine, and a video camera, and incorporating Hidden Markov Model speech recognition, the device has been tested on three speakers, one of whom has undergone a total laryngectomy. Results show that the device obtains usable recognition rates, and that performance on normal and post-laryngectomy speakers is nearly identical. The technique can also enable voice communication for normal speakers in situations where silence must be maintained.


9th International Summer Workshop on Multimodal Interfaces (eNTERFACE) | 2013

Reactive Statistical Mapping: Towards the Sketching of Performative Control with Data

Nicolas d’Alessandro; Joëlle Tilmanne; Maria Astrinaki; Thomas Hueber; Rasmus Dall; Thierry Ravet; Alexis Moinet; Hüseyin Çakmak; Onur Babacan; Adela Barbulescu; Valentin Parfait; Victor Huguenin; Emine Sümeyye Kalaycı; Qiong Hu

This paper presents the results of our participation in the ninth eNTERFACE workshop on multimodal user interfaces. Our target for this workshop was to bring some technologies currently used in speech recognition and synthesis to a new level, i.e., to make them the core of a new HMM-based mapping system. The idea of statistical mapping has been investigated, more precisely how to use Gaussian Mixture Models and Hidden Markov Models for real-time and reactive generation of new trajectories from input labels, and for real-time regression in a continuous-to-continuous use case. As a result, we have developed several proofs of concept, including an incremental speech synthesiser, software for exploring stylistic spaces for gait and facial motion in real time, reactive audiovisual laughter synthesis, and a prototype demonstrating the real-time reconstruction of lower-body gait motion strictly from upper-body motion, with conservation of the stylistic properties. This project has been the opportunity to formalise HMM-based mapping, integrate several of these innovations into the Mage library, and explore the development of a real-time gesture recognition tool.


Non-Linear Speech Processing | 2007

Some experiments in audio-visual speech processing

Gérard Chollet; R. Landais; Thomas Hueber; Hervé Bredin; Chafic Mokbel; Patrick Perrot; Leila Zouari

Natural speech is produced by the vocal organs of a particular talker. The acoustic features of the speech signal must therefore be correlated with the movements of the articulators (lips, jaw, tongue, velum, ...). For instance, hearing-impaired people (and not only them) improve their understanding of speech by lip reading. This chapter is an overview of audiovisual speech processing, with emphasis on some experiments concerning recognition, speaker verification, indexing and corpus-based synthesis from tongue and lip movements.


Clinical Linguistics & Phonetics | 2018

Speech recovery and language plasticity can be facilitated by Sensori-Motor Fusion training in chronic non-fluent aphasia. A case report study

Célise Haldin; Audrey Acher; Louise Kauffmann; Thomas Hueber; Emilie Cousin; Pierre Badin; Pascal Perrier; Diandra Fabre; D. Pérennou; Olivier Detante; Assia Jaillard; Hélène Lœvenbruck; Monica Baciu

The rehabilitation of speech disorders benefits from providing visual information which may improve speech motor plans in patients. We tested the proof of concept of a rehabilitation method (Sensori-Motor Fusion, SMF; Ultraspeech player) in one post-stroke patient presenting chronic non-fluent aphasia. SMF allows visualisation by the patient of target tongue and lip movements using high-speed ultrasound and video imaging. This can improve the patient's awareness of his/her own lingual and labial movements, which can, in turn, improve the representation of articulatory movements and increase the ability to coordinate and combine articulatory gestures. The auditory and oro-sensory feedback received by the patient as a result of his/her own pronunciation can be integrated with the target articulatory movements they watch. Thus, this method is founded on sensorimotor integration during speech. The SMF effect on this patient was assessed through qualitative comparison of language scores and quantitative analysis of acoustic parameters measured in a speech production task, before and after rehabilitation. We also investigated cerebral patterns of language reorganisation for rhyme detection and syllable repetition, to evaluate the influence of SMF on phonological-phonetic processes. Our results showed that SMF had a beneficial effect on this patient, who qualitatively improved in naming, reading, word repetition and rhyme judgment tasks. Quantitative measurements of acoustic parameters indicate that the patient's production of vowels and syllables also improved. Compared with pre-SMF, the fMRI data in the post-SMF session revealed the activation of cerebral regions related to articulatory, auditory and somatosensory processes, which were expected to be recruited by SMF. We discuss the neurocognitive and linguistic mechanisms which may explain speech improvement after SMF, as well as the advantages of using this speech rehabilitation method.


International Conference on Acoustics, Speech, and Signal Processing | 2017

Feature extraction using multimodal convolutional neural networks for visual speech recognition

Eric Tatulli; Thomas Hueber

This article addresses the problem of continuous speech recognition from visual information only, without exploiting any audio signal. Our approach combines a video camera and an ultrasound imaging system to simultaneously monitor the speaker's lips and the movements of the tongue. We investigate the use of convolutional neural networks (CNNs) to extract visual features directly from the raw ultrasound and video images. We propose different architectures, among which a multimodal CNN that jointly processes the two visual modalities. Combined with an HMM-GMM decoder, the CNN-based approach outperforms our previous baseline based on Principal Component Analysis. Importantly, the recognition accuracy is only 4% lower than that obtained when decoding the audio signal, which makes it a good candidate for a practical visual speech recognition system.
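
The multimodal idea can be sketched as one convolutional branch per visual stream whose outputs are concatenated into a joint feature vector for the HMM-GMM decoder. The layer sizes and image shapes below are illustrative assumptions, not the paper's exact architecture:

```python
# Illustrative multimodal CNN: separate convolutional branches for ultrasound
# (tongue) and video (lips) images, fused into one visual feature vector.
import torch
import torch.nn as nn

def conv_branch(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),        # -> 32 * 4 * 4 features per image
    )

class MultimodalCNN(nn.Module):
    def __init__(self, n_features=64):
        super().__init__()
        self.ultrasound = conv_branch(1)              # grayscale ultrasound frame
        self.video = conv_branch(1)                   # grayscale lip image
        self.fusion = nn.Linear(2 * 32 * 4 * 4, n_features)

    def forward(self, us_img, lip_img):
        joint = torch.cat([self.ultrasound(us_img), self.video(lip_img)], dim=1)
        return self.fusion(joint)                     # features passed to the HMM-GMM decoder

features = MultimodalCNN()(torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```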


Speech Communication | 2017

Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract

Diandra Fabre; Thomas Hueber; Laurent Girin; Xavier Alameda-Pineda; Pierre Badin

Visual biofeedback is the process of gaining awareness of physiological functions through the display of visual information. Where speech is concerned, visual biofeedback usually consists of showing a speaker his/her own articulatory movements, which has proven useful in applications such as speech therapy or second language learning. This article presents a novel method for automatically animating an articulatory tongue model from ultrasound images. Integrating this model into a virtual talking head makes it possible to overcome the limitations of displaying raw ultrasound images, and provides a more complete and user-friendly feedback by showing not only the tongue, but also the palate, teeth, pharynx, etc. Altogether, these cues are expected to lead to an easier understanding of the tongue movements. Our approach is based on a probabilistic model which converts raw ultrasound images of the vocal tract into control parameters of the articulatory tongue model. We investigated several mapping techniques, such as Gaussian Mixture Regression (GMR) and, in particular, the Cascaded Gaussian Mixture Regression (C-GMR) technique recently proposed in the context of acoustic-articulatory inversion. Both techniques are evaluated on a multispeaker database. The C-GMR consists of adapting a GMR reference model, trained with a large dataset of multimodal articulatory data from a reference speaker, to a new source speaker using a small set of adaptation data recorded during a preliminary enrollment session (system calibration). By using prior information from the reference model, the C-GMR approach is able (i) to maintain good mapping performance while minimizing the amount of adaptation data (and thus limiting the duration of the enrollment session), and (ii) to generalize better than the GMR approach to articulatory configurations not seen during enrollment. As a result, the C-GMR appears to be a good mapping technique for a practical visual biofeedback system.
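
To make the GMR step concrete, the sketch below (our own illustration under assumed feature dimensions, not the authors' code; the C-GMR adaptation step is not shown) fits a joint GMM on stacked ultrasound-feature / tongue-model-parameter vectors and maps new ultrasound features through the conditional expectation:

```python
# Minimal GMR sketch: fit a joint GMM on [ultrasound features, tongue-model
# parameters] and predict parameters for a new ultrasound frame.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8):
    # X: (n_samples, dx) ultrasound features; Y: (n_samples, dy) tongue-model parameters
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(np.hstack([X, Y]))

def gmr_predict(gmm, x, dx):
    # Conditional expectation E[y | x] under the joint GMM; x is a vector of length dx.
    w, mu, S = gmm.weights_, gmm.means_, gmm.covariances_
    resp = np.array([wk * multivariate_normal.pdf(x, m[:dx], Sk[:dx, :dx])
                     for wk, m, Sk in zip(w, mu, S)])
    resp /= resp.sum()
    y = np.zeros(mu.shape[1] - dx)
    for rk, m, Sk in zip(resp, mu, S):
        y += rk * (m[dx:] + Sk[dx:, :dx] @ np.linalg.solve(Sk[:dx, :dx], x - m[:dx]))
    return y
```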


Journal of Cognitive Neuroscience | 2017

Inside speech: Multisensory and modality-specific processing of tongue and lip speech actions

Avril Treille; Coriandre Vilain; Thomas Hueber; Laurent Lamalle; Marc Sato

Action recognition has been found to rely not only on sensory brain areas but also partly on the observer's motor system. However, whether distinct auditory and visual experiences of an action modulate sensorimotor activity remains largely unknown. In the present sparse-sampling fMRI study, we determined to what extent sensory and motor representations interact during the perception of tongue and lip speech actions. Tongue and lip speech actions were selected because tongue movements of our interlocutor are accessible via their impact on speech acoustics but are not visible, owing to the tongue's position inside the vocal tract, whereas lip movements are both “audible” and visible. Participants were presented with auditory, visual, and audiovisual speech actions, with the visual inputs related to either a sagittal view of the tongue movements or a facial view of the lip movements of a speaker, previously recorded by an ultrasound imaging system and a video camera. Although the neural networks involved in visuolingual and visuofacial perception largely overlapped, stronger motor and somatosensory activations were observed during visuolingual perception. In contrast, stronger activity was found in auditory and visual cortices during visuofacial perception. Complementing these findings, activity in the left premotor cortex and in visual brain areas was found to correlate with the visual recognition scores observed for visuolingual and visuofacial speech stimuli, respectively, whereas visual activity correlated with reaction times for both stimuli. These results suggest that unimodal and multimodal processing of lip and tongue speech actions rely on common sensorimotor brain areas. They also suggest that visual processing of audible but not visible movements induces motor and visual mental simulation of the perceived actions to facilitate recognition and/or to learn the association between auditory and visual signals.

Collaboration


Dive into Thomas Hueber's collaborations.

Top Co-Authors

Pierre Badin
Centre national de la recherche scientifique

Gérard Bailly
Centre national de la recherche scientifique

Diandra Fabre
Centre national de la recherche scientifique

Frédéric Elisei
Centre national de la recherche scientifique

Pierre Roussel
Centre national de la recherche scientifique

Atef Ben Youssef
Grenoble Institute of Technology