Frédéric Berthommier
Centre national de la recherche scientifique
Publications
Featured research published by Frédéric Berthommier.
EURASIP Journal on Advances in Signal Processing | 2002
Martin Heckmann; Frédéric Berthommier; Kristian Kroschel
It has been shown that integration of acoustic and visual information especially in noisy conditions yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
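The following is a minimal sketch of the kind of noise-adaptive stream weighting discussed above. The reliability measure (entropy of the audio posteriors) and its linear mapping to the audio weight are illustrative assumptions, not the exact criterion or mapping derived in the paper.

```python
# Noise-adaptive weighting of audio and video phonetic posteriors (illustrative).
import numpy as np

def audio_weight(p_audio, eps=1e-12):
    """Map audio-stream reliability to a weight in [0, 1] via normalised entropy."""
    p = np.clip(p_audio, eps, 1.0)
    h = -np.sum(p * np.log(p)) / np.log(p.size)   # high entropy = unreliable audio
    return 1.0 - h                                # confident audio -> weight near 1

# Per-frame phonetic posteriors from the audio and video ANNs (toy values)
p_audio = np.array([0.80, 0.15, 0.05])   # low entropy: clean audio
p_video = np.array([0.40, 0.40, 0.20])

lam = audio_weight(p_audio)
fused = (p_audio ** lam) * (p_video ** (1.0 - lam))   # geometric combination
fused /= fused.sum()
print(lam, fused)
```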
Hearing Research | 1995
Christian Lorenzi; Christophe Micheyl; Frédéric Berthommier
The goal of the present paper is to relate the coding of amplitude modulation (AM) in the auditory pathway to the behavioral detection performance. To address this issue, the detectability of AM was estimated by modelling a single neuron located in the central nucleus of the inferior colliculus (IC). The computational model is based on cochlear nucleus responses and a coincidence detection mechanism. The model replicated the main feature of the neuronal AM transfer function, namely a bandpass function. The IC-unit model was initially tuned to a 200-Hz modulation frequency. A single neurometric function for AM detection at this modulation frequency was generated using a 2-interval, 2-alternative forced-choice paradigm. On each trial of the experiments, AM was taken to be correctly detected by the model if the number of spikes in response to the modulated signal exceeded the number of spikes in an otherwise identical interval that contained an unmodulated signal. Psychometric functions for 4 human subjects were also measured under the same stimulus conditions. Comparison of the simulated neurometric and psychometric functions suggested that there was sufficient information in the rate response of an IC neuron well-tuned in the modulation-frequency domain to support behavioral detection performance.
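Below is an illustrative sketch of the 2-interval, 2-alternative forced-choice decision rule described above, using Poisson spike counts as a stand-in for the full cochlear-nucleus/coincidence-detector IC model; the firing rates and their dependence on modulation depth are assumed values, not the model's output.

```python
# Neurometric 2I-2AFC decision rule on simulated spike counts (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def percent_correct(rate_mod, rate_unmod, n_trials=1000):
    """Proportion of trials on which the modulated interval yields more spikes."""
    mod = rng.poisson(rate_mod, n_trials)
    unmod = rng.poisson(rate_unmod, n_trials)
    correct = (mod > unmod) + 0.5 * (mod == unmod)   # ties resolved by guessing
    return correct.mean()

# Neurometric function: percent correct vs. modulation depth (assumed rate mapping)
for depth in [0.05, 0.1, 0.2, 0.4, 0.8]:
    rate_mod = 20.0 * (1.0 + 2.0 * depth)   # hypothetical rate increase with depth
    print(depth, percent_correct(rate_mod, rate_unmod=20.0))
```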
Conference of the International Speech Communication Association | 2002
Seungjin Choi; Heonseok Hong; Hervé Glotin; Frédéric Berthommier
This paper addresses a method of multichannel signal separation (MSS) with its application to cocktail party speech recognition. First, we present a fundamental principle for multichannel signal separation which uses the spatial independence of located sources as well as the temporal dependence of speech signals. Second, for practical implementation of the signal separation filter, we consider a dynamic recurrent network and develop a simple new learning algorithm. The performance of the proposed method is evaluated in terms of word recognition error rate (WER) in a large speech recognition experiment. The results show that our proposed method dramatically improves the word recognition performance in the case of two simultaneous speech inputs, and that a timing effect is involved in the segregation process.
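The sketch below is not the recurrent-network algorithm of the paper; it is a minimal second-order (AMUSE-style) illustration of the same principle, namely that spatially independent sources with distinct temporal structure can be separated by jointly decorrelating the mixtures at lag 0 and at a nonzero lag.

```python
# Second-order separation of an instantaneous two-channel mixture (illustrative).
import numpy as np

def separate_two_sources(X, lag=1):
    """X: (channels, samples) instantaneous mixture. Returns source estimates."""
    X = X - X.mean(axis=1, keepdims=True)
    C0 = X @ X.T / X.shape[1]                     # zero-lag covariance
    d, E = np.linalg.eigh(C0)
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T       # whitening matrix
    Z = W @ X
    C = Z[:, :-lag] @ Z[:, lag:].T / (Z.shape[1] - lag)
    C = 0.5 * (C + C.T)                           # symmetrised lagged covariance
    _, V = np.linalg.eigh(C)
    return V.T @ Z                                # sources, up to order and scale

# Toy example: two sources with different temporal structure, mixed linearly
rng = np.random.default_rng(0)
n = 10000
s1 = np.sin(2 * np.pi * 0.01 * np.arange(n))                              # slow periodic source
s2 = np.convolve(rng.standard_normal(n), np.ones(20) / 20, mode="same")   # smoothed noise
A = np.array([[1.0, 0.6], [0.5, 1.0]])                                    # mixing matrix
print(separate_two_sources(A @ np.vstack([s1, s2])).shape)
```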
International Conference on Acoustics, Speech, and Signal Processing | 2001
Martin Heckmann; Frédéric Berthommier; Kristian Kroschel
We investigate the fusion of audio and video a posteriori phonetic probabilities in a hybrid ANN/HMM audio-visual speech recognition system. Three basic conditions on the fusion process are stated and implemented in a linear and a geometric weighting scheme. These conditions are the assumption of conditional independence of the audio and video data and the contribution of only one of the two paths when the SNR is very high or very low, respectively. In the case of the geometric weighting, a new weighting scheme is developed, whereas the linear weighting follows the full combination approach as employed in multi-stream recognition. We compare these two new concepts in audio-visual recognition to a rather standard approach known from the literature. Recognition tests were performed in a continuous number recognition task on a single-speaker database containing 1712 utterances with two different types of noise added.
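For concreteness, here is a minimal sketch of the two combination rules named above, applied to per-frame phonetic posteriors with an audio weight `lam` in [0, 1]; the variable names and toy values are illustrative, not taken from the paper.

```python
# Linear vs. geometric weighting of audio and video posteriors (illustrative).
import numpy as np

def fuse_linear(p_audio, p_video, lam):
    """Linear (sum) weighting, as in the full-combination multi-stream approach."""
    return lam * p_audio + (1.0 - lam) * p_video

def fuse_geometric(p_audio, p_video, lam, eps=1e-12):
    """Geometric (product) weighting; lam = 1 or 0 keeps only one stream."""
    fused = (p_audio ** lam) * (p_video ** (1.0 - lam))
    return fused / max(fused.sum(), eps)

p_audio = np.array([0.7, 0.2, 0.1])
p_video = np.array([0.4, 0.4, 0.2])
print(fuse_linear(p_audio, p_video, 0.8))
print(fuse_geometric(p_audio, p_video, 0.8))
```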
PLOS ONE | 2017
Louis-Jean Boë; Frédéric Berthommier; Thierry Legou; Guillaume Captier; Caralyn Kemp; Thomas R. Sawallis; Yannick Becker; Arnaud Rey; Joël Fagot
Language is a distinguishing characteristic of our species, and the course of its evolution is one of the hardest problems in science. It has long been generally considered that human speech requires a low larynx, and that the high larynx of nonhuman primates should preclude their producing the vowel systems universally found in human language. Examining the vocalizations through acoustic analyses, tongue anatomy, and modeling of acoustic potential, we found that baboons (Papio papio) produce sounds sharing the F1/F2 formant structure of the human [ɨ æ ɑ ɔ u] vowels, and that, similarly to humans, those vocalic qualities are organized as a system on two acoustic-anatomic axes. This confirms that hominoids can produce contrasting vowel qualities despite a high larynx. It suggests that spoken languages evolved from ancient articulatory skills already present in our last common ancestor with Cercopithecoidea, about 25 MYA.
Speech Communication | 2004
Frédéric Berthommier
The improvement of detectability of visible speech cues found by Grant and Seitz [2000. The use of visible speech cues for improving auditory detection of spoken sentences. JASA 108, 1197–1208] has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that audio and visual signals could interact early during the audio-visual perceptual process on the basis of audio envelope cues. On the other hand, acoustic-visual correlations were previously reported by Yehia et al. [1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43]. Taking into account these two main facts, the problem of extraction of the redundant audio-visual components is revisited: the video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz’s prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so that it will not bias phonetic identification. The low-level stage which is described is compatible with a late integration process, which may be used as a potential front-end for speech recognition applications.
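A minimal sketch of the interaction scheme described above follows: each audio subband is multiplied by an envelope predicted from the video stream. The filter bank, the four subband edges, and the way `video_pred_env` is obtained (e.g., a regression from lip parameters) are illustrative assumptions, not the paper's exact front end.

```python
# Subband envelope product between noisy audio and video-predicted envelopes (illustrative).
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 16000
BAND_EDGES = [(100, 800), (800, 2200), (2200, 4500), (4500, 7500)]  # 4 assumed subbands

def subband(x, lo, hi, fs=FS):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def enhance(noisy, video_pred_env):
    """video_pred_env: (4, len(noisy)) envelopes predicted from the video stream,
    assumed time-aligned with the audio and scaled to [0, 1]."""
    out = np.zeros_like(noisy)
    for b, (lo, hi) in enumerate(BAND_EDGES):
        band = subband(noisy, lo, hi)
        out += band * np.clip(video_pred_env[b], 0.0, 1.0)   # product with predicted envelope
    return out

# Toy usage: one second of noise and constant predicted envelopes
noisy = np.random.default_rng(0).standard_normal(FS)
print(enhance(noisy, np.full((4, FS), 0.5)).shape)
```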
Journal of the Acoustical Society of America | 1999
Christian Lorenzi; Frédéric Berthommier; Laurent Demany
Listeners were asked to discriminate between two amplitude-modulation functions imposed on white noise and consisting of the sum of two sinusoids. The frequency ratio of the sinusoids constituting each function was 2 or 3. In one function, the sinusoids had a constant relative phase. In the other function, their phase relation was continuously and cyclically changing, at a slow rate. For all listeners, the two functions with a frequency ratio of 2 were easily discriminated. However, discrimination was impossible when the frequency ratio was 3. Simulations were performed using an envelope-detector model and various decision statistics. The max/min statistic predicted discrimination above chance level when the frequency ratio was 3. It seems, therefore, that listeners are unable to use this statistic. In contrast, the crest factor and skewness of the envelope accounted well for the discrimination data.
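The decision statistics mentioned above can be computed directly on a Hilbert envelope, as in the sketch below; the stimulus parameters (modulation frequencies, depths, duration) are illustrative, not those of the experiment.

```python
# Envelope decision statistics (max/min, crest factor, skewness) on modulated noise.
import numpy as np
from scipy.signal import hilbert
from scipy.stats import skew

fs, dur = 16000, 1.0
t = np.arange(int(fs * dur)) / fs
rng = np.random.default_rng(0)

# White noise modulated by the sum of two sinusoids (frequency ratio 2)
m = 1.0 + 0.4 * np.sin(2 * np.pi * 4 * t) + 0.4 * np.sin(2 * np.pi * 8 * t)
x = m * rng.standard_normal(t.size)

env = np.abs(hilbert(x))                               # envelope-detector output
print("max/min:     ", env.max() / env.min())
print("crest factor:", env.max() / np.sqrt(np.mean(env ** 2)))
print("skewness:    ", skew(env))
```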
Journal of the Acoustical Society of America | 2015
Olha Nahorna; Frédéric Berthommier; Jean-Luc Schwartz
While audiovisual interactions in speech perception have long been considered as automatic, recent data suggest that this is not the case. In a previous study, Nahorna et al. [(2012). J. Acoust. Soc. Am. 132, 1061-1077] showed that the McGurk effect is reduced by a previous incoherent audiovisual context. This was interpreted as showing the existence of an audiovisual binding stage controlling the fusion process. Incoherence would produce unbinding and decrease the weight of the visual input in fusion. The present paper explores the audiovisual binding system to characterize its dynamics. A first experiment assesses the dynamics of unbinding, and shows that it is rapid: An incoherent context less than 0.5 s long (typically one syllable) suffices to produce a maximal reduction in the McGurk effect. A second experiment tests the rebinding process, by presenting a short period of either coherent material or silence after the incoherent unbinding context. Coherence provides rebinding, with a recovery of the McGurk effect, while silence provides no rebinding and hence freezes the unbinding process. These experiments are interpreted in the framework of an audiovisual speech scene analysis process assessing the perceptual organization of an audiovisual speech input before decision takes place at a higher processing stage.
Journal of the Acoustical Society of America | 2014
Aurélie Huyse; Jacqueline Leybaert; Frédéric Berthommier
This study investigated the impact of aging on audio-visual speech integration. A syllable identification task was presented in auditory-only, visual-only, and audio-visual congruent and incongruent conditions. Visual cues were either degraded or unmodified. Stimuli were embedded in stationary noise alternating with modulated noise. Fifteen young adults and 15 older adults participated in this study. Results showed that older adults had preserved lipreading abilities when the visual input was clear but not when it was degraded. The impact of aging on audio-visual integration also depended on the quality of the visual cues. In the visual clear condition, the audio-visual gain was similar in both groups and analyses in the framework of the fuzzy-logical model of perception confirmed that older adults did not differ from younger adults in their audio-visual integration abilities. In the visual reduction condition, the audio-visual gain was reduced in the older group, but only when the noise was stationary, suggesting that older participants could compensate for the loss of lipreading abilities by using the auditory information available in the valleys of the noise. The fuzzy-logical model of perception confirmed the significant impact of aging on audio-visual integration by showing an increased weight of audition in the older group.
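The analyses above rely on the fuzzy-logical model of perception (FLMP), whose combination rule is sketched below: the support for each response alternative is the product of the auditory and visual supports, normalised over alternatives. The support values in the example are illustrative.

```python
# FLMP combination of auditory and visual supports (illustrative values).
import numpy as np

def flmp(audio_support, visual_support):
    """audio_support, visual_support: per-category truth values in [0, 1]."""
    combined = np.asarray(audio_support) * np.asarray(visual_support)
    return combined / combined.sum()

# Example: two response categories (/ba/ vs /da/) for one audio-visual stimulus
print(flmp([0.7, 0.3], [0.2, 0.8]))   # strong visual support shifts the response toward /da/
```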
International Conference on Acoustics, Speech, and Signal Processing | 2004
Frédéric Berthommier
The strong association existing between audio subband envelope parameters and video parameters extracted using the full DCT (discrete cosine transform) can be exploited for audiovisual speech enhancement, thanks to a good prediction of amplitude variations by a statistical model. Since the video parameter space is highly multidimensional, the causality of this association must be clarified. First, a new retro-marking method is proposed in order to build a transformation of the DCT parameters into explicit ABS mouth-opening parameters. Second, a reduction to single-parameter spaces is performed by selecting the best parameters. We show in two noisy conditions that the degradation of the enhancement performance due to the transformation and to the reduction is moderate.
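The sketch below illustrates the kind of association exploited above: a linear least-squares predictor of subband envelope energies from video DCT parameters, followed by a reduction to the single best video parameter. The data shapes, the synthetic data, and the ranking criterion (absolute correlation with the measured energy) are illustrative assumptions, not the paper's retro-marking procedure.

```python
# Predicting audio subband envelope energies from video DCT parameters (illustrative).
import numpy as np

rng = np.random.default_rng(0)
T, n_dct, n_bands = 2000, 30, 4
V = rng.standard_normal((T, n_dct))                        # video DCT parameters per frame
W_true = rng.standard_normal((n_dct, n_bands))
A = V @ W_true + 0.1 * rng.standard_normal((T, n_bands))   # audio subband envelope energies

# Full multidimensional predictor (least squares)
W, *_ = np.linalg.lstsq(V, A, rcond=None)
err_full = np.mean((A - V @ W) ** 2)

# Reduction to the single best video parameter for subband 0
corrs = [abs(np.corrcoef(V[:, j], A[:, 0])[0, 1]) for j in range(n_dct)]
best = int(np.argmax(corrs))
print("full-model MSE:", err_full, "| best single DCT parameter for subband 0:", best)
```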