Laurent Girin
French Institute for Research in Computer Science and Automation
Publications
Featured research published by Laurent Girin.
Journal of the Acoustical Society of America | 2001
Laurent Girin; Jean-Luc Schwartz; Gang Feng
A key problem for telecommunication and human-machine communication systems is speech enhancement in noise. A number of techniques exist in this domain, all based on an acoustic-only approach, that is, processing the corrupted audio signal using audio information alone (from the corrupted signal itself or from additional audio information). In this paper, an audio-visual approach to the problem is considered, since several studies have demonstrated that viewing the speaker's face improves message intelligibility, especially in noisy environments. A prototype speech enhancement system that takes advantage of visual inputs is developed. A filtering approach is proposed in which the enhancement filters are estimated with the help of lip-shape information. The estimation is based on linear regression or simple neural networks trained on a corpus. A set of experiments, assessed by Gaussian classification and perceptual tests, demonstrates that it is indeed possible to enhance simple stimuli (vowel-plosive-vowel sequences) embedded in white Gaussian noise.
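As a rough illustration of this kind of filter estimation, the sketch below (Python, with entirely hypothetical training data, dimensions, and parameter names) fits a linear regression from lip-shape parameters to per-band enhancement gains and applies the predicted gains to a noisy frame; it is not the paper's actual system.

```python
# Minimal sketch (not the paper's exact system): learn a linear mapping from
# lip-shape parameters to per-band enhancement gains, then filter a noisy frame.
# The training corpus (lip parameters, target gains) is assumed to be available.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_lip, n_bands = 500, 2, 16          # e.g. lip width/height -> 16 filter bands

# Hypothetical training corpus: lip parameters and target band gains per frame.
lip_train = rng.uniform(0.0, 1.0, (n_frames, n_lip))
gain_train = rng.uniform(0.1, 1.0, (n_frames, n_bands))

# Linear regression (least squares) with a bias term, the simplest estimator
# mentioned in the abstract.
X = np.hstack([lip_train, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(X, gain_train, rcond=None)

def estimate_gains(lip_params):
    """Predict per-band enhancement gains from one frame of lip parameters."""
    x = np.append(lip_params, 1.0)
    return np.clip(x @ W, 0.0, 1.0)

# Apply the estimated gains to the band energies of a noisy frame (toy spectrum).
noisy_spectrum = rng.uniform(0.0, 1.0, n_bands)
enhanced_spectrum = estimate_gains([0.4, 0.7]) * noisy_spectrum
print(enhanced_spectrum.round(3))
```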
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Bertrand Rivet; Laurent Girin; Christian Jutten
Looking at the speaker's face can help a listener hear a speech signal better in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (the movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated with statistical tools, into audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture in each frequency channel. Separation is performed frequency by frequency with an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale-factor ambiguities encountered at each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 x 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
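The permutation-fixing idea can be illustrated as follows. This sketch assumes the per-frequency BSS outputs are already available and uses a plain correlation between output magnitude envelopes and a lip-opening track as a stand-in for the paper's statistical audiovisual model; all array shapes and data are toy values.

```python
# Illustrative sketch of the permutation-fixing step only (not the paper's
# statistical model): after per-frequency BSS, decide for each frequency bin
# which output corresponds to the visible speaker by correlating the outputs'
# magnitude envelopes with a lip-opening envelope.
import numpy as np

rng = np.random.default_rng(1)
n_freq, n_frames = 64, 200

# Hypothetical per-bin BSS outputs: shape (n_freq, 2 sources, n_frames).
Y = rng.standard_normal((n_freq, 2, n_frames)) + 1j * rng.standard_normal((n_freq, 2, n_frames))
lip_envelope = rng.uniform(0.0, 1.0, n_frames)        # visual parameter over time

def corr(a, b):
    a = a - a.mean(); b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

aligned = np.empty_like(Y)
for f in range(n_freq):
    env0, env1 = np.abs(Y[f, 0]), np.abs(Y[f, 1])
    # Put the output most coherent with the lip movements in channel 0.
    if corr(env0, lip_envelope) >= corr(env1, lip_envelope):
        aligned[f] = Y[f]
    else:
        aligned[f] = Y[f, ::-1]
```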
Signal Processing | 2012
Antoine Liutkus; Jonathan Pinel; Roland Badeau; Laurent Girin; Gaël Richard
We address the issue of underdetermined source separation in a particular informed configuration where both the sources and the mixtures are known during a so-called encoding stage. This knowledge enables the computation of side-information that is small enough to be inaudibly embedded into the mixtures. At the decoding stage, the sources are no longer assumed to be known; only the mixtures and the extracted side-information are processed for source separation. The proposed system models the sources as independent, locally stationary Gaussian processes (GP) and the mixing process as linear filtering. This model allows reliable estimation of the sources through generalized Wiener filtering, provided their spectrograms are known. As these spectrograms are too large to be embedded in the mixtures, we show how they can be efficiently approximated using either Nonnegative Tensor Factorization (NTF) or image compression. A high-capacity embedding method, based on Quantization Index Modulation applied to the time-frequency coefficients of the mixtures, is used to inaudibly embed the separation side-information into the mixtures and reaches embedding rates of about 250 kbps. Finally, a study of the performance of the full system is presented.
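The separation step itself reduces to generalized Wiener filtering once the (approximate) source power spectrograms are available at the decoder. A minimal sketch, with toy signals standing in for the decoded side-information:

```python
# Minimal sketch of the separation step only: generalized Wiener filtering of a
# single-channel mixture STFT, given (approximate) source power spectrograms.
# In the paper these spectrograms come from the embedded side-information
# (NTF or image-codec approximation); here they are oracle toy arrays.
import numpy as np
from scipy.signal import stft, istft

fs, n = 16000, 16000
rng = np.random.default_rng(2)
s1 = np.sin(2 * np.pi * 220 * np.arange(n) / fs)        # toy "source 1"
s2 = 0.5 * rng.standard_normal(n)                        # toy "source 2"
mix = s1 + s2

f, t, X = stft(mix, fs=fs, nperseg=512)

# Oracle power spectrograms, standing in for the decoded side-information.
P = [np.abs(stft(s, fs=fs, nperseg=512)[2]) ** 2 for s in (s1, s2)]
total = sum(P) + 1e-12

estimates = []
for Pi in P:
    mask = Pi / total                 # Wiener gain for source i in each TF bin
    _, si_hat = istft(mask * X, fs=fs, nperseg=512)
    estimates.append(si_hat)
```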
international conference on acoustics, speech, and signal processing | 2006
David Sodoyer; Bertrand Rivet; Laurent Girin; Jean-Luc Schwartz; Christian Jutten
We present a new approach to the voice activity detection (VAD) problem for speech signals embedded in non-stationary noise. The method is based on automatic lipreading: the objective is to detect voice activity or non-activity by exploiting the coherence between the acoustic speech signal and the speaker's lip movements. From a comprehensive analysis of lip-shape parameters during speech and non-speech events, we show that a single appropriate visual parameter, defined to characterize the lip movements, can be used to detect sections of voice activity or, more precisely, sections of silence. Detection scores obtained on spontaneous speech confirm the efficiency of the visual voice activity detector (VVAD).
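A toy version of such a visual VAD might simply threshold a smoothed measure of lip movement; the parameter, smoothing window, and threshold below are illustrative, not the ones studied in the paper.

```python
# Toy sketch of a visual VAD: threshold a smoothed measure of lip movement
# (frame-to-frame variation of a single lip-shape parameter).
import numpy as np

def visual_vad(lip_param, win=5, threshold=0.02):
    """lip_param: 1-D array of a lip-shape parameter per video frame.
    Returns a boolean array, True where speech activity is detected."""
    motion = np.abs(np.diff(lip_param, prepend=lip_param[0]))
    kernel = np.ones(win) / win
    smoothed = np.convolve(motion, kernel, mode="same")   # moving-average smoothing
    return smoothed > threshold

# Example: still lips (silence) followed by moving lips (speech).
lip = np.concatenate([np.full(50, 0.30), 0.30 + 0.1 * np.sin(np.linspace(0, 20, 50))])
print(visual_vad(lip).astype(int))
```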
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Mathieu Parvaix; Laurent Girin; Jean-Marc Brossier
In this paper, the issue of audio source separation from a single channel is addressed, i.e., the estimation of several source signals from a single observation of their mixture. This challenging problem is tackled with a specific two-level coder-decoder configuration. At the coder, the source signals are assumed to be available before the mix is processed. Each source signal is characterized by a set of parameters that provide additional information useful for separation. We propose an original method that uses a watermarking technique to imperceptibly embed this information about the source signals into the mix signal. At the decoder, the watermark is extracted from the mix signal so that an end-user who has no access to the original sources can separate these signals from their mixture. Hence, we call this separation process informed source separation (ISS). In this way, several instrument or voice signals can be segregated from a single piece of music to enable post-mixing processing such as volume control, echo addition, spatialization, or timbre transformation. Good performance is obtained for the separation of up to four source signals from mixtures of speech or music signals. These promising results open up new perspectives in both the underdetermined source separation and audio watermarking domains.
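The embedding layer such a system relies on can be pictured with a basic Quantization Index Modulation scheme, sketched below with an arbitrary quantization step and generic host coefficients; the paper's actual watermarking parameters are not reproduced here.

```python
# Sketch of the watermarking idea: Quantization Index Modulation (QIM) on
# signal coefficients. One bit is hidden per coefficient by quantizing it onto
# one of two interleaved lattices; the decoder reads back the nearest lattice.
import numpy as np

DELTA = 0.01   # quantization step; controls the (in)audibility / capacity trade-off

def qim_embed(coeffs, bits):
    """Embed one bit per coefficient (bit 0 -> even multiples of DELTA/2, bit 1 -> odd)."""
    offset = np.asarray(bits) * DELTA / 2.0
    return np.round((coeffs - offset) / DELTA) * DELTA + offset

def qim_decode(coeffs):
    """Recover the embedded bits from the watermarked coefficients."""
    return (np.mod(np.round(coeffs / (DELTA / 2.0)), 2)).astype(int)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 8)
bits = rng.integers(0, 2, 8)
xw = qim_embed(x, bits)
assert np.array_equal(qim_decode(xw), bits)
print("max distortion:", np.max(np.abs(xw - x)))
```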
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Mathieu Parvaix; Laurent Girin
In this paper, we address the issue of underdetermined source separation of I nonstationary audio sources from a J-channel linear instantaneous mixture (J < I). This problem is addressed with a specific coder-decoder configuration. At the coder, the source signals are assumed to be available before the mixing is processed. A joint time-frequency (TF) analysis of each source signal and of the mixture signals enables the selection of the subset of sources (among I) leading to the best separation results in each TF region. The corresponding source-index code is imperceptibly embedded into the mix signal using a watermarking technique. At the decoder, where the original source signals are unknown, extracting the watermark enables inversion of the mixture in each TF region to recover the source signals. With such an informed approach, it is shown that five instrument and singing-voice signals can be efficiently separated from two-channel stereo mixtures, with a quality that significantly exceeds that obtained by a semi-blind reference method and that enables separate manipulation of the source signals during stereo music restitution (i.e., remixing).
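The decoder-side inversion can be pictured as follows: in each TF region, the embedded index says which (at most J) sources are active, so the corresponding J x J mixing submatrix can be inverted. The mixing matrix and indices below are toy values, not taken from the paper.

```python
# Sketch of the decoder-side inversion for one TF bin: the watermark tells us
# which (at most J) sources are active, so the J x J mixing submatrix can be
# inverted exactly.
import numpy as np

A = np.array([[1.0, 0.7, 0.3],      # J = 2 channels, I = 3 sources (toy mixing matrix)
              [0.2, 0.8, 1.0]])

def recover_bin(x_bin, active):
    """x_bin: length-2 mixture vector for one TF bin.
    active: indices of the (at most 2) sources assumed active in that bin."""
    s = np.zeros(A.shape[1], dtype=x_bin.dtype)
    s[list(active)] = np.linalg.solve(A[:, list(active)], x_bin)
    return s

# Example: sources 0 and 2 are active in this bin, source 1 is negligible.
s_true = np.array([0.5, 0.0, -0.3])
x_bin = A @ s_true
print(recover_bin(x_bin, (0, 2)))    # ~ [0.5, 0.0, -0.3]
```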
Speech Communication | 2004
David Sodoyer; Laurent Girin; Christian Jutten; Jean-Luc Schwartz
Looking at the speaker's face is useful for hearing a speech signal better and extracting it from competing sources before identification. This suggests that new speech enhancement or extraction techniques could be developed that exploit the audio-visual coherence of speech stimuli. In this paper, a novel algorithm that plugs audio-visual coherence, estimated with statistical tools, into classical blind source separation algorithms is presented, and its assessment is described. We show, in the case of additive mixtures, that this algorithm performs better than classical blind tools both when there are as many sensors as sources and when there are fewer sensors than sources. Audio-visual coherence enables a focus on the speech source to be extracted. It may also be used at the output of a classical source separation algorithm to select the "best" sensor with reference to a target source.
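The "best output" selection mentioned at the end can be sketched very simply, using frame-energy correlation with a lip-opening track as a stand-in for the paper's statistical coherence measure; the function name and frame length below are assumptions.

```python
# Small sketch of output selection: given the outputs of any source separation
# algorithm, keep the one whose energy envelope is most coherent with the
# target speaker's lip movements (plain correlation used here for simplicity).
import numpy as np

def pick_target_output(outputs, lip_track, frame_len=160):
    """outputs: list of 1-D separated signals; lip_track: one value per frame.
    Returns the index of the output most coherent with the lip movements."""
    scores = []
    for y in outputs:
        n = len(lip_track) * frame_len
        env = (y[:n].reshape(-1, frame_len) ** 2).mean(axis=1)   # frame energies
        env, lip = env - env.mean(), lip_track - lip_track.mean()
        scores.append(env @ lip / (np.linalg.norm(env) * np.linalg.norm(lip) + 1e-12))
    return int(np.argmax(scores))
```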
EURASIP Journal on Advances in Signal Processing | 2002
David Sodoyer; Jean-Luc Schwartz; Laurent Girin; Jacob Klinkisch; Christian Jutten
We present a new approach to the source separation problem in the case of multiple speech signals. The method is based on automatic lipreading: the objective is to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker's lip movements. We consider the case of an additive stationary mixture of decorrelated sources, with no further assumptions on independence or non-Gaussianity. First, we present a theoretical framework showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system. We then address the case of audio-visual sources and show how, if a statistical model of the joint probability of the visual and spectral audio inputs is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability. Finally, we present a number of separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker and embedded in a mixture of other voices. We show that separation can be quite good for mixtures of 2, 3, and 5 sources. These results, while very preliminary, are encouraging and are discussed with respect to their potential complementarity with traditional audio-only separation or enhancement techniques.
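A small, self-contained caricature of this maximization is sketched below: a joint Gaussian model of (lip opening, frame log-energy) is trained on the target speaker, and the extraction direction of a 2 x 2 additive mixture is chosen by maximizing the likelihood of the extracted signal under that model. The features, model, and data are deliberately simplistic and are not the paper's.

```python
# Illustrative sketch only: learn a joint Gaussian model of (lip opening,
# frame log-energy) on the target speaker, then search the demixing direction
# of a 2 x 2 additive mixture that maximizes the audiovisual likelihood.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
n_frames, frame_len = 200, 160
lip = 0.5 + 0.4 * np.sin(np.linspace(0, 12, n_frames))          # toy lip-opening track

# Toy target whose energy follows the lip opening; toy stationary interferer.
s1 = np.concatenate([a * rng.standard_normal(frame_len) for a in lip])
s2 = rng.standard_normal(n_frames * frame_len)

def frame_log_energy(x):
    frames = x.reshape(n_frames, frame_len)
    return np.log((frames ** 2).mean(axis=1) + 1e-12)

# Train the joint audiovisual Gaussian on the clean target (normalized).
feats = np.column_stack([lip, frame_log_energy(s1 / s1.std())])
model = multivariate_normal(mean=feats.mean(axis=0), cov=np.cov(feats.T))

# Additive 2 x 2 mixture.
A = np.array([[1.0, 0.8], [0.6, 1.0]])
x1, x2 = A @ np.vstack([s1, s2])

# Grid search over the extraction direction, keeping the most coherent output.
best_theta, best_ll = None, -np.inf
for theta in np.linspace(0, np.pi, 180, endpoint=False):
    y = np.cos(theta) * x1 + np.sin(theta) * x2
    f = np.column_stack([lip, frame_log_energy(y / y.std())])
    ll = model.logpdf(f).mean()
    if ll > best_ll:
        best_theta, best_ll = theta, ll
print("chosen extraction angle:", round(best_theta, 3))
```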
Speech Communication | 2007
Bertrand Rivet; Laurent Girin; Christian Jutten
Audio-visual speech source separation consists of combining visual speech processing techniques (e.g., lip parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: the visual information is used as a voice activity detector (VAD), which is combined with a new geometric separation method. The proposed audio-visual method is shown to be effective at extracting a real spontaneous speech utterance in the difficult case of convolutive mixtures, even when the competing sources are highly non-stationary. Typical gains of 18-20 dB in signal-to-interference ratio are obtained for a wide range of 2 x 2 and 3 x 3 mixtures. Moreover, the overall process is computationally much simpler than previously proposed audio-visual separation schemes.
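The following sketch is not the paper's geometric method; it only shows one simple way a visual VAD can inform two-microphone separation, by estimating a per-frequency interference transfer ratio on target-silent frames and subtracting the predicted interference elsewhere (at the cost of some target distortion).

```python
# Illustrative alternative (not the paper's method): during frames where the
# visual VAD says the target is silent, estimate per frequency the interference
# transfer ratio between the two channels, then subtract the predicted
# interference when the target speaks.
import numpy as np
from scipy.signal import stft, istft

def vad_informed_cancellation(x1, x2, target_silent, fs=16000, nperseg=512):
    """x1, x2: mixture channels; target_silent: boolean flag per STFT frame."""
    f, t, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    sil = target_silent[: X1.shape[1]]
    # Least-squares ratio H(f) such that X1 ~= H * X2 on target-silent frames.
    num = (X1[:, sil] * np.conj(X2[:, sil])).sum(axis=1)
    den = (np.abs(X2[:, sil]) ** 2).sum(axis=1) + 1e-12
    H = num / den
    _, y = istft(X1 - H[:, None] * X2, fs=fs, nperseg=nperseg)
    return y
```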
international conference on acoustics, speech, and signal processing | 2004
Laurent Girin; Sylvain Marchand
In this paper, the application of the sinusoidal model of audio/speech signals to the watermarking task is proposed. The basic idea is that an adequate modulation of the frequency trajectories of medium-rank partials is not perceptible, so this modulation can carry the data to be embedded in the signal. The modulation (encoding) and estimation (decoding) of the message are described, and promising preliminary results are given for speech signals.
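A toy rendering of the encoding principle, with arbitrary frequency offset, partial ranks, and frame parameters (none taken from the paper):

```python
# Toy illustration: re-synthesize a frame as a sum of sinusoidal partials,
# adding a small frequency offset to some medium-rank partials to carry bits.
import numpy as np

fs, dur, f0 = 16000, 0.05, 150.0                 # one 50 ms voiced-like frame
t = np.arange(int(fs * dur)) / fs
n_partials = 20
amps = 1.0 / np.arange(1, n_partials + 1)        # decaying partial amplitudes

bits = [1, 0, 1, 1]
carriers = range(8, 8 + len(bits))               # "medium-rank" partials carrying the bits
delta_hz = 3.0                                   # small, assumed inaudible, frequency offset

frame = np.zeros_like(t)
for k in range(1, n_partials + 1):
    fk = k * f0
    if k in carriers:
        fk += delta_hz if bits[k - 8] else -delta_hz   # frequency offset encodes the bit
    frame += amps[k - 1] * np.sin(2 * np.pi * fk * t)

# A decoder would estimate each carrier partial's frequency (e.g., by peak
# picking in a high-resolution spectrum) and read the sign of its deviation
# from k * f0 to recover the bit.
```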