Juergen Luettin
University of Sheffield
Publications
Featured research published by Juergen Luettin.
IEEE Transactions on Multimedia | 2000
Stéphane Dupont; Juergen Luettin
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module; an acoustic module; and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and how to incorporate them in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
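As a rough illustration of the multistream idea described above, the sketch below combines per-stream Gaussian log-likelihoods with fixed stream weights for a single HMM state. The stream weights, feature dimensions and Gaussian state models are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the multistream emission combination: the joint
# log-likelihood of a state is a weighted sum of the per-stream
# log-likelihoods. All models, weights and dimensions below are
# illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

def combined_log_likelihood(audio_obs, visual_obs,
                            audio_pdf, visual_pdf,
                            w_audio=0.7, w_visual=0.3):
    """Weighted per-frame log-likelihood for one HMM state."""
    ll_a = audio_pdf.logpdf(audio_obs)    # acoustic stream score
    ll_v = visual_pdf.logpdf(visual_obs)  # visual stream score
    return w_audio * ll_a + w_visual * ll_v

# Toy state models: Gaussians over PLP-like and lip-feature vectors.
audio_state = multivariate_normal(mean=np.zeros(13), cov=np.eye(13))
visual_state = multivariate_normal(mean=np.zeros(10), cov=np.eye(10))

frame_ll = combined_log_likelihood(np.random.randn(13), np.random.randn(10),
                                   audio_state, visual_state)
print(frame_ll)
```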
Computer Vision and Image Understanding | 1997
Juergen Luettin; Neil A. Thacker
We describe a robust method for locating and tracking lips in gray-level image sequences. Our approach learns patterns of shape variability from a training set, which constrain the model during image search to deform only in ways similar to the training examples. Image search is guided by a learned gray-level model which is used to describe the large appearance variability of lips. Such variability might be due to different individuals, illumination, mouth opening, specularity, or visibility of the teeth and tongue. Visual speech features are recovered from the tracking results and represent both shape and intensity information. We describe a speechreading (lip-reading) system, where the extracted features are modeled by Gaussian distributions and their temporal dependencies by hidden Markov models. Experimental results are presented for locating lips, tracking lips, and speechreading. The database used consists of a broad variety of speakers and was recorded in a natural environment with no special lighting or lip markers. For a speaker-independent digit recognition task using visual information only, the system achieved an accuracy about equivalent to that of untrained humans.
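The sketch below illustrates the learned shape-constraint step behind this kind of model: aligned lip contours are summarized by a mean shape and a few principal modes, and a candidate contour is clamped so that each mode coefficient stays within a plausible range. The mode count, the ±3σ limits and the synthetic data are assumptions for illustration only.

```python
# A minimal sketch of constraining a lip contour with a learned shape model.
import numpy as np

def fit_shape_model(training_shapes, n_modes=5):
    """training_shapes: (n_examples, 2*n_landmarks) aligned lip contours."""
    mean = training_shapes.mean(axis=0)
    u, s, vt = np.linalg.svd(training_shapes - mean, full_matrices=False)
    eigvals = (s ** 2) / (len(training_shapes) - 1)
    return mean, vt[:n_modes], eigvals[:n_modes]

def constrain(candidate, mean, modes, eigvals, k=3.0):
    """Project a candidate contour onto the model and limit each mode to ±k·σ."""
    b = modes @ (candidate - mean)
    b = np.clip(b, -k * np.sqrt(eigvals), k * np.sqrt(eigvals))
    return mean + modes.T @ b

shapes = np.random.randn(50, 2 * 20) * 0.1        # toy aligned contours
mean, modes, eigvals = fit_shape_model(shapes)
plausible = constrain(np.random.randn(2 * 20), mean, modes, eigvals)
```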
Pattern Recognition Letters | 1997
Pierre Jourlin; Juergen Luettin; Dominique Genoud; Hubert Wassner
This paper describes a multimodal approach for speaker verification. The system consists of two classifiers, one using visual features and the other using acoustic features. A lip tracker is used to extract visual information from the speaking face, which provides shape and intensity features. We describe an approach for normalizing and mapping different modalities onto a common confidence interval. We also describe a novel method for integrating the scores of multiple classifiers. Verification experiments are reported for the individual modalities and for the combined classifier. The integrated system outperformed each sub-system and reduced the false acceptance rate of the acoustic sub-system from 2.3% to 0.5%.
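As a hedged illustration of the general idea (not the normalization actually used in the paper), the sketch below maps each classifier's raw score onto a common (0, 1) confidence scale with a sigmoid and fuses the two scores with fixed weights before thresholding. The centres, scales, weights and threshold are placeholders.

```python
# Sketch: map heterogeneous classifier scores to a common confidence
# interval, then fuse them for an accept/reject decision.
import math

def to_confidence(raw_score, centre, scale):
    """Map an unbounded modality score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(raw_score - centre) / scale))

def fuse(acoustic_score, visual_score, w_acoustic=0.6, w_visual=0.4,
         threshold=0.5):
    c_a = to_confidence(acoustic_score, centre=0.0, scale=1.0)
    c_v = to_confidence(visual_score, centre=0.0, scale=1.0)
    combined = w_acoustic * c_a + w_visual * c_v
    return combined >= threshold   # True -> accept the claimed identity

print(fuse(acoustic_score=1.2, visual_score=-0.3))
```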
international conference on acoustics, speech, and signal processing | 2001
Hervé Glotin; Dimitra Vergyri; Chalapathy Neti; Gerasimos Potamianos; Juergen Luettin
We demonstrate an improvement in state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information in addition to the traditional audio information. We take a decision fusion approach to the audio-visual information, where the single-modality (audio-only and visual-only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize the word error rate (WER; clean audio results only).
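The following sketch illustrates the first idea in spirit: a hypothetical audio-reliability estimate (here a crude SNR-based mapping, not the estimator used in the paper) sets the audio stream weight, and hypothesis scores are recombined as a weighted sum of audio and visual log-likelihoods.

```python
# Sketch of reliability-driven stream weighting; the SNR-to-weight mapping
# and all numbers below are illustrative assumptions.
import numpy as np

def audio_weight(snr_db, snr_clean=25.0, snr_floor=0.0):
    """Map an SNR estimate to a weight in [0, 1]; the visual stream gets 1 - weight."""
    return float(np.clip((snr_db - snr_floor) / (snr_clean - snr_floor), 0.0, 1.0))

def rescore(audio_loglik, visual_loglik, snr_db):
    w = audio_weight(snr_db)
    return w * audio_loglik + (1.0 - w) * visual_loglik

# The same hypothesis scored under clean and noisy estimates of the audio stream.
print(rescore(-120.0, -95.0, snr_db=20.0))
print(rescore(-180.0, -95.0, snr_db=5.0))
```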
international conference on acoustics, speech, and signal processing | 2001
Juergen Luettin; Gerasimos Potamianos; Chalapathy Neti
Addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly based on maximum likelihood estimation. Experiments, performed for a speaker-independent large vocabulary continuous speech recognition task and different integration methods, show that the best performance is obtained by asynchronous stream integration. This system reduces the error rate at an 8.5 dB SNR with additive speech-babble noise by 27% relative over audio-only models and by 12% relative over traditional audio-visual models using concatenative feature fusion.
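A minimal sketch of the composite-state idea follows: the audio-visual model is built over pairs of audio and visual states, and asynchrony is limited by allowing only pairs whose indices differ by a small amount. The state counts and the asynchrony limit are illustrative assumptions, not the paper's topology.

```python
# Enumerate the allowed composite (audio_state, visual_state) pairs under a
# simple asynchrony constraint.
from itertools import product

def composite_states(n_audio, n_visual, max_async=1):
    """Allowed (audio_state, visual_state) pairs with bounded asynchrony."""
    return [(i, j) for i, j in product(range(n_audio), range(n_visual))
            if abs(i - j) <= max_async]

states = composite_states(n_audio=3, n_visual=3, max_async=1)
print(states)   # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
```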
international conference on spoken language processing | 1996
Juergen Luettin; Neil A. Thacker; Steve W. Beet
This paper describes a new approach for speaker identification based on lipreading. Visual features are extracted from image sequences of the talking face and consist of shape parameters, which describe the lip boundary, and intensity parameters, which describe the grey-level distribution of the mouth area. Intensity information is based on principal component analysis using eigenspaces which deform with the shape model. The extracted parameters account for both speech-dependent and speaker-dependent information. We built spatio-temporal speaker models based on these features, using HMMs with mixtures of Gaussians. Promising results were obtained for text-dependent and text-independent speaker identification tests performed on a small video database.
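The sketch below shows the identification step in simplified form: each enrolled speaker gets a generative model of the lip features, and a test sequence is assigned to the speaker whose model scores it highest. For brevity it uses per-speaker Gaussian mixtures on synthetic data rather than the spatio-temporal HMMs described above; speaker names and dimensions are placeholders.

```python
# Toy speaker identification by maximum average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
enrolment = {                      # speaker -> (n_frames, n_features) training data
    "spk1": rng.normal(0.0, 1.0, size=(200, 8)),
    "spk2": rng.normal(0.5, 1.0, size=(200, 8)),
}
models = {spk: GaussianMixture(n_components=4, covariance_type="diag",
                               random_state=0).fit(feats)
          for spk, feats in enrolment.items()}

test_sequence = rng.normal(0.5, 1.0, size=(50, 8))
identified = max(models, key=lambda spk: models[spk].score(test_sequence))
print(identified)   # expected: "spk2"
```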
Speechreading by Humans and Machines | 1996
Juergen Luettin; Neil A. Thacker; Steve W. Beet
Most approaches for lip modelling are based on heuristic constraints imposed by the user. We describe the use of Active Shape Models for extracting visual speech features for use by automatic speechreading systems, where the deformation of the lip model as well as image search is based on a priori knowledge learned from a training set. We demonstrate the robustness and accuracy of the technique for locating and tracking lips on a database consisting of a broad variety of talkers and lighting conditions.
international conference on acoustics, speech, and signal processing | 1996
Juergen Luettin; Neil A. Thacker; Steve W. Beet
This paper describes a novel approach for visual speech recognition. The shape of the mouth is modelled by an active shape model which is derived from the statistics of a training set and used to locate, track and parameterise the speaker's lip movements. The extracted parameters representing the lip shape are modelled as continuous probability distributions and their temporal dependencies are modelled by hidden Markov models. We present recognition tests performed on a database with a broad variety of speakers and illumination conditions. The system achieved an accuracy of 85.42% on a speaker-independent recognition task of the first four digits using lip shape information only.
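As an illustrative implementation of this setup (not the toolkit or corpus used in the paper), the sketch below trains one Gaussian-output HMM per word class on synthetic lip-shape parameter sequences with the hmmlearn package and labels a test sequence with the best-scoring class.

```python
# Per-class HMMs over lip-shape parameter sequences; classification by
# maximum log-likelihood. Data, class names and model sizes are assumptions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(1)

def make_sequences(offset, n_seqs=20, length=30, dim=6):
    seqs = [rng.normal(offset, 1.0, size=(length, dim)) for _ in range(n_seqs)]
    return np.concatenate(seqs), [length] * n_seqs

models = {}
for word, offset in [("one", 0.0), ("two", 1.0)]:
    X, lengths = make_sequences(offset)
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X, lengths)
    models[word] = m

test, _ = make_sequences(1.0, n_seqs=1)
print(max(models, key=lambda w: models[w].score(test)))   # expected: "two"
```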
international conference on spoken language processing | 1996
Juergen Luettin; Neil A. Thacker; Steve W. Beet
We describe a speechreading system that uses both shape information from the lip contours and intensity information from the mouth area. Shape information is obtained by tracking and parameterizing the inner and outer lip boundary in an image sequence. Intensity information is extracted from a grey-level model based on principal component analysis. In comparison to other approaches, the intensity area deforms with the shape model to ensure that similar object features are represented after non-rigid deformation of the lips. We describe speaker-independent recognition experiments based on these features and hidden Markov models. Preliminary results suggest that similar performance can be achieved by using either shape or intensity information, and slightly higher performance by their combined use.
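A small sketch of the combined feature vector follows: per frame, the lip-shape parameters are concatenated with PCA coefficients of the grey levels sampled in the shape-normalized mouth region. The dimensionalities and the random stand-in PCA basis are placeholders, not the representation used in the paper.

```python
# Concatenate shape parameters with intensity PCA coefficients for one frame.
import numpy as np

def frame_features(shape_params, mouth_patch, intensity_mean, intensity_modes):
    """shape_params: (n_shape,); mouth_patch: flattened shape-normalized pixels."""
    intensity_coeffs = intensity_modes @ (mouth_patch - intensity_mean)
    return np.concatenate([shape_params, intensity_coeffs])

n_pixels, n_modes = 400, 10
feat = frame_features(np.zeros(8), np.random.rand(n_pixels),
                      np.zeros(n_pixels), np.random.randn(n_modes, n_pixels))
print(feat.shape)   # (18,)
```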
international conference on pattern recognition | 1996
Juergen Luettin; Neil A. Thacker; Steve W. Beet
This paper describes a robust method for extracting visual speech information from the shape of the lips, to be used in automatic speechreading (lipreading) systems. Lip deformation is modelled by a statistically based deformable contour model which learns typical lip deformation from a training set. The main difficulty in locating and tracking lips consists of finding dominant image features for representing the lip contours. We describe the use of a statistical profile model which learns dominant image features from a training set. The model captures global intensity variation due to different illumination and different skin reflectance, as well as intensity changes at the inner lip contour due to mouth opening and visibility of the teeth and tongue. The method is validated for locating and tracking lip movements on a database with a broad variety of speakers.
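The sketch below captures the profile-model idea in miniature: each landmark stores the mean and covariance of normalized grey-level profiles sampled along the contour normal in training images, and during search the candidate position whose profile lies closest to that model (in Mahalanobis distance) is preferred. Profile length, regularization and data are assumptions made for the sketch.

```python
# Learn a grey-level profile model for one landmark and pick the best
# candidate position along the search normal.
import numpy as np

def train_profile_model(training_profiles):
    """training_profiles: (n_examples, profile_len) normalized grey-level profiles."""
    mean = training_profiles.mean(axis=0)
    cov = (np.cov(training_profiles, rowvar=False)
           + 1e-6 * np.eye(training_profiles.shape[1]))   # regularize
    return mean, np.linalg.inv(cov)

def best_candidate(candidates, mean, inv_cov):
    """Index of the candidate profile closest to the learned model."""
    d = [float((c - mean) @ inv_cov @ (c - mean)) for c in candidates]
    return int(np.argmin(d))

train = np.random.randn(100, 11)
mean, inv_cov = train_profile_model(train)
print(best_candidate(np.random.randn(7, 11), mean, inv_cov))
```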