Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Steffen Zeiler is active.

Publication


Featured research published by Steffen Zeiler.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition

Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa

With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As a stream weight estimator, we consider using multilayer perceptrons and logistic functions to map multidimensional reliability measure features to audiovisual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.
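To make the weighting mechanism concrete, here is a minimal numpy sketch of the two operations the abstract describes: a logistic estimator mapping a 31-dimensional reliability feature vector to an audio stream weight, and the stream-weighted coupled-HMM state score. The parameters w and b are placeholders, and the paper's multilayer-perceptron variant is omitted.

```python
import numpy as np

def logistic_stream_weight(reliability, w, b):
    """Map a reliability feature vector (31-dimensional in the paper)
    to an audio stream weight in (0, 1) with a logistic function.
    w and b stand in for trained parameters (placeholders here)."""
    return 1.0 / (1.0 + np.exp(-(reliability @ w + b)))

def chmm_weighted_log_likelihood(ll_audio, ll_video, lam):
    """Stream-weighted coupled-HMM state score:
    lam * log p(o_a | q_a) + (1 - lam) * log p(o_v | q_v)."""
    return lam * ll_audio + (1.0 - lam) * ll_video

# Toy usage with random placeholder parameters.
rng = np.random.default_rng(0)
reliability = rng.normal(size=31)
w, b = rng.normal(size=31), 0.0
lam = logistic_stream_weight(reliability, w, b)
score = chmm_weighted_log_likelihood(-12.3, -15.1, lam)
```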


IEEE Signal Processing Letters | 2013

Noise-Adaptive LDA: A New Approach for Speech Recognition Under Observation Uncertainty

Dorothea Kolossa; Steffen Zeiler; Rahim Saeidi; Ramón Fernández Astudillo

Automatic speech recognition (ASR) performance suffers severely from non-stationary noise, precluding widespread use of ASR in natural environments. Recently, so-termed uncertainty-of-observation techniques have helped to recover good performance. These consider the clean speech features as a hidden variable, of which the observable features are only an imperfect estimate. An estimated error variance of features is therefore used to further guide recognition. Based on the same idea, we introduce a new strategy: Reducing the speech feature dimensionality for optimal discriminance under observation uncertainty can yield significantly improved recognition performance, and is derived easily via Fisher's criterion of discriminant analysis.
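A minimal sketch of one plausible reading of the idea, assuming the average feature error covariance sigma_obs is given: it inflates the within-class scatter before the generalized eigenproblem of Fisher's criterion is solved, so uncertain dimensions are down-weighted in the projection. This is an illustration, not the letter's exact derivation.

```python
import numpy as np
from scipy.linalg import eigh

def noise_adaptive_lda(X, y, sigma_obs, n_components):
    """Fisher-style LDA under observation uncertainty (sketch):
    X: (N, D) features, y: (N,) class labels,
    sigma_obs: (D, D) mean feature error covariance."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Uncertain observations act like extra within-class variance.
    vals, vecs = eigh(Sb, Sw + len(X) * sigma_obs)
    return vecs[:, np.argsort(vals)[::-1][:n_components]]
```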


Conference of the International Speech Communication Association | 2016

Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR.

Sebastian Gergen; Steffen Zeiler; Ahmed Hussen Abdelaziz; Robert M. Nickel; Dorothea Kolossa

Automatic speech recognition (ASR) enables very intuitive human-machine interaction. However, signal degradations due to reverberation or noise reduce the accuracy of audio-based recognition. The introduction of a second signal stream that is not affected by degradations in the audio domain (e.g., a video stream) increases the robustness of ASR against degradations in the original domain. Here, depending on the signal quality of audio and video at each point in time, a dynamic weighting of both streams can optimize the recognition performance. In this work, we introduce a strategy for estimating optimal weights for the audio and video streams in turbo-decoding-based ASR using a discriminative cost function. The results show that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audiovisual decoding using coupled hidden Markov models.
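As a conceptual sketch of where the dynamic weights act: the snippet below scales per-frame state log-posteriors by a stream weight and renormalizes, which is roughly how a turbo decoder can damp the extrinsic information it passes to the other modality when its own stream is unreliable. The discriminative estimation of the weights themselves is not shown.

```python
import numpy as np
from scipy.special import logsumexp

def weight_posteriors(log_post, gamma):
    """Scale per-frame state log-posteriors (shape (T, S)) by dynamic
    stream weights gamma (shape (T,)) and renormalize each frame, a
    conceptual step in exchanging weighted extrinsic information."""
    scaled = gamma[:, None] * log_post
    return scaled - logsumexp(scaled, axis=1, keepdims=True)
```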


International Conference on Acoustics, Speech, and Signal Processing | 2013

Twin-HMM-based audio-visual speech enhancement

Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa

Most approaches for speech signal processing rely solely on acoustic input, which has the consequence that spectrum estimation becomes exceedingly difficult when the signal-to-noise ratio drops to values near 0 dB. However, alternative sources of information are becoming widely available with increasing use of multimedia data in everyday communication. In the following paper, we suggest using video input as an auxiliary modality for speech processing by applying a new statistical model - the twin hidden Markov model. The resulting enhancement algorithm for audiovisual data greatly outperforms the standard audio-only log-MMSE estimator on all considered instrumental speech quality measures covering spectral and perceptual quality.
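A heavily simplified sketch of the model-based estimation step: the clean spectrum is approximated as the posterior-weighted average of clean-speech state means. The actual twin HMM maps recognition-domain states to a separate synthesis domain, which this one-liner glosses over.

```python
import numpy as np

def mmse_clean_estimate(state_posteriors, clean_state_means):
    """MMSE-flavored clean-speech estimate as the posterior-weighted
    average of clean state means (simplified twin-HMM inference).
    state_posteriors: (T, S), rows summing to one;
    clean_state_means: (S, F) clean log-spectra per state."""
    return state_posteriors @ clean_state_means
```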


Robust Speech Recognition of Uncertain or Missing Data | 2011

Use of Missing and Unreliable Data for Audiovisual Speech Recognition

Alexander Vorwerk; Steffen Zeiler; Dorothea Kolossa; Ramón Fernández Astudillo; Dennis Lerch

Under acoustically distorted conditions, any available video information is especially helpful for increasing recognition robustness. However, an optimal strategy for integrating audio and video information is difficult to find, since both streams may independently suffer from time-varying degrees of distortion. In this chapter, we show how missing-feature techniques for coupled HMMs can help us fuse information from both uncertain information sources. We also focus on the estimation of reliability for the video feature stream, which is obtained from a linear discriminant analysis (LDA) applied to a set of shape- and appearance-based features. The approach has resulted in significant performance improvements under strongly distorted conditions, while, in conjunction with stream weight tuning, being lower-bounded in performance by the best of the two single-stream recognizers under all tested conditions.
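The core missing-feature rule the chapter builds on can be sketched in a few lines: dimensions flagged unreliable are marginalized out of a diagonal-Gaussian score, i.e. simply dropped from the sum. This is the textbook missing-data rule, not the chapter's exact coupled-HMM formulation.

```python
import numpy as np

def marginalized_log_likelihood(obs, reliable, mean, var):
    """Missing-feature scoring of a diagonal Gaussian: dimensions whose
    entry in the boolean mask `reliable` is False are treated as missing
    and marginalized out, i.e. excluded from the log-likelihood sum."""
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (obs - mean) ** 2 / var)
    return ll[reliable].sum()
```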


International Conference on Acoustics, Speech, and Signal Processing | 2016

Robust audiovisual speech recognition using noise-adaptive linear discriminant analysis

Steffen Zeiler; Robert M. Nickel; Ning Ma; Guy J. Brown; Dorothea Kolossa

Automatic speech recognition (ASR) has become a widespread and convenient mode of human-machine interaction, but it is still not sufficiently reliable when used under highly noisy or reverberant conditions. One option for achieving far greater robustness is to include another modality that is unaffected by acoustic noise, such as video information. Currently the most successful approaches for such audiovisual ASR systems, coupled hidden Markov models (HMMs) and turbo decoding, both allow for slight asynchrony between audio and video features, and significantly improve recognition rates in this way. However, both typically still neglect residual errors in the estimation of audio features, so-called observation uncertainties. This paper compares two strategies for adding these observation uncertainties into the decoder, and shows that significant recognition rate improvements are achievable for both coupled HMMs and turbo decoding.
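One standard way to inject observation uncertainties into the decoder, and presumably related to the strategies compared here, is variance inflation: the estimated feature error variance is added to the model variance, so uncertain dimensions yield flatter likelihoods. A minimal sketch:

```python
import numpy as np

def uncertainty_decoding_ll(obs, obs_var, mean, var):
    """Diagonal-Gaussian log-likelihood with the estimated feature error
    variance obs_var added to the model variance var, so that uncertain
    dimensions contribute flatter, less decisive scores."""
    v = var + obs_var
    return (-0.5 * (np.log(2.0 * np.pi * v) + (obs - mean) ** 2 / v)).sum()
```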


International Conference on Acoustics, Speech, and Signal Processing | 2014

A new EM estimation of dynamic stream weights for coupled-HMM-based audio-visual ASR

Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa

Mutually deploying visual and acoustical information in automatic speech recognition systems increases their robustness against acoustical environmental effects like additive noise and reverberation. Optimal fusion of the audio and video streams requires dynamic adaptation of the relative contribution of each modality. This can be achieved by weighting each stream according to its reliability by an appropriate stream weight. In this paper we propose a new expectation maximization algorithm that estimates oracle frame-dependent stream weights for coupled-HMM-based audio-visual speech recognition. Moreover, we introduce a greedy optimization approach that reasonably initializes this algorithm. The proposed approach is evaluated on the Grid audio-visual database and results in an average relative word error rate reduction of 38% and 58% compared to grid search and Bayes fusion, respectively. The estimated oracle stream weights can be used instead of the conventional global fixed stream weights to improve the supervised training of stream weight estimators.


International Conference on Acoustics, Speech, and Signal Processing | 2013

GMM-based significance decoding

Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa; Volker Leutnant

The accuracy of automatic speech recognition systems in noisy and reverberant environments can be improved notably by exploiting the uncertainty of the estimated speech features using so-called uncertainty-of-observation techniques. In this paper, we introduce a new Bayesian decision rule that can serve as a mathematical framework from which both known and new uncertainty-of-observation techniques can be either derived or approximated. The new decision rule in its direct form leads to the new significance decoding approach for Gaussian mixture models, which results in better performance compared to standard uncertainty-of-observation techniques in different additive and convolutive noise scenarios.
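The Gaussian identity "integral of N(x; y, s) N(x; mu, v) dx = N(y; mu, v + s)" gives a closed form for the expected likelihood of a diagonal-covariance GMM under a Gaussian posterior of the clean feature; the sketch below uses it for uncertainty-aware GMM scoring. The paper's significance decoding rule refines this baseline, so treat the code as illustrative only.

```python
import numpy as np
from scipy.special import logsumexp

def uncertain_gmm_log_likelihood(obs, obs_var, weights, means, variances):
    """Expected diagonal-GMM likelihood under a Gaussian feature posterior:
    each component is evaluated with its variance inflated by the
    observation uncertainty (closed form via the Gaussian product rule).
    Shapes: obs, obs_var: (D,); weights: (K,); means, variances: (K, D)."""
    v = variances + obs_var                                   # (K, D)
    log_comp = -0.5 * (np.log(2.0 * np.pi * v)
                       + (obs - means) ** 2 / v).sum(axis=1)  # (K,)
    return logsumexp(np.log(weights) + log_comp)
```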


International Conference on Acoustics, Speech, and Signal Processing | 2012

Inventory-style speech enhancement with uncertainty-of-observation techniques

Robert M. Nickel; Ramón Fernández Astudillo; Dorothea Kolossa; Steffen Zeiler; Rainer Martin

We present a new method for inventory-style speech enhancement that significantly improves over earlier approaches [1]. Inventory-style enhancement attempts to resynthesize a clean speech signal from a noisy signal via corpus-based speech synthesis. The advantage of such an approach is that one is not bound to trade noise suppression against signal distortion in the same way that most traditional methods do. A significant improvement in perceptual quality is typically the result. Disadvantages of this new approach, however, include speaker dependency, increased processing delays, and the necessity of substantial system training. Earlier published methods relied on a-priori knowledge of the expected noise type during the training process [1]. In this paper we present a new method that exploits uncertainty-of-observation techniques to circumvent the need for noise specific training. Experimental results show that the new method is not only able to match, but outperform the earlier approaches in perceptual quality.
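As a rough, hypothetical sketch of the unit-selection step in inventory-style enhancement under uncertainty: each noisy frame picks the clean inventory frame with the smallest uncertainty-weighted squared distance, so feature dimensions with large estimated error variance influence the choice less. Function and variable names are invented for illustration.

```python
import numpy as np

def select_inventory_units(noisy, noisy_var, inventory):
    """Pick, for every noisy frame, the clean inventory frame with the
    smallest uncertainty-weighted squared distance (hypothetical sketch).
    noisy, noisy_var: (T, D); inventory: (N, D); returns (T,) indices."""
    d2 = ((noisy[:, None, :] - inventory[None, :, :]) ** 2
          / noisy_var[:, None, :]).sum(axis=2)   # (T, N) distances
    return d2.argmin(axis=1)                      # best unit per frame
```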


Conference of the International Speech Communication Association | 2016

Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement.

Steffen Zeiler; Hendrik Meutzner; Ahmed Hussen Abdelaziz; Dorothea Kolossa

Models for automatic speech recognition (ASR) hold detailed information about spectral and spectro-temporal characteristics of clean speech signals. Using these models for speech enhancement is desirable and has been the target of past research efforts. In such model-based speech enhancement systems, a powerful ASR is imperative. To increase the recognition rates especially in low-SNR conditions, we suggest the use of the additional visual modality, which is mostly unaffected by degradations in the acoustic channel. An optimal integration of acoustic and visual information is achievable by joint inference in both modalities within the turbo-decoding framework. By combining turbo decoding with twin HMMs for speech enhancement in this way, notable improvements can be achieved, not only in terms of instrumental estimates of speech quality, but also in actual speech intelligibility. This is verified through listening tests, which show that in highly challenging noise conditions, average human recognition accuracy can be improved from 64% without signal processing to 80% when using the presented architecture.

Collaboration


Dive into Steffen Zeiler's collaborations.

Top Co-Authors

Rahim Saeidi (Radboud University Nijmegen)
Alexander Vorwerk (Technical University of Berlin)
Reinhold Orglmeister (Technical University of Berlin)
Pejman Mowlaee (Graz University of Technology)