Ahmed Hussen Abdelaziz
Ruhr University Bochum
Publications
Featured research published by Ahmed Hussen Abdelaziz.
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa
With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustic information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of the audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As a stream weight estimator, we consider using multilayer perceptrons and logistic functions to map multidimensional reliability measure features to audio-visual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.
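To make the mapping concrete, here is a minimal sketch of a logistic stream weight estimator feeding a weighted CHMM score combination. The 31-dimensional input follows the abstract; the function names, parameters, and random placeholder values are purely illustrative and not the authors' implementation.

```python
import numpy as np

def logistic_stream_weight(reliability_features, w, b):
    """Map a 31-dim reliability feature vector to a stream weight in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(w @ reliability_features + b)))

def weighted_av_log_likelihood(log_p_audio, log_p_video, lam):
    """Per-frame weighted combination of audio and video state log-likelihoods,
    as used in coupled-HMM decoding with a dynamic stream weight lam."""
    return lam * log_p_audio + (1.0 - lam) * log_p_video

# Illustrative call for a single frame, with random placeholder values.
rng = np.random.default_rng(0)
features = rng.normal(size=31)      # model- and signal-based reliability measures
w, b = rng.normal(size=31), 0.0     # parameters trained on oracle EM targets
lam = logistic_stream_weight(features, w, b)
score = weighted_av_log_likelihood(-42.0, -37.5, lam)
```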
Conference of the International Speech Communication Association | 2016
Sebastian Gergen; Steffen Zeiler; Ahmed Hussen Abdelaziz; Robert M. Nickel; Dorothea Kolossa
Automatic speech recognition (ASR) enables very intuitive human-machine interaction. However, signal degradations due to reverberation or noise reduce the accuracy of audio-based recognition. The introduction of a second signal stream that is not affected by degradations in the audio domain (e.g., a video stream) increases the robustness of ASR against degradations in the original domain. Here, depending on the signal quality of audio and video at each point in time, a dynamic weighting of both streams can optimize the recognition performance. In this work, we introduce a strategy for estimating optimal weights for the audio and video streams in turbo-decoding-based ASR using a discriminative cost function. The results show that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audio-visual decoding using coupled hidden Markov models.
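One way to picture a discriminative weighting criterion is to choose, per frame, the weight that maximizes the margin between the best and second-best combined state scores. The sketch below is a toy reading under that assumption; the paper's actual cost function may differ, and all names are invented.

```python
import numpy as np

def discriminative_weight(log_p_audio, log_p_video,
                          candidates=np.linspace(0.0, 1.0, 21)):
    """Pick the stream weight that maximizes the gap between the best and
    runner-up combined state scores for one frame (toy margin criterion)."""
    best_lam, best_margin = 0.5, -np.inf
    for lam in candidates:
        combined = lam * log_p_audio + (1.0 - lam) * log_p_video
        top2 = np.partition(combined, -2)[-2:]   # [runner-up, best]
        margin = top2[1] - top2[0]
        if margin > best_margin:
            best_lam, best_margin = lam, margin
    return best_lam

# Toy usage: per-state log-likelihoods for 8 states in each modality.
rng = np.random.default_rng(7)
lam_t = discriminative_weight(rng.normal(size=8), rng.normal(size=8))
```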
International Conference on Acoustics, Speech, and Signal Processing | 2013
Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa
Most approaches to speech signal processing rely solely on acoustic input, with the consequence that spectrum estimation becomes exceedingly difficult when the signal-to-noise ratio drops to values near 0 dB. However, alternative sources of information are becoming widely available with the increasing use of multimedia data in everyday communication. In this paper, we suggest using video input as an auxiliary modality for speech processing by applying a new statistical model, the twin hidden Markov model. The resulting enhancement algorithm for audio-visual data greatly outperforms the standard audio-only log-MMSE estimator on all considered instrumental speech quality measures, covering both spectral and perceptual quality.
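The twin-HMM idea can be sketched as a two-step procedure: decode a state sequence against recognition-domain models, then emit clean-domain features from the aligned states. The toy implementation below assumes diagonal-Gaussian state models and state-conditional means for synthesis; all names, dimensions, and the uniform transitions are illustrative, not the paper's models.

```python
import numpy as np

def viterbi_states(obs, means, variances, log_trans, log_init):
    """Toy Viterbi decoding over diagonal-Gaussian state models."""
    T = obs.shape[0]
    log_like = -0.5 * (((obs[:, None, :] - means) ** 2) / variances
                       + np.log(2 * np.pi * variances)).sum(-1)   # (T, S)
    delta = log_init + log_like[0]
    back = np.zeros_like(log_like, dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[i, j]: i -> j
        back[t] = scores.argmax(0)
        delta = scores.max(0) + log_like[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return np.array(path[::-1])

def synthesize_clean_features(path, clean_means):
    """Twin-HMM-style synthesis: emit the clean-domain state mean per frame."""
    return clean_means[path]

# Toy usage: 40 frames, 5 states, 13-dim features, uniform transitions.
rng = np.random.default_rng(1)
S, D, T = 5, 13, 40
means, clean_means = rng.normal(size=(S, D)), rng.normal(size=(S, D))
obs = rng.normal(size=(T, D))
log_uniform = np.log(np.full((S, S), 1.0 / S))
path = viterbi_states(obs, means, np.ones((S, D)), log_uniform, log_uniform[0])
clean = synthesize_clean_features(path, clean_means)
```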
International Conference on Acoustics, Speech, and Signal Processing | 2014
Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa
Jointly deploying visual and acoustic information in automatic speech recognition systems increases their robustness against acoustic environmental effects such as additive noise and reverberation. Optimal fusion of the audio and video streams requires dynamic adaptation of the relative contribution of each modality, which can be achieved by weighting each stream according to its reliability. In this paper we propose a new expectation maximization algorithm that estimates oracle frame-dependent stream weights for coupled-HMM-based audio-visual speech recognition. Moreover, we introduce a greedy optimization approach that reasonably initializes this algorithm. The proposed approach is evaluated on the Grid audio-visual database and results in average relative word error rate reductions of 38% and 58% compared to grid search and Bayes fusion, respectively. The estimated oracle stream weights can be used instead of the conventional global fixed stream weights to improve the supervised training of stream weight estimators.
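As a rough illustration of the greedy initialization, one could select, frame by frame, the weight that maximizes the posterior of the oracle state under the weighted stream combination. This sketch assumes that reading; the grid, names, and random inputs are placeholders, not the paper's algorithm.

```python
import numpy as np
from scipy.special import logsumexp

def greedy_oracle_weights(log_p_audio, log_p_video, oracle_states,
                          grid=np.linspace(0.0, 1.0, 11)):
    """Per frame, pick the weight maximizing the posterior of the oracle
    state under the weighted stream combination (toy greedy initialization)."""
    weights = np.empty(len(oracle_states))
    for t, s in enumerate(oracle_states):
        posteriors = [lam * log_p_audio[t, s] + (1 - lam) * log_p_video[t, s]
                      - logsumexp(lam * log_p_audio[t]
                                  + (1 - lam) * log_p_video[t])
                      for lam in grid]
        weights[t] = grid[int(np.argmax(posteriors))]
    return weights

# Toy usage: 50 frames, 8 states per modality, random oracle alignment.
rng = np.random.default_rng(5)
w = greedy_oracle_weights(rng.normal(size=(50, 8)), rng.normal(size=(50, 8)),
                          rng.integers(0, 8, size=50))
```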
International Conference on Acoustics, Speech, and Signal Processing | 2016
Mahdie Karbasi; Ahmed Hussen Abdelaziz; Dorothea Kolossa
Most objective measures employed for speech intelligibility prediction require a clean reference signal, which is not accessible in all realistic scenarios. In this paper, we propose to re-synthesize the relevant features of the clean signal using only the noisy speech signal and to utilize them inside an intelligibility prediction framework that requires a reference. A statistical model called the twin hidden Markov model (THMM) is used to synthesize the clean speech features. The short-time objective intelligibility (STOI) measure, an accurate and well-known method, serves as the intelligibility prediction framework. The experimental results show a high correlation between the twin-HMM-based STOI (THMMB-STOI) and human speech recognition results, even slightly outperforming conventional STOI predictions computed using the actual clean reference signals.
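The STOI-style back end can be approximated by averaging short-time correlations between reference and degraded feature trajectories, where here the reference would come from the THMM synthesis rather than a true clean signal. The segment length and all names below are illustrative simplifications, not the actual STOI measure.

```python
import numpy as np

def stoi_like_score(ref_feats, deg_feats, seg_len=30):
    """Average short-time correlation between reference and degraded feature
    trajectories (a much-simplified stand-in for the STOI measure)."""
    scores = []
    for start in range(0, ref_feats.shape[0] - seg_len + 1, seg_len):
        r = ref_feats[start:start + seg_len]
        d = deg_feats[start:start + seg_len]
        r = (r - r.mean(0)) / (r.std(0) + 1e-12)   # z-score each band
        d = (d - d.mean(0)) / (d.std(0) + 1e-12)
        scores.append((r * d).mean())              # mean per-band correlation
    return float(np.mean(scores))

# Toy usage: the "reference" would be the THMM-synthesized clean features.
rng = np.random.default_rng(2)
clean = rng.normal(size=(120, 15))
noisy = clean + 0.5 * rng.normal(size=(120, 15))
print(stoi_like_score(clean, noisy))   # closer to 1 means more intelligible
```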
International Conference on Acoustics, Speech, and Signal Processing | 2013
Ahmed Hussen Abdelaziz; Steffen Zeiler; Dorothea Kolossa; Volker Leutnant
The accuracy of automatic speech recognition systems in noisy and reverberant environments can be improved notably by exploiting the uncertainty of the estimated speech features using so-called uncertainty-of-observation techniques. In this paper, we introduce a new Bayesian decision rule that can serve as a mathematical framework from which both known and new uncertainty-of-observation techniques can be either derived or approximated. The new decision rule in its direct form leads to the new significance decoding approach for Gaussian mixture models, which results in better performance compared to standard uncertainty-of-observation techniques in different additive and convolutive noise scenarios.
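For the standard uncertainty-of-observation baseline that this decision rule generalizes, integrating a diagonal-Gaussian mixture over a Gaussian observation posterior amounts to inflating each component variance by the feature uncertainty. A minimal sketch of that baseline follows (the significance decoding rule itself is more involved); names and toy values are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def uncertainty_gmm_loglike(x_hat, var_x, weights, means, variances):
    """log p(x_hat) for a diagonal GMM under a Gaussian observation with
    mean x_hat and variance var_x: integrating N(x; mu, v) * N(x; x_hat,
    var_x) over x yields N(x_hat; mu, v + var_x), i.e. inflated variances."""
    v = variances + var_x                               # (K, D)
    comp = -0.5 * (((x_hat - means) ** 2) / v
                   + np.log(2 * np.pi * v)).sum(-1)     # (K,) component terms
    return float(logsumexp(np.log(weights) + comp))

# Toy usage: 4 mixture components, 13-dim features, uniform uncertainty.
rng = np.random.default_rng(6)
ll = uncertainty_gmm_loglike(rng.normal(size=13), np.full(13, 0.5),
                             np.full(4, 0.25), rng.normal(size=(4, 13)),
                             np.ones((4, 13)))
```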
Conference of the International Speech Communication Association | 2016
Steffen Zeiler; Hendrik Meutzner; Ahmed Hussen Abdelaziz; Dorothea Kolossa
Models for automatic speech recognition (ASR) hold detailed information about spectral and spectro-temporal characteristics of clean speech signals. Using these models for speech enhancement is desirable and has been the target of past research efforts. In such model-based speech enhancement systems, a powerful ASR component is imperative. To increase the recognition rates, especially in low-SNR conditions, we suggest the use of the additional visual modality, which is mostly unaffected by degradations in the acoustic channel. An optimal integration of acoustic and visual information is achievable by joint inference in both modalities within the turbo-decoding framework. By combining turbo decoding with twin HMMs for speech enhancement in this way, notable improvements can be achieved, not only in terms of instrumental estimates of speech quality, but also in actual speech intelligibility. This is verified through listening tests, which show that in highly challenging noise conditions, average human recognition accuracy can be improved from 64% without signal processing to 80% when using the presented architecture.
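Schematically, the joint inference can be pictured as an iterative exchange of extrinsic state information between single-modality decoders, following the usual turbo-decoding recipe. The sketch below replaces the full forward-backward passes with simple per-frame posteriors, so it only illustrates the message flow; all names and the extrinsic-information bookkeeping are assumptions, not the paper's architecture.

```python
import numpy as np

def frame_posteriors(log_like, log_prior):
    """Per-frame state posteriors from log-likelihoods plus an external
    log-prior (stands in for a full forward-backward pass)."""
    log_post = log_like + log_prior
    log_post -= log_post.max(-1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(-1, keepdims=True)

def turbo_iterations(log_like_audio, log_like_video, n_iter=3):
    """Iteratively exchange extrinsic state information between the audio
    and video decoders (schematic turbo-decoding message flow)."""
    ext_video = np.zeros_like(log_like_audio)
    for _ in range(n_iter):
        post_audio = frame_posteriors(log_like_audio, ext_video)
        ext_audio = np.log(post_audio + 1e-12) - ext_video   # remove prior
        post_video = frame_posteriors(log_like_video, ext_audio)
        ext_video = np.log(post_video + 1e-12) - ext_audio
    return post_audio, post_video

# Toy usage: 40 frames, 6 states; the converged posteriors could then drive
# a twin-HMM-style synthesis stage for enhancement.
rng = np.random.default_rng(3)
pa, pv = turbo_iterations(rng.normal(size=(40, 6)), rng.normal(size=(40, 6)))
```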
Conference of the International Speech Communication Association | 2016
Mahdie Karbasi; Ahmed Hussen Abdelaziz; Hendrik Meutzner; Dorothea Kolossa
Automatic prediction of speech intelligibility is highly desirable in the speech research community, since listening tests are time-consuming and cannot be used online. Most of the available objective speech intelligibility measures are intrusive methods, as they require a clean reference signal in addition to the corresponding noisy/processed signal at hand. In order to overcome the problem of predicting speech intelligibility in the absence of the clean reference signal, we have proposed in [1] to employ a recognition/synthesis framework called the twin hidden Markov model (THMM) for synthesizing the clean features required inside an intrusive intelligibility prediction method. This framework predicts speech intelligibility as well as established intrusive methods such as the short-time objective intelligibility (STOI) measure. The original THMM, however, requires the correct transcription for synthesizing the clean reference features, which is not always available. In this paper, we go one step further and investigate the use of the recognized transcription instead of the oracle transcription to obtain a more widely applicable speech intelligibility prediction. We show that the output of the newly proposed blind approach is highly correlated with human speech recognition results, collected via crowdsourcing in different noise conditions.
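The step from the intrusive to the blind variant amounts to one substitution in the pipeline: the recognizer's own output replaces the oracle transcription before synthesis. A schematic sketch, with the components passed in as callables since the abstract does not specify their interfaces:

```python
def predict_intelligibility(noisy_feats, recognize, synthesize_clean, stoi_like):
    """Blind intelligibility prediction (schematic): the recognized
    transcription replaces the oracle one before twin-HMM synthesis."""
    transcription = recognize(noisy_feats)        # was: oracle transcription
    ref_feats = synthesize_clean(noisy_feats, transcription)
    return stoi_like(ref_feats, noisy_feats)
```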
Speech Communication | 2016
Ahmed Hussen Abdelaziz; Dorothea Kolossa
Uncertainty decoding has recently been successful in improving automatic speech recognition performance in noisy environments by considering the pre-processed feature vectors not as deterministic but rather as random variables containing estimation errors, residual noise, and also artifacts introduced by the signal pre-processors themselves. However, the achievable improvements depend strongly on how well the statistics of these random variables are estimated in the recognition domain. In this paper, we compare two approaches for estimating these statistics. The first approach directly estimates the needed statistics in the recognition domain. The second one estimates the statistics in the processing domain and then propagates them through the typically nonlinear feature extraction to obtain the corresponding statistics in the recognition domain. Based on this distinction, we propose a new hybrid approach that combines the advantages of both approaches and avoids their disadvantages. The new hybrid approach can be used with any speech pre-processor, which enables wider usage of uncertainty decoding instead of the conventional maximum likelihood approach.
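The propagation alternative can be made concrete with a generic sigma-point (unscented) transform that pushes a processing-domain mean and variance through a nonlinearity such as the log stage of log-Mel feature extraction. This is a generic sketch, not the paper's propagation scheme; the feature chain, the kappa value, and all names are assumptions.

```python
import numpy as np

def unscented_propagate(mean, var, nonlinearity, kappa=1.0):
    """Propagate a diagonal-Gaussian (mean, var) through a nonlinearity
    using sigma points; returns the transformed mean and diagonal variance."""
    D = mean.size
    scale = np.sqrt((D + kappa) * var)
    sigma = np.vstack([mean[None, :],
                       mean + np.diag(scale),
                       mean - np.diag(scale)])          # (2D + 1, D) points
    w = np.full(2 * D + 1, 1.0 / (2 * (D + kappa)))
    w[0] = kappa / (D + kappa)
    y = np.array([nonlinearity(p) for p in sigma])
    y_mean = w @ y
    y_var = w @ (y - y_mean) ** 2
    return y_mean, y_var

# Toy usage: push a positive Mel-like power vector through the log stage.
rng = np.random.default_rng(4)
m, v = rng.uniform(1.0, 2.0, size=10), np.full(10, 0.01)
mu_rec, var_rec = unscented_propagate(m, v, np.log)
```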
Unknown Journal | 2015
Ahmed Hussen Abdelaziz; Shinji Watanabe; John R. Hershey; Emmanuel Vincent; Dorothea Kolossa