Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Martin Wöllmer is active.

Publication


Featured research published by Martin Wöllmer.


Computer Speech & Language | 2014

Feature Enhancement by Deep LSTM Networks for ASR in Reverberant Multisource Environments

Felix Weninger; Jürgen T. Geiger; Martin Wöllmer; Björn W. Schuller; Gerhard Rigoll

This article investigates speech feature enhancement based on deep bidirectional recurrent neural networks. The Long Short-Term Memory (LSTM) architecture is used to exploit a self-learnt amount of temporal context in learning the correspondences between noisy, reverberant speech features and undistorted ones. The resulting networks are applied to feature enhancement in the context of the 2nd Computational Hearing in Multisource Environments (CHiME) Challenge (2013) track 2 task, which consists of the Wall Street Journal (WSJ-0) corpus distorted by highly non-stationary, convolutive noise. In extensive test runs, different feature front-ends, network training targets, and network topologies are evaluated in terms of frame-wise regression error and speech recognition performance. Furthermore, we consider gradually refined speech recognition back-ends, from baseline ‘out-of-the-box’ clean models to discriminatively trained multi-condition models adapted to the enhanced features. Overall, deep bidirectional LSTM networks processing log Mel filterbank outputs deliver the best results with clean models, reaching down to 42% word error rate (WER) at signal-to-noise ratios ranging from −6 to 9 dB (multi-condition CHiME Challenge baseline: 55% WER). Discriminative training of the back-end using LSTM-enhanced features is shown to further decrease the WER to 22%. To our knowledge, this is the best result reported for the 2nd CHiME Challenge WSJ-0 task to date.
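
The core idea of the paper, a recurrent network trained to regress from noisy to clean features frame by frame, can be illustrated with a minimal PyTorch sketch. Layer sizes, the 40-band log-Mel front-end, and the optimizer settings below are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: a deep bidirectional LSTM trained to map noisy/reverberant
# log-Mel features to their clean counterparts (frame-wise regression).
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, n_mels=40, hidden=128, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(n_mels, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_mels)  # fuse both directions

    def forward(self, noisy):            # noisy: (batch, frames, n_mels)
        h, _ = self.blstm(noisy)
        return self.proj(h)              # frame-wise clean-feature estimate

model = BLSTMEnhancer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                   # regression to the clean features

noisy = torch.randn(8, 200, 40)          # dummy batch: 8 utterances, 200 frames
clean = torch.randn(8, 200, 40)          # dummy parallel clean features
opt.zero_grad()
loss = loss_fn(model(noisy), clean)
loss.backward()
opt.step()
```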


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Memory-enhanced neural networks and NMF for robust ASR

Jürgen T. Geiger; Felix Weninger; Jort F. Gemmeke; Martin Wöllmer; Björn W. Schuller; Gerhard Rigoll

In this article we address the problem of distant speech recognition in reverberant, noisy environments. Speech enhancement methods, e.g., using non-negative matrix factorization (NMF), are successful in improving the robustness of ASR systems. Furthermore, discriminative training and feature transformations are employed to increase the robustness of traditional systems using Gaussian mixture models (GMM). On the other hand, acoustic models based on deep neural networks (DNN) were recently shown to outperform GMMs. In this work, we combine a state-of-the-art GMM system with a deep Long Short-Term Memory (LSTM) recurrent neural network in a double-stream architecture. Such networks use memory cells in the hidden units, enabling them to learn long-range temporal context and thus increasing robustness against noise and reverberation. The network is trained to predict frame-wise phoneme estimates, which are converted into observation likelihoods to be used as an acoustic model. It is of particular interest whether the LSTM system is capable of improving a robust state-of-the-art GMM system, which is confirmed in the experimental results. In addition, we investigate the efficiency of NMF for speech enhancement on the front-end side. Experiments are conducted on the medium-vocabulary task of the 2nd CHiME Speech Separation and Recognition Challenge, which includes reverberation and highly variable noise. Experimental results show that the average word error rate of the challenge baseline is reduced by 64% relative. The best challenge entry, a noise-robust state-of-the-art recognition system, is outperformed by 25% relative.
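
The conversion of the network's frame-wise phoneme estimates into observation likelihoods is the standard hybrid-decoding step: posteriors are divided by the phoneme priors (Bayes' rule, dropping the constant frame probability). A small NumPy sketch of that step, with illustrative names and a hypothetical 40-phoneme inventory:

```python
# Sketch of the hybrid step described above: frame-wise phoneme posteriors
# p(phoneme | frame) from the LSTM are turned into scaled observation
# likelihoods p(frame | phoneme) by dividing out the phoneme priors.
import numpy as np

def posteriors_to_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (frames, phonemes), rows sum to 1.
    priors: (phonemes,) relative phoneme frequencies from training data."""
    scaled = posteriors / np.maximum(priors, floor)
    return np.log(np.maximum(scaled, floor))  # log-likelihoods for the decoder

posteriors = np.random.dirichlet(np.ones(40), size=100)  # dummy network output
priors = np.full(40, 1.0 / 40)                           # dummy uniform priors
loglik = posteriors_to_likelihoods(posteriors, priors)
```

In the double-stream architecture, these log-likelihoods would be combined with the GMM stream's acoustic scores during decoding, typically via per-stream weights.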


International Conference on Acoustics, Speech, and Signal Processing | 2013

Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise

Martin Wöllmer; Zixing Zhang; Felix Weninger; Björn W. Schuller; Gerhard Rigoll

The recognition of spontaneous speech in highly variable noise is known to be a challenge, especially at low signal-to-noise ratios (SNR). In this paper, we investigate the effect of applying bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks for speech feature enhancement in noisy conditions. BLSTM networks tend to prevail over conventional neural network architectures whenever the recognition or regression task relies on an intelligent exploitation of temporal context information. We show that BLSTM networks are well-suited for mapping from noisy to clean speech features and that the obtained recognition performance gain is partly complementary to improvements via additional techniques such as speech enhancement by non-negative matrix factorization and probabilistic feature generation by Bottleneck-BLSTM networks. Compared to simple multi-condition training or feature enhancement via standard recurrent neural networks, our BLSTM-based feature enhancement approach leads to remarkable gains in word accuracy in a highly challenging task of recognizing spontaneous speech at SNR levels between −6 and 9 dB.
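
The complementary NMF enhancement mentioned above can be sketched as follows: with fixed, pre-learnt speech and noise dictionaries, the activations of a noisy magnitude spectrogram are estimated by multiplicative updates, and the speech part is reconstructed with a Wiener-style mask. This is a generic supervised-NMF sketch, not the authors' exact system; dictionary sizes and iteration counts are assumptions.

```python
# Generic supervised-NMF enhancement sketch (KL multiplicative updates with
# fixed dictionaries); all sizes below are illustrative assumptions.
import numpy as np

def nmf_enhance(V, W_speech, W_noise, iters=50, eps=1e-9):
    W = np.hstack([W_speech, W_noise])           # (bins, k_s + k_n)
    H = np.random.rand(W.shape[1], V.shape[1])   # random nonneg activations
    for _ in range(iters):                       # multiplicative updates (KL)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    k_s = W_speech.shape[1]
    S = W_speech @ H[:k_s]                       # speech-only reconstruction
    N = W_noise @ H[k_s:]                        # noise-only reconstruction
    return V * S / (S + N + eps)                 # Wiener-style masking

V = np.random.rand(257, 100)                     # dummy noisy spectrogram
Ws = np.random.rand(257, 32)                     # pre-learnt speech bases
Wn = np.random.rand(257, 16)                     # pre-learnt noise bases
V_speech = nmf_enhance(V, Ws, Wn)
```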


International Conference on Acoustics, Speech, and Signal Processing | 2013

Probabilistic ASR feature extraction applying context-sensitive connectionist temporal classification networks

Martin Wöllmer; Björn W. Schuller; Gerhard Rigoll

This paper proposes a novel automatic speech recognition (ASR) front-end that unites the principles of bidirectional Long Short-Term Memory (BLSTM), Connectionist Temporal Classification (CTC), and Bottleneck (BN) feature generation. BLSTM networks are known to produce better probabilistic ASR features than conventional multilayer perceptrons since they are able to exploit a self-learned amount of temporal context for phoneme estimation. Combining BLSTM networks with a CTC output layer implies the advantage that the network can be trained on unsegmented data so that the quality of phoneme prediction does not rely on potentially error-prone forced alignment segmentations of the training set. In challenging ASR scenarios involving highly spontaneous, disfluent, and noisy speech, our BN-CTC front-end leads to remarkable word accuracy improvements and prevails over a series of previously introduced BLSTM-based ASR systems.
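
A minimal sketch of the BN-CTC idea: a BLSTM with a narrow bottleneck layer is trained with a CTC objective on unsegmented phoneme sequences, and the bottleneck activations are then tapped off as compact probabilistic ASR features. Layer sizes and the 41-phoneme inventory below are assumptions, not the paper's configuration.

```python
# Sketch of a BN-CTC front-end in PyTorch; hyperparameters are assumptions.
import torch
import torch.nn as nn

class BNCTCFrontEnd(nn.Module):
    def __init__(self, n_feats=39, hidden=128, bn_dim=26, n_phones=41):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, bidirectional=True,
                             batch_first=True)
        self.bottleneck = nn.Linear(2 * hidden, bn_dim)  # narrow BN layer
        self.out = nn.Linear(bn_dim, n_phones + 1)       # +1 for the CTC blank

    def forward(self, x):
        h, _ = self.blstm(x)
        bn = torch.tanh(self.bottleneck(h))  # these activations become the
        return bn, self.out(bn)              # tandem ASR features after training

model = BNCTCFrontEnd()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 300, 39)                         # dummy batch
targets = torch.randint(1, 42, (4, 50))             # unsegmented phoneme labels
bn, logits = model(x)
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (T, N, C)
loss = ctc(log_probs, targets,
           torch.full((4,), 300, dtype=torch.long),
           torch.full((4,), 50, dtype=torch.long))
loss.backward()
```

Because CTC only needs the label sequence, not frame-level alignments, the bottleneck features do not inherit errors from a forced-alignment segmentation of the training set.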


International Symposium on Neural Networks | 2017

Towards intoxicated speech recognition

Zixing Zhang; Felix Weninger; Martin Wöllmer; Jing Han; Björn W. Schuller

In real-life scenarios, the acoustic characteristics of speech often suffer from variations induced by diverse environmental noises and different speakers. To overcome the speaker-related speech variation problem for Automatic Speech Recognition (ASR), many speaker adaptation techniques have been proposed and studied. Almost all of these studies, however, only considered the speakers' long-term traits, such as age, gender, and dialect. Speakers' short-term states, for example, affect and intoxication, are largely ignored. In this study, we address one particular speaker state, alcohol intoxication, which has rarely been studied in the context of ASR. To do this, empirical experiments are performed on a publicly available database used for the INTERSPEECH 2011 Speaker State Challenge, Intoxication Sub-Challenge. The experimental results show that the intoxicated state of the speaker indeed degrades the performance of ASR systems by a large margin for all three considered speech styles (spontaneous speech, tongue twisters, command & control). In addition, this paper further shows that multi-condition training can notably improve the acoustic model.
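
Multi-condition training here simply means pooling training material from both speaker states so the acoustic model sees sober and intoxicated speech. A tiny PyTorch sketch with hypothetical dataset names:

```python
# Sketch of multi-condition training data pooling; datasets are dummies.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

sober = TensorDataset(torch.randn(500, 39), torch.randint(0, 41, (500,)))
intoxicated = TensorDataset(torch.randn(200, 39), torch.randint(0, 41, (200,)))

multi_condition = ConcatDataset([sober, intoxicated])  # pooled conditions
loader = DataLoader(multi_condition, batch_size=32, shuffle=True)
```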


International Conference on Acoustics, Speech, and Signal Processing | 2013

The Munich Feature Enhancement Approach to the 2013 CHiME Challenge Using BLSTM Recurrent Neural Networks

Felix Weninger; Jürgen T. Geiger; Martin Wöllmer; Björn W. Schuller; Gerhard Rigoll


Audio Engineering Society Conference: 42nd International Conference: Semantic Audio | 2011

Semantic Speech Tagging: Towards Combined Analysis of Speaker Traits

Björn W. Schuller; Martin Wöllmer; Florian Eyben; Gerhard Rigoll; Dejan Arsic


Proceedings CHiME 2013 | 2013

The TUM+TUT+KUL approach to the CHiME challenge 2013: Multi-stream ASR exploiting BLSTM networks and sparse NMF

Jürgen T. Geiger; Felix Weninger; Antti Hurmalainen; Jort F. Gemmeke; Martin Wöllmer; Björn W. Schuller; Gerhard Rigoll; Tuomas Virtanen


Conference of the International Speech Communication Association | 2008

Speech Recognition in Noisy Environments using a Switching Linear Dynamic Model for Feature Enhancement

Björn W. Schuller; Martin Wöllmer; Tobias Moosmayr; Gerhard Rigoll


Proceedings MediaEval 2012 Workshop | 2012

The TUM Cumulative DTW Approach for the Mediaeval 2012 Spoken Web Search Task

Cyril Joder; Felix Weninger; Martin Wöllmer; Björn W. Schuller

Collaboration


Dive into Martin Wöllmer's collaboration.

Top Co-Authors

Gerhard Rigoll

Technische Universität München

Jort F. Gemmeke

Katholieke Universiteit Leuven

Jing Han

University of Passau