Publications


Featured research published by Ralf Schlüter.


international conference on acoustics, speech, and signal processing | 2001

Computing Mel-frequency cepstral coefficients on the power spectrum

Sirko Molau; Michael Pitz; Ralf Schlüter; Hermann Ney

We present a method to derive Mel-frequency cepstral coefficients (MFCC) directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation. We show that frequency warping schemes like vocal tract normalization can be integrated easily into our concept without additional computational effort. Recognition test results obtained with the RWTH large vocabulary speech recognition system are presented for two different corpora: the German VerbMobil II dev99 corpus and the English North American Business News 94 20k development corpus.
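As a point of reference for the front end simplified here, the conventional MFCC pipeline applies a triangular mel filterbank to the power spectrum, takes logarithms of the filter outputs, and decorrelates them with a DCT. Below is a minimal pure-Python sketch of that conventional pipeline; the filter spacing, 20 filters, and 12 coefficients are illustrative defaults, not the paper's exact configuration:

```python
import math

def hz_to_mel(f):
    # Mel scale: perceptually motivated frequency warping.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    # Triangular filters with centers equally spaced on the mel scale.
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + i * (high - low) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) / (sample_rate / 2.0) * (n_bins - 1))) for m in mels]
    filters = []
    for j in range(1, n_filters + 1):
        filt = [0.0] * n_bins
        for k in range(bins[j - 1], bins[j]):   # rising edge
            filt[k] = (k - bins[j - 1]) / (bins[j] - bins[j - 1])
        for k in range(bins[j], bins[j + 1]):   # falling edge
            filt[k] = (bins[j + 1] - k) / (bins[j + 1] - bins[j])
        filters.append(filt)
    return filters

def mfcc_from_power_spectrum(power, sample_rate, n_filters=20, n_ceps=12):
    # Conventional pipeline: mel filterbank -> log -> DCT-II.
    fb = mel_filterbank(n_filters, len(power), sample_rate)
    log_e = [math.log(max(sum(f * p for f, p in zip(filt, power)), 1e-10))
             for filt in fb]
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for i in range(n_ceps)]
```

The paper's contribution is to fold the explicit filterbank step into the spectral analysis itself; the sketch above shows the baseline being simplified.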


international conference on acoustics, speech, and signal processing | 2001

Using phase spectrum information for improved speech recognition performance

Ralf Schlüter; Hermann Ney

New acoustic features for continuous speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase based features were combined with standard Mel Frequency Cepstral Coefficients (MFCC), and results were produced with and without using additional linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill corpus for telephone line recorded German digit strings. Using LDA to combine purely phase based features with MFCCs, we obtained improvements in word error rate of up to 25% relative to using MFCCs alone with the same overall number of parameters in the system.


Speech Communication | 2001

Comparison of discriminative training criteria and optimization methods for speech recognition

Ralf Schlüter; Wolfgang Macherey; Boris Müller; Hermann Ney

The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion, together with the optimization methods gradient descent (GD) and the extended Baum (EB) algorithm. A tree search-based restricted recognition method using word graphs is presented to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA Wall Street Journal (WSJ) corpus with a vocabulary of 5k words, and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant correlation was observed between the language models chosen for training and recognition.
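As a toy illustration of the two criteria compared above, the sketch below evaluates MMI (the log posterior of the correct word sequence) and a sigmoid-smoothed MCE loss on hypothetical log-likelihood scores. Real training operates on word graphs rather than explicit hypothesis lists, and the smoothing constant is an illustrative choice:

```python
import math

def mmi_objective(log_p_correct, log_p_competing):
    # MMI: log posterior of the correct sequence, i.e.
    # log p(correct) - log sum over all hypotheses (correct + competitors).
    scores = [log_p_correct] + list(log_p_competing)
    m = max(scores)  # log-sum-exp with max-shift for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_p_correct - log_denom

def mce_loss(log_p_correct, log_p_competing, alpha=1.0):
    # MCE: sigmoid-smoothed misclassification measure based on the
    # log likelihood ratio of the best competitor vs. the correct sequence.
    d = max(log_p_competing) - log_p_correct
    return 1.0 / (1.0 + math.exp(-alpha * d))
```

When the correct hypothesis dominates, the MMI objective approaches 0 from below and the MCE loss approaches 0; when a competitor dominates, MMI becomes strongly negative and MCE approaches 1.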


international conference on acoustics, speech, and signal processing | 2007

Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition

Ralf Schlüter; L. Bezrukov; Hannes Wagner; Hermann Ney

In this work, an acoustic feature set based on a gammatone filterbank is introduced for large vocabulary speech recognition. The gammatone features presented here lead to competitive results on the EPPS English task, and considerable improvements were obtained by subsequent combination with a number of standard acoustic features, i.e. MFCC, PLP, MF-PLP, and VTLN plus voicedness. Best results were obtained when combining gammatone features with all other features using weighted ROVER, resulting in a relative improvement of about 12% in word error rate compared to the best single-feature system. We also found that ROVER gives better results for feature combination than both log-linear model combination and LDA.
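ROVER combines the word sequences output by several recognizers by aligning them and voting. The sketch below shows only the voting step, under the simplifying assumption that the hypotheses are already aligned word-for-word; real ROVER first builds a word transition network via dynamic programming alignment:

```python
from collections import Counter

def rover_vote(hypotheses, weights=None):
    # Simplified ROVER voting: pick the (weighted) majority word per slot.
    # hypotheses: list of equal-length word lists, one per system.
    # weights: optional per-system vote weights (uniform by default).
    if weights is None:
        weights = [1.0] * len(hypotheses)
    result = []
    for slot in zip(*hypotheses):
        votes = Counter()
        for word, w in zip(slot, weights):
            votes[word] += w
        result.append(votes.most_common(1)[0][0])
    return result
```

The "weighted ROVER" of the paper corresponds to giving more reliable systems larger weights, so that a strong system can outvote several weaker ones.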


IEEE Transactions on Audio, Speech, and Language Processing | 2015

From feedforward to recurrent LSTM neural networks for language modeling

Martin Sundermeyer; Hermann Ney; Ralf Schlüter

Language models have traditionally been estimated based on relative frequencies, using count statistics that can be extracted from huge amounts of text data. More recently, it has been found that neural networks are particularly powerful at estimating probability distributions over word sequences, giving substantial improvements over state-of-the-art count models. However, the performance of neural network language models strongly depends on their architecture. This paper compares count models to feedforward, recurrent, and long short-term memory (LSTM) neural network variants on two large-vocabulary speech recognition tasks. We evaluate the models in terms of perplexity and word error rate, experimentally validating the strong correlation of the two quantities, which we find to hold regardless of the underlying type of language model. Furthermore, neural networks incur an increased computational complexity compared to count models, and they model context dependencies differently, often exceeding the number of words taken into account by count-based approaches. These differences require efficient search methods for neural networks, and we analyze the potential improvements that can be obtained when applying advanced algorithms to the rescoring of word lattices in large-scale setups.
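Perplexity, the first of the two evaluation measures above, is the exponentiated negative average log probability the language model assigns per word; lower is better, and a uniform model over a vocabulary of size V has perplexity V. A minimal sketch:

```python
import math

def perplexity(word_log_probs):
    # Perplexity = exp(-average log probability per word).
    # word_log_probs: natural-log probabilities the LM assigns
    # to each word of the evaluation text.
    return math.exp(-sum(word_log_probs) / len(word_log_probs))
```

The paper's finding is that word error rate correlates strongly with this quantity whether the probabilities come from count models, feedforward, or recurrent networks.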


international conference on acoustics, speech, and signal processing | 2013

Comparison of feedforward and recurrent neural network language models

Martin Sundermeyer; Ilya Oparin; Jean-Luc Gauvain; B. Freiberg; Ralf Schlüter; Hermann Ney

Research on language modeling for speech recognition has increasingly focused on the application of neural networks. Two competing concepts have been developed: on the one hand, feedforward neural networks representing an n-gram approach; on the other hand, recurrent neural networks that may learn context dependencies spanning more than a fixed number of predecessor words. To the best of our knowledge, no comparison has been carried out between feedforward and state-of-the-art recurrent networks when applied to speech recognition. This paper analyzes this aspect in detail on a well-tuned French speech recognition task. In addition, we propose a simple and efficient method to normalize language model probabilities across different vocabularies, and we show how to speed up training of recurrent neural networks by parallelization.


international conference on machine learning | 2008

Modified MMI/MPE: a direct evaluation of the margin in speech recognition

Georg Heigold; Thomas Deselaers; Ralf Schlüter; Hermann Ney

In this paper we show how common speech recognition training criteria such as the Minimum Phone Error (MPE) criterion or the Maximum Mutual Information (MMI) criterion can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorithms for general machine learning problems. For speech recognition, however, some special problems have to be addressed, and the approaches proposed so far either lack practical applicability or the inclusion of a margin term enforces significant changes to the underlying model, e.g. the optimization algorithm, the loss function, or the parameterization of the model. In our approach, the conventional training criteria are modified to incorporate a margin term. This allows us to do large-margin training in speech recognition using the same efficient algorithms for accumulation and optimization, and to use the same software as for conventional discriminative training. We show that the proposed criteria are equivalent to support vector machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). Experimental results are given for two different tasks: the rather simple digit string recognition task SieTill, which severely suffers from overfitting, and the large vocabulary European Parliament Plenary Sessions (EPPS) English task, which is assumed to be dominated by the empirical risk, so that generalization does not appear to be an issue.
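One way to read the modification described above: each competing hypothesis's score in the MMI denominator is boosted in proportion to its error count, so that competitors making more errors must be beaten by a larger margin. The sketch below illustrates this idea on hypothetical log scores and error counts; it is a schematic reading of margin-modified criteria, not the paper's exact formulation:

```python
import math

def modified_mmi(log_p_correct, competitors, margin=1.0):
    # Margin-modified MMI (sketch): denominator scores are boosted by
    # margin * error count, penalizing competitors that are both likely
    # and wrong. competitors: list of (log_prob, num_errors) pairs;
    # the correct sequence has zero errors by definition.
    scores = [log_p_correct] + [lp + margin * err for lp, err in competitors]
    m = max(scores)  # log-sum-exp with max-shift for stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_p_correct - log_denom
```

For sharply peaked score distributions, the negated objective behaves like a smooth approximation of the hinge loss max(0, max_s(lp_s + margin * err_s) - lp_correct), which is where the connection to support vector machines comes from.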


international conference on acoustics, speech, and signal processing | 2005

Acoustic feature combination for robust speech recognition

András Zolnay; Ralf Schlüter; Hermann Ney

In this paper, we consider the use of multiple acoustic features of the speech signal for robust speech recognition. We investigate the combination of various auditory based (mel frequency cepstrum coefficients, perceptual linear prediction, etc.) and articulatory based (voicedness) features. Features are combined by linear discriminant analysis and log-linear model combination based techniques. We describe the two feature combination techniques and compare the experimental results. Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
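The log-linear model combination mentioned above weights each feature stream's model as p(x) proportional to the product of p_i(x)^lambda_i, i.e. a weighted sum of log scores followed by renormalization over the classes. A minimal sketch with hypothetical class scores and weights:

```python
import math

def log_linear_combination(feature_log_scores, lambdas):
    # Log-linear combination: weighted sum of per-stream log scores,
    # renormalized over classes via log-sum-exp.
    # feature_log_scores: one list of per-class log scores per stream.
    n_classes = len(feature_log_scores[0])
    combined = [sum(lam * stream[c] for lam, stream in zip(lambdas, feature_log_scores))
                for c in range(n_classes)]
    m = max(combined)
    log_z = m + math.log(sum(math.exp(s - m) for s in combined))
    return [s - log_z for s in combined]  # log posteriors
```

The alternative studied in the paper, LDA-based combination, instead concatenates the feature streams and learns a discriminative linear projection of the joint vector.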


international conference on acoustics, speech, and signal processing | 2005

Cross domain automatic transcription on the TC-STAR EPPS corpus

Christian Gollan; Maximilian Bisani; Stephan Kanthak; Ralf Schlüter; Hermann Ney

This paper describes the ongoing development of the British English European Parliament Plenary Session corpus. This corpus will be part of the speech-to-speech translation evaluation infrastructure of the European TC-STAR project. Furthermore, we present first recognition results on the English speech recordings. The transcription system has been derived from an older speech recognition system built for the North-American broadcast news task. We report on the measures taken for rapid cross-domain porting and present encouraging results.


international conference on acoustics, speech, and signal processing | 2013

System combination and score normalization for spoken term detection

Jonathan Mamou; Jia Cui; Xiaodong Cui; Mark J. F. Gales; Brian Kingsbury; Kate Knill; Lidia Mangu; David Nolden; Michael Picheny; Bhuvana Ramabhadran; Ralf Schlüter; Abhinav Sethy; Philip C. Woodland

Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from information retrieval to spoken term detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show a score normalization methodology that improves keyword search performance by 20% on average. Second, we show that properly combining the outputs of diverse ASR systems performs 14% better than the best normalized single ASR system.
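One common family of score normalization schemes in spoken term detection makes detection scores comparable across keywords by normalizing them per keyword, e.g. so that each keyword's scores sum to one. The sketch below shows such a keyword-specific sum-to-one scheme with an optional sharpening exponent; it is a generic illustration, not necessarily the exact methodology of this paper:

```python
def normalize_scores(detections, gamma=1.0):
    # Keyword-specific sum-to-one normalization: each detection score
    # is divided by the total score mass of its keyword, optionally
    # sharpened by an exponent gamma before normalizing.
    # detections: list of (keyword, raw_score) pairs.
    by_keyword = {}
    for kw, score in detections:
        by_keyword.setdefault(kw, []).append(score)
    totals = {kw: sum(s ** gamma for s in scores)
              for kw, scores in by_keyword.items()}
    return [(kw, (score ** gamma) / totals[kw]) for kw, score in detections]
```

Such normalization matters because a single global detection threshold is applied across keywords whose raw score ranges differ widely, especially for the short queries noted above.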

Collaboration

Top co-authors of Ralf Schlüter:

Hermann Ney, RWTH Aachen University
Pavel Golik, RWTH Aachen University