
Publication


Featured research published by Hagen Soltau.


International Conference on Acoustics, Speech, and Signal Processing | 2005

fMPE: discriminatively trained features for speech recognition

Daniel Povey; Brian Kingsbury; Lidia Mangu; George Saon; Hagen Soltau; Geoffrey Zweig

MPE (minimum phone error) is a previously introduced technique for discriminative training of HMM parameters. fMPE applies the same objective function to the features, transforming the data with a kernel-like method and training millions of parameters, comparable to the size of the acoustic model. Despite the large number of parameters, fMPE is robust to over-training. The method is to train a matrix projecting from posteriors of Gaussians to a normal size feature space, and then to add the projected features to normal features such as PLP. The matrix is trained from a zero start using a linear method. Sparsity of posteriors ensures speed in both training and test time. The technique gives similar improvements to MPE (around 10% relative). MPE on top of fMPE results in error rates up to 6.5% relative better than MPE alone, or more if multiple layers of transform are trained.
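As a rough illustration of the feature transform described above, here is a minimal numpy sketch. The dimensions, variable names, and zero initialization are assumptions for illustration; the actual training of the projection matrix under the MPE objective is omitted.

```python
import numpy as np

# Hypothetical sizes: 40-dim PLP-like features, 1000 Gaussians in the posterior pool.
FEAT_DIM, NUM_GAUSS = 40, 1000

# fMPE projection matrix; the abstract notes it is trained from a zero start
# with the MPE objective (training itself is not shown in this sketch).
M = np.zeros((FEAT_DIM, NUM_GAUSS))

def fmpe_transform(x_t: np.ndarray, gauss_post: np.ndarray) -> np.ndarray:
    """Add the projected Gaussian posteriors to the original feature frame.

    x_t        : (FEAT_DIM,) original feature frame (e.g. PLP).
    gauss_post : (NUM_GAUSS,) Gaussian posteriors for this frame; sparse in
                 practice, which keeps the product cheap in training and test.
    """
    return x_t + M @ gauss_post

# Example frame: a random feature vector and a posterior vector with few nonzeros.
x = np.random.randn(FEAT_DIM)
h = np.zeros(NUM_GAUSS)
h[[3, 17, 42]] = [0.6, 0.3, 0.1]   # sparse posteriors summing to 1
print(fmpe_transform(x, h).shape)  # (40,)
```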


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Speaker adaptation of neural network acoustic models using i-vectors

George Saon; Hagen Soltau; David Nahamoo; Michael Picheny

We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
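A minimal sketch of the input augmentation described above, assuming hypothetical dimensions (40-dim acoustic features, 100-dim i-vectors); the i-vector extraction itself is not shown.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate the speaker's i-vector to every acoustic frame.

    frames  : (T, D) acoustic features for one speaker's data.
    ivector : (K,)   speaker identity vector, constant across that speaker's frames.
    returns : (T, D + K) augmented DNN input.
    """
    T = frames.shape[0]
    tiled = np.tile(ivector, (T, 1))          # same i-vector repeated for every frame
    return np.concatenate([frames, tiled], axis=1)

# Hypothetical example: 300 frames of 40-dim features plus a 100-dim i-vector.
feats = np.random.randn(300, 40)
ivec = np.random.randn(100)
print(append_ivector(feats, ivec).shape)      # (300, 140)
```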


IEEE Automatic Speech Recognition and Understanding Workshop | 2001

A one-pass decoder based on polymorphic linguistic context assignment

Hagen Soltau; Florian Metze; Christian Fügen; Alex Waibel

In this study, we examine how fast decoding of conversational speech with large vocabularies profits from efficient use of linguistic information, i.e., language models and grammars. Based on a re-entrant single-pronunciation prefix tree, we use the concept of linguistic context polymorphism to allow an early incorporation of language model information. This approach allows us to use all available language model information in a one-pass decoder, using the same engine to decode with statistical n-gram language models as well as context-free grammars, or to rescore lattices, in an efficient way. We compare this approach to our previous decoder, which needed three passes to incorporate all available information. The results on a very large vocabulary task show that the search can be sped up by almost a factor of three without introducing additional search errors.
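The following is a much-simplified, hypothetical sketch of the core idea: one copy of the pronunciation prefix tree can host tokens for several language-model contexts at once, rather than being duplicated per context. Class and method names are illustrative only, not taken from the actual decoder.

```python
from collections import defaultdict

class TreeNode:
    """One node of a (re-entrant) pronunciation prefix tree.

    Instead of duplicating the tree per language-model history, each node keeps
    a table of active tokens keyed by LM context, so several linguistic contexts
    coexist ("polymorphic" context assignment) within the same tree copy.
    """
    def __init__(self):
        self.children = {}                                   # phone -> TreeNode
        self.tokens = defaultdict(lambda: float("-inf"))     # lm_context -> best score

    def propagate(self, phone, lm_context, score, acoustic_score):
        """Extend a token into a child node, recombining per LM context (Viterbi)."""
        child = self.children.setdefault(phone, TreeNode())
        new_score = score + acoustic_score
        if new_score > child.tokens[lm_context]:
            child.tokens[lm_context] = new_score
        return child

# Tiny usage example with a hypothetical unigram history.
root = TreeNode()
node = root.propagate("HH", lm_context=("<s>",), score=0.0, acoustic_score=-1.5)
print(dict(node.tokens))   # {('<s>',): -1.5}
```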


International Conference on Acoustics, Speech, and Signal Processing | 2001

Advances in automatic meeting record creation and access

Alex Waibel; Michael Bett; Florian Metze; Klaus Ries; Thomas Schaaf; Tanja Schultz; Hagen Soltau; Hua Yu; Klaus Zechner

Oral communication is transient, but many important decisions, social contracts, and fact findings are first carried out in an oral setting, documented in written form, and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. The paper summarizes part of the progress that we have made in this test bed, specifically on the questions of automatic transcription using large-vocabulary continuous speech recognition, information access using non-keyword-based methods, summarization, and user interfaces. The system is capable of automatically constructing a searchable and browsable audio-visual database of meetings and providing access to these records.


Neural Networks | 2015

Deep Convolutional Neural Networks for Large-scale Speech Tasks

Tara N. Sainath; Brian Kingsbury; George Saon; Hagen Soltau; Abdel-rahman Mohamed; George E. Dahl; Bhuvana Ramabhadran

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.
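A minimal sketch of the kind of CNN acoustic model discussed above, written with PyTorch for illustration. The layer sizes, kernel shapes, pooling choice, and number of output targets are assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    """Toy CNN over a log-mel patch: conv + frequency pooling, ReLU, dropout, FC layers."""
    def __init__(self, num_targets: int = 1000):   # num_targets: hypothetical CD-state count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(9, 9), padding=4),   # convolve over freq x time
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),                  # pool along frequency only
            nn.Conv2d(64, 64, kernel_size=(3, 3), padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 13 * 11, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),                                   # ReLU+dropout as in the recipe above
            nn.Linear(1024, num_targets),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Hypothetical input: 40 log-mel bins x 11-frame context window, one input channel.
model = SpeechCNN()
dummy = torch.randn(8, 1, 40, 11)
print(model(dummy).shape)   # torch.Size([8, 1000])
```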


Spoken Language Technology Workshop | 2010

The IBM Attila speech recognition toolkit

Hagen Soltau; George Saon; Brian Kingsbury

We describe the design of IBM's Attila speech recognition toolkit. We show how the combination of a highly modular and efficient library of low-level C++ classes with simple interfaces, an interconnection layer implemented in a modern scripting language (Python), and a standardized collection of scripts for system building produces a flexible and scalable toolkit that is useful both for basic research and for the construction of large transcription systems for competitive evaluations.
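As a purely hypothetical illustration of the layered design the abstract describes, optimized low-level components wired together from a Python scripting layer, here is a sketch using stand-in classes. None of the names below come from the actual Attila API.

```python
# Hypothetical sketch: in the real toolkit the components would be C++ classes
# exposed to Python; here they are plain Python stand-ins.

class FeatureExtractor:                 # stand-in for a wrapped C++ class
    def process(self, wav_path: str):
        return [[0.0] * 40]             # dummy 40-dim feature frames

class Decoder:                          # stand-in for a wrapped C++ decoder
    def __init__(self, graph_path: str, model_path: str):
        self.graph, self.model = graph_path, model_path
    def decode(self, frames):
        return "<hypothetical transcript>"

# The "interconnection layer": a short Python script composing the components.
def transcribe(wav_path: str) -> str:
    feats = FeatureExtractor().process(wav_path)
    dec = Decoder("graph.fst", "acoustic_model.bin")   # hypothetical file names
    return dec.decode(feats)

print(transcribe("meeting.wav"))
```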


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Advances in speech transcription at IBM under the DARPA EARS program

Stanley F. Chen; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Hagen Soltau; Geoffrey Zweig

This paper describes the technical and system-building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system-building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation and produced the most accurate results on the 2004 test sets in every speed category.
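The relative improvement quoted above follows directly from the absolute WER figures; a one-line check:

```python
# Relative word-error-rate reduction quoted above: from 20.4% to 16.1% WER.
baseline_wer, new_wer = 20.4, 16.1
relative_reduction = (baseline_wer - new_wer) / baseline_wer
print(f"{relative_reduction:.1%}")   # 21.1%, i.e. approximately 21% relative
```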


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Improvements to Deep Convolutional Neural Networks for LVCSR

Tara N. Sainath; Brian Kingsbury; Abdel-rahman Mohamed; George E. Dahl; George Saon; Hagen Soltau; Tomas Beran; Aleksandr Y. Aravkin; Bhuvana Ramabhadran

Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
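A minimal sketch of applying a per-speaker fMLLR (affine) feature transform, with hypothetical dimensions and an identity placeholder transform. Estimating the transform under the acoustic model, and the paper's specific way of folding it into log-mel features, are omitted.

```python
import numpy as np

def apply_fmllr(frames: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a per-speaker fMLLR (affine) transform to feature frames.

    frames : (T, D) features for one speaker.
    A, b   : (D, D) matrix and (D,) bias estimated for that speaker
             (maximum-likelihood estimation not shown here).
    """
    return frames @ A.T + b

# Hypothetical 40-dim features for 200 frames; identity transform as a placeholder.
T, D = 200, 40
feats = np.random.randn(T, D)
A, b = np.eye(D), np.zeros(D)
print(apply_fmllr(feats, A, b).shape)   # (200, 40)
```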


International Conference on Acoustics, Speech, and Signal Processing | 1998

Recognition of music types

Hagen Soltau; Tanja Schultz; Martin Westphal; Alex Waibel

This paper describes a music type recognition system that can be used to index and search multimedia databases. A new approach to temporal structure modeling is proposed. The so-called ETM-NN (explicit time modelling with neural network) method uses abstractions of acoustical events captured in the hidden units of a neural network. This new set of abstract features, representing temporal structure, can then be learned by a traditional neural network to discriminate between different types of music. The experiments show that this method outperforms HMMs significantly.
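A much-simplified, hypothetical sketch of the two-stage idea: frame-level hidden activations act as abstract acoustic-event features, which are then summarized over time and classified by a second network. The mean/std summary below is only a stand-in for the paper's explicit time modelling, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: frame-level network (weights would be trained; random here for the sketch).
W1 = rng.standard_normal((64, 40))          # 40-dim frame features -> 64 hidden units

def frame_hidden_activations(frames: np.ndarray) -> np.ndarray:
    """Hidden-unit activations: abstract 'acoustic event' detectors per frame."""
    return np.tanh(frames @ W1.T)           # (T, 64)

# Stage 2: summarize the temporal behaviour of the activations and classify.
W2 = rng.standard_normal((4, 128))          # 4 hypothetical music types

def classify_music(frames: np.ndarray) -> int:
    h = frame_hidden_activations(frames)
    summary = np.concatenate([h.mean(axis=0), h.std(axis=0)])   # (128,) crude temporal summary
    return int(np.argmax(W2 @ summary))

clip = rng.standard_normal((500, 40))       # 500 frames of a hypothetical music clip
print(classify_music(clip))
```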


International Conference on Acoustics, Speech, and Signal Processing | 2005

The IBM 2004 conversational telephony system for rich transcription

Hagen Soltau; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Geoffrey Zweig

This paper describes the technical advances in IBM's conversational telephony submission to the DARPA-sponsored 2004 Rich Transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discriminatively trained full-covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate by approximately 21% relative, on the 2003 test set, over the best-performing system in last year's evaluation, and produced the best results on the RT-04 current and progress CTS data.

Collaboration


Dive into Hagen Soltau's collaborations.

Top Co-Authors

Lidia Mangu (Johns Hopkins University)
Alex Waibel (Karlsruhe Institute of Technology)
Florian Metze (Carnegie Mellon University)
Daniel Povey (Johns Hopkins University)