Michael Pitz
RWTH Aachen University
Publications
Featured research published by Michael Pitz.
IEEE Transactions on Speech and Audio Processing | 2005
Michael Pitz; Hermann Ney
Vocal tract normalization (VTN) is a widely used speaker normalization technique that reduces the effect of different vocal tract lengths and improves the recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain; hence VTN and Maximum Likelihood Linear Regression (MLLR), which so far have been considered independent approaches to speaker normalization and adaptation, are closely related. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalized automatic speech recognition. Indeed, VTN can be viewed as a special case of MLLR, which explains previous experimental findings that the improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions we calculate the transformation matrix analytically and show that the matrices are diagonally dominant and can therefore be approximated by quindiagonal matrices.
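The central claim — that warping the frequency axis acts as a matrix multiplication on the cepstrum — can be illustrated numerically. The sketch below is not the paper's analytic derivation; it evaluates the integral defining the transformation matrix by quadrature, assuming a piecewise-linear warping function whose breakpoint at 0.875π is one common (here illustrative) choice:

```python
import numpy as np

def warp_piecewise_linear(omega, alpha, omega0=0.875 * np.pi):
    """Piecewise-linear warping: slope alpha up to omega0, then a
    compensating segment so that pi always maps to pi."""
    second = alpha * omega0 + (np.pi - alpha * omega0) * (omega - omega0) / (np.pi - omega0)
    return np.where(omega <= omega0, alpha * omega, second)

def vtn_cepstral_matrix(alpha, n_ceps=10, n_grid=4096):
    """Transformation matrix A with c_warped = A @ c, from
    A[n, k] = (2/pi) * integral_0^pi cos(k*g(omega)) cos(n*omega) d_omega
    (factor 1/pi for n = 0), evaluated by the trapezoid rule."""
    omega = np.linspace(0.0, np.pi, n_grid)
    g = warp_piecewise_linear(omega, alpha)
    d_omega = omega[1] - omega[0]
    A = np.empty((n_ceps, n_ceps))
    for n in range(n_ceps):
        scale = (1.0 if n == 0 else 2.0) / np.pi
        for k in range(n_ceps):
            f = np.cos(k * g) * np.cos(n * omega)
            A[n, k] = scale * np.sum(0.5 * (f[1:] + f[:-1])) * d_omega
    return A
```

With alpha = 1 (no warping) the matrix is the identity; for mild warping factors the entries decay away from the main diagonal, which is what makes the quindiagonal approximation discussed in the abstract plausible.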
international conference on acoustics, speech, and signal processing | 2001
Sirko Molau; Michael Pitz; Ralf Schlüter; Hermann Ney
We present a method to derive Mel-frequency cepstral coefficients directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal-analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation. We show that frequency warping schemes such as vocal tract normalization can be integrated easily into our concept without additional computational effort. Recognition test results obtained with the RWTH large-vocabulary speech recognition system are presented for two corpora: the German VerbMobil II dev99 corpus and the English North American Business News '94 20k development corpus.
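A minimal sketch of the general idea (not the authors' exact formulation; the uniform quadrature weighting below is a simplification) is to precompute a single matrix that combines Mel warping and the cosine transform, so the log power spectrum maps to cepstra in one matrix product, with no intermediate triangular filterbank:

```python
import numpy as np

def mel_warp(freq_hz):
    # Standard Mel scale.
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def direct_cepstrum_matrix(n_bins, sample_rate, n_ceps=13):
    """One matrix folding Mel-frequency warping into the cosine
    transform: cepstra = T @ log(power_spectrum)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    warped = mel_warp(freqs)
    warped *= np.pi / warped[-1]                  # normalize to [0, pi]
    n = np.arange(n_ceps)[:, None]
    return np.cos(n * warped[None, :]) / n_bins   # crude uniform quadrature weight

# Usage (power_spectrum: 257 FFT bins at 16 kHz):
#   cepstra = direct_cepstrum_matrix(257, 16000) @ np.log(power_spectrum + 1e-10)
```

A vocal tract normalization warp could simply be composed with `mel_warp` before building the matrix, which is why frequency warping integrates here without extra per-frame cost.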
ieee automatic speech recognition and understanding workshop | 2001
Sirko Molau; Michael Pitz; Hermann Ney
We describe a technique called histogram normalization that aims at normalizing feature-space distributions at different stages of the signal-analysis front end, namely the log-compressed filterbank vectors, the cepstrum coefficients, and the acoustic vectors after linear discriminant analysis (LDA). The best results are obtained at the filterbank, and in most cases there is a minor additional gain when normalization is applied sequentially at several stages. We show that histogram normalization performs best if applied in both training and recognition, and that smoothing the target histogram obtained on the training data is also helpful. On VerbMobil II, a German large-vocabulary conversational speech recognition corpus, we achieve an overall relative reduction in word error rate of about 10%.
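Histogram normalization at a single stage can be sketched as quantile mapping: transform each feature value through the target distribution's inverse CDF applied to its own empirical CDF. This is a minimal sketch; the paper's smoothing of the training-data target histogram is omitted here:

```python
import numpy as np

def histogram_normalize(x, target, n_quantiles=101):
    """Map x through F_target^{-1}(F_x(.)) using matched quantiles, so
    the distribution of the output approximates that of `target`."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    src = np.quantile(x, q)       # quantiles of the input features
    tgt = np.quantile(target, q)  # quantiles of the target histogram
    return np.interp(x, src, tgt)

# Toy usage: skewed "filterbank" features mapped onto a Gaussian target.
rng = np.random.default_rng(0)
feats = rng.exponential(2.0, size=5000)
target = rng.normal(0.0, 1.0, size=5000)
normed = histogram_normalize(feats, target)
```

Applying the same mapping in training and recognition, as the abstract recommends, just means estimating `src` once on training data and reusing it.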
Speech Communication | 2002
Peter Beyerlein; Xavier L. Aubert; Matthew Harris; Dietrich Klakow; Andreas Wendemuth; Sirko Molau; Hermann Ney; Michael Pitz; Achim Sixtus
Automatic speech recognition of real-life broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large-vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undue complexity or computational cost. These key efforts included:
• automatic segmentation of the audio signal into speech utterances;
• efficient one-pass trigram decoding using look-ahead techniques;
• optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC);
• handling short-range and weak longer-range correlations in natural speech and language through the use of phrases and of distance language models;
• improved acoustic modeling through robust feature extraction, channel normalization, and adaptation techniques, as well as automatic script selection and verification.
The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB task (transcription of read newspaper texts) we obtained a word error rate of about 10%. At the conclusion of the system development, we arrived at a DMC-interpolated phrase-based cross-word pentaphone 4-gram system, which transcribes BN data with an overall word error rate of about 17%.
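Of the listed components, the log-linear interpolation used by DMC is easy to sketch. In DMC the weights are trained discriminatively to minimize word error; in this toy illustration (not the Philips system) they are simply fixed:

```python
import numpy as np

def dmc_combine(log_scores, lambdas):
    """Log-linear combination of model scores:
    log p(w|x) proportional to sum_i lambda_i * log p_i(w|x),
    renormalized over candidate hypotheses.
    Rows of log_scores: models; columns: hypotheses."""
    s = lambdas @ log_scores        # weighted combined log scores
    s -= s.max()                    # shift for numerical stability
    p = np.exp(s)
    return p / p.sum()

# Two models scoring three hypotheses, combined with fixed weights.
log_scores = np.log(np.array([[0.7, 0.2, 0.1],
                              [0.1, 0.1, 0.8]]))
posterior = dmc_combine(log_scores, np.array([0.7, 0.3]))
```

Setting one weight to 1 and the rest to 0 recovers the corresponding single model, so linear decoding machinery carries over unchanged.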
Mustererkennung 2000, 22. DAGM-Symposium | 2000
Jörg Dahmen; Daniel Keysers; Michael Pitz; Hermann Ney
In this paper we present different approaches to structuring covariance matrices within statistical classifiers. This is motivated by the fact that the use of full covariance matrices is infeasible in many applications: on the one hand, a high number of model parameters has to be estimated; on the other hand, the computational complexity of a classifier based on full covariance matrices is very high. We propose the use of diagonal and band matrices to replace full covariance matrices, and we also show that the computation of tangent distance is equivalent to using a structured covariance matrix within a statistical classifier.
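The band-matrix idea can be sketched directly: keep only covariance entries within a fixed distance of the main diagonal, with bandwidth 0 recovering the diagonal model and large bandwidths approaching the full matrix (a toy illustration; the tangent-distance equivalence is not shown here):

```python
import numpy as np

def band_limit(cov, bandwidth):
    """Zero out covariance entries more than `bandwidth` off the main
    diagonal. bandwidth = 0 gives the diagonal model; larger values
    interpolate toward the full covariance matrix."""
    i, j = np.indices(cov.shape)
    return np.where(np.abs(i - j) <= bandwidth, cov, 0.0)

# Toy usage: estimate a covariance matrix and band-limit it.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
cov = np.cov(X, rowvar=False)
tridiag = band_limit(cov, bandwidth=1)   # band model: first off-diagonals kept
diag = band_limit(cov, bandwidth=0)      # diagonal covariance as limiting case
```

The parameter count drops from O(d^2) to O(d * bandwidth), which is the trade-off motivating the structured models in the abstract.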
conference of the international speech communication association | 2003
Michael Pitz; Hermann Ney
conference of the international speech communication association | 2001
Michael Pitz; Sirko Molau; Ralf Schlüter; Hermann Ney
conference of the international speech communication association | 2000
Michael Pitz; Frank Wessel; Hermann Ney
conference of the international speech communication association | 1999
Peter Beyerlein; Xavier L. Aubert; Matthew Harris; Dietrich Klakow; Andreas Wendemuth; Sirko Molau; Michael Pitz; Achim Sixtus
Archive | 2005
Michael Pitz; Hermann Ney