Thomas Schaaf
Carnegie Mellon University
Publications
Featured research published by Thomas Schaaf.
international conference on acoustics, speech, and signal processing | 2001
Alex Waibel; Michael Bett; Florian Metze; Klaus Ries; Thomas Schaaf; Tanja Schultz; Hagen Soltau; Hua Yu; Klaus Zechner
Oral communication is transient, but many important decisions, social contracts and fact findings are first carried out in an oral setting, documented in written form and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. The paper summarizes part of the progress that we have made in this test bed, specifically on the questions of automatic transcription using large-vocabulary continuous speech recognition, information access using non-keyword-based methods, summarization, and user interfaces. The system is capable of automatically constructing a searchable and browsable audio-visual database of meetings and provides access to these records.
international conference on acoustics, speech, and signal processing | 1997
Thomas Schaaf; Thomas Kemp
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e., an estimate of which words in the recognizer's output are likely to be correct and which are unreliable. We describe the development of the confidence tagger JANKA, which provides confidence information for the words at the output of the speech recognizer JANUS-3-SR. On a spontaneous German human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%.
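The evaluation metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the JANKA implementation; the thresholding scheme, scores, and example words below are hypothetical, shown only to make the notion of "tagging accuracy" concrete: a tagger labels each hypothesized word correct or incorrect, and tagging accuracy is the fraction of words whose label matches the truth.

```python
# Illustrative sketch (not JANKA): threshold a per-word confidence score to
# tag each hypothesized word as correct (True) or incorrect (False), then
# score the tagger against a reference correctness labeling.

def tag_words(scored_words, threshold=0.5):
    """Label each (word, confidence) pair: True if tagged as correct."""
    return [(word, conf >= threshold) for word, conf in scored_words]

def tagging_accuracy(tags, reference):
    """Fraction of words whose predicted correctness matches the reference."""
    assert len(tags) == len(reference)
    hits = sum(1 for (_, pred), truth in zip(tags, reference) if pred == truth)
    return hits / len(tags)

# Hypothetical recognizer output with confidence scores.
hyp = [("guten", 0.92), ("tag", 0.88), ("herr", 0.31), ("schmidt", 0.75)]
truth = [True, True, False, True]   # which words were actually correct
print(tagging_accuracy(tag_words(hyp), truth))  # 1.0 on this toy example
```

A real confidence tagger would derive the scores from recognizer-internal features (e.g. acoustic and language-model evidence) rather than take them as given.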
text speech and dialogue | 2006
Nico Schlaefer; Petra Gieselmann; Thomas Schaaf; Alex Waibel
This paper describes the Ephyra question answering engine, a modular and extensible framework that allows multiple approaches to question answering to be integrated in one system. Our framework can be adapted to languages other than English by replacing language-specific components. It supports the two major approaches to question answering, knowledge annotation and knowledge mining. Ephyra uses the web as a data resource, but could also work with smaller corpora. In addition, we propose a novel approach to question interpretation which abstracts from the original formulation of the question. Text patterns are used to interpret a question and to extract answers from text snippets. Our system automatically learns the patterns for answer extraction, using question-answer pairs as training data. Experimental results revealed the potential of this approach.
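The general idea behind pattern-based answer extraction can be sketched with a regular expression. This is not Ephyra's code; the pattern, question type, and text snippet are invented examples of how a learned surface pattern could pull an answer out of a snippet.

```python
# Hypothetical sketch of pattern-based answer extraction: a learned pattern
# for a question type like "When was X born?" is instantiated with the
# question's key term and matched against retrieved text snippets.
import re

def make_birth_pattern(person):
    """Build a surface pattern that captures a four-digit year as the answer."""
    return re.compile(re.escape(person) + r"\s+was born (?:in|on)\s+(\d{4})")

snippet = "Wolfgang Amadeus Mozart was born in 1756 in Salzburg."
match = make_birth_pattern("Wolfgang Amadeus Mozart").search(snippet)
if match:
    print(match.group(1))  # extracted answer: 1756
```

In a pattern-learning system, many such patterns would be induced automatically from question-answer training pairs rather than written by hand.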
Speech Communication | 2004
John W. McDonough; Thomas Schaaf; Alex Waibel
Modern speech recognition systems are based on the hidden Markov model (HMM) and employ cepstral features to represent input speech. In speaker normalization, the cepstral features of speech from a given speaker are transformed to match the speaker independent HMM. In speaker adaptation, the means of the HMM are transformed to match the input speech. Vocal tract length normalization (VTLN) is a popular normalization scheme wherein the frequency axis of the short-time spectrum is rescaled prior to the extraction of cepstral features. In this work, we develop novel speaker adaptation schemes by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We describe two classes of such maps: rational all-pass transforms (RAPTs) which are well-known in the signal processing literature, and sine-log all-pass transforms (SLAPTs) which are novel in this work. For both classes of maps, we develop the relations necessary to perform maximum likelihood estimation of the relevant transform parameters using enrollment data from a new speaker. We also propose the means by which an HMM may be trained specifically for use with this type of adaptation. Finally, in a set of recognition experiments conducted on conversational speech material from the Switchboard Corpus as well as the English Spontaneous Scheduling Task, we demonstrate the capacity of APT-based speaker adaptation to achieve word error rate reductions superior to those obtained with other popular adaptation techniques, and moreover, reductions that are additive with those provided by VTLN.
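For readers unfamiliar with all-pass transforms, the classical first-order case (the bilinear map, a special case of the RAPT family the abstract refers to) shows how a single parameter warps the frequency axis; the paper's RAPT and SLAPT parameterizations generalize this form.

```latex
% First-order rational all-pass transform (bilinear map); |alpha| < 1
% controls the strength of the warp.
A(z) = \frac{z - \alpha}{1 - \alpha z}, \qquad |\alpha| < 1 .
% Evaluated on the unit circle z = e^{j\omega}, it induces the frequency warping
\tilde{\omega}(\omega) = \omega
  + 2\arctan\!\left(\frac{\alpha\sin\omega}{1 - \alpha\cos\omega}\right),
% which, because the map is all-pass, can be realized as a linear
% transformation of the cepstral coefficients.
```

This is why the adaptation can be carried out "entirely in the cepstral domain": warping the spectrum with such a map corresponds to a matrix applied to the cepstra, whose parameters can then be estimated by maximum likelihood.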
international conference on multimodal interfaces | 2002
Ivica Rogina; Thomas Schaaf
Archiving, indexing, and later browsing through stored presentations and lectures are increasingly common. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer to a lecture such that the recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We define a tracking accuracy measure which measures how well a system can automatically align recognized words with parts of a presentation, and show that by prior exploitation of the presented documents, the tracking accuracy can be improved. The system described in this paper is part of an intelligent meeting room developed in the European Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).
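A tracking-accuracy measure in the spirit described above can be sketched simply. This is a hypothetical illustration, not the paper's definition: each recognized word is assigned to a part of the presentation (e.g. a slide), and tracking accuracy is the fraction of words assigned to the same part as in a reference alignment.

```python
# Hypothetical sketch: score a hypothesized word-to-slide alignment against
# a reference alignment; entries are the slide index each word maps to.

def tracking_accuracy(hyp_alignment, ref_alignment):
    """Fraction of recognized words aligned to the correct presentation part."""
    assert len(hyp_alignment) == len(ref_alignment)
    correct = sum(h == r for h, r in zip(hyp_alignment, ref_alignment))
    return correct / len(ref_alignment)

hyp = [0, 0, 1, 1, 2, 2, 2, 3]   # slide index per recognized word (invented)
ref = [0, 0, 1, 2, 2, 2, 2, 3]   # reference alignment (invented)
print(tracking_accuracy(hyp, ref))  # 0.875
```

The interesting part of the paper's approach is upstream of this score: biasing the recognizer's language model with the presented documents so that the alignment itself becomes more accurate.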
international conference on acoustics, speech, and signal processing | 2002
John W. McDonough; Thomas Schaaf; Alex Waibel
In this work, we combine maximum mutual information-based parameter estimation with speaker-adapted training (SAT). As will be shown, this can be achieved by performing unsupervised parameter estimation on the test data, a distinct advantage for many recognition tasks involving conversational speech. We also propose an approximation to the maximum likelihood and maximum mutual information SAT re-estimation formulae that greatly reduces the amount of disk space required to conduct training on corpora such as Broadcast News, which contains speech from thousands of speakers. We present the results of a set of speech recognition experiments on three test sets: the English Spontaneous Scheduling Task corpus, Broadcast News, and a new corpus of Meeting Room data collected at the Interactive Systems Laboratories of Carnegie Mellon University.
ieee automatic speech recognition and understanding workshop | 2005
M. Paulik; Sebastian Stüker; Christian Fügen; Tanja Schultz; Thomas Schaaf; Alex Waibel
Nowadays official documents have to be made available in many languages, as for example in the EU with its 20 official languages, so the need for effective tools to aid the multitude of human translators in their work is readily apparent. An ASR system that enables the human translator to speak a translation in an unrestricted manner, instead of typing it, constitutes such a tool. In this work we improve the recognition performance of such an ASR system on the target language of the human translator by taking advantage of either a written or a spoken source-language representation. To do so, machine translation techniques are used to translate between the languages, and the involved ASR systems are then biased towards the gained knowledge. We present an iterative approach for ASR improvement and outperform our baseline system by a relative word error rate reduction of 35.8%/29.9% in the case of a written/spoken source-language representation. Further, we show how multiple target languages, as for example provided by different simultaneous translators during European Parliament debates, can be incorporated into our system design for an improvement of all involved ASR systems.
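Results like the 35.8% figure above are relative word-error-rate reductions. The formula is standard; the example numbers below are hypothetical and chosen only to illustrate the computation, not taken from the paper.

```python
# Relative WER reduction: the fraction of the baseline error rate that the
# improved system removes, expressed in percent. Example values are invented.

def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative reduction in percent between two word error rates."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer

print(relative_wer_reduction(20.0, 15.0))  # 25.0
```

Note that a relative reduction can look large even when the absolute improvement is modest, which is why abstracts typically state which kind of reduction they report.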
international conference on human language technology research | 2001
Alex Waibel; Hua Yu; Martin Westphal; Hagen Soltau; Tanja Schultz; Thomas Schaaf; Yue Pan; Florian Metze; Michael Bett
Speech recognition has advanced considerably, but has been limited almost entirely either to situations in which close speaking microphones are natural and acceptable (telephone, dictation, command&control, etc.) or in which high-quality recordings are ensured. Furthermore, most recognition applications involve controlled recording environments, in which the user turns the recognition event on and off and speaks cooperatively for the purpose of being recognized.
KI'06 Proceedings of the 29th annual German conference on Artificial intelligence | 2006
Hartwig Holzapfel; Thomas Schaaf; Hazim Kemal Ekenel; Christoph Schaa; Alex Waibel
Acquiring knowledge about persons is a key functionality for humanoid robots. In a natural environment, the robot not only interacts with different people whom it recognizes and knows; it will also have to interact with unknown persons, and by acquiring information about them, the robot can memorize these persons and provide extended personalized services. Today, researchers build systems to recognize a person's face, voice and other features. Most of them depend on precollected data. We think that with the given technology it is about time to build a system that collects data autonomously and thus gets to know and learns to recognize persons completely on its own. This paper describes the integration of different perceptual and dialog components and their individual functionality to build a robot that can contact persons, learn their names, and learn to recognize them in future encounters.
international conference on acoustics, speech, and signal processing | 2001
Hagen Soltau; Thomas Schaaf; Florian Metze; Alex Waibel
Describes the 2000 ISL large vocabulary speech recognition system for fast decoding of conversational speech which was used in the German Verbmobil-II project. The challenge of this task is to build robust acoustic models to handle different dialects, spontaneous effects, and crosstalk as occur in conversational speech. We present speaker incremental normalization and adaptation experiments close to real-time constraints. To reduce the number of consequential errors caused by out-of-vocabulary words, we conducted filler-model experiments to handle unknown proper names. The overall improvements from 1998 to 2000 resulted in a word error reduction from 40% to 17% on our development test set.