Matthias Eichner
Dresden University of Technology
Publications
Featured research published by Matthias Eichner.
international conference on acoustics, speech, and signal processing | 2004
Matthias Eichner; Matthias Wolff; Rüdiger Hoffmann
In the past, several approaches have been proposed for voice conversion in TTS systems. Mostly, conversion is done by modifying the spectral properties and pitch to match a certain target voice. This conversion causes distortions that degrade the quality of the synthesized speech. In this paper we investigate a very simple and straightforward method for voice conversion. It generates a new voice from the source speaker instead of reproducing a certain target speaker's voice. For application in TTS systems it is often sufficient to synthesize new voices that sound different enough to be distinguishable from each other. This is done by applying a spectral warping technique commonly used for speaker normalization in speech recognition systems, vocal tract length normalization (VTLN). Due to its low resource requirements, this method is especially well suited for embedded systems.
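The abstract does not give the warping function itself; the following is a minimal Python sketch of VTLN-style spectral warping using a simple linear frequency mapping (a full piecewise-linear warp would add a breakpoint near the Nyquist frequency). The warping factor `alpha` and all names are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def vtln_warp(magnitude_spectrum, alpha):
    """Warp a magnitude spectrum along the frequency axis.

    Each target bin i takes its value from source bin i * alpha, so
    alpha > 1 compresses the spectrum toward low frequencies and
    alpha < 1 stretches it. The linear mapping is an illustrative
    stand-in for the paper's (unspecified) warping function.
    """
    n_bins = len(magnitude_spectrum)
    bins = np.arange(n_bins)
    warped_bins = np.clip(bins * alpha, 0, n_bins - 1)
    return np.interp(warped_bins, bins, magnitude_spectrum)

# Example: derive a "new" voice by warping each analysis frame.
frame = np.abs(np.fft.rfft(np.random.randn(512)))  # stand-in spectrum
new_voice_frame = vtln_warp(frame, alpha=1.1)
```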
IEEE Transactions on Speech and Audio Processing | 2004
Steffen Werner; Matthias Eichner; Matthias Wolff; Ruediger Hoffmann
State-of-the-art speech synthesis systems achieve a high overall quality. However, synthesized speech still lacks naturalness. To produce more natural and colloquial synthetic speech, our research focuses on integrating effects present in spontaneous speech. Conventional speech synthesis systems do not consider the probability of a word in its context. Recent investigations on corpora of natural speech showed that words that are very likely to occur in a given context are pronounced less accurately and faster than improbable ones. In this paper, three approaches are introduced to model this effect found in spontaneous speech. The first algorithm changes the speaking rate directly by shortening or lengthening the syllables of a word depending on the language model probability of that word. Since probable words are not only pronounced faster but also less accurately, this approach was extended by selecting appropriate pronunciation variants of a word according to the language model probability. This second algorithm changes the local speaking rate indirectly by controlling the grapheme-phoneme conversion. In a third stage, a pronunciation sequence model was used to select the appropriate variants according to their sequence probability. In listening experiments, test participants were asked to rate the synthesized speech in the categories of colloquial impression and naturalness. Our approaches achieved a significant improvement in the category of colloquial impression; however, no significantly higher naturalness could be observed. The observed effects are discussed in detail.
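As a rough illustration of the first algorithm, here is a minimal sketch that maps a word's language-model probability to a syllable duration factor, so probable words are shortened and improbable ones lengthened. The linear mapping and all constants are assumptions of this sketch, not the paper's parameters.

```python
def duration_factor(word_logprob, mean_logprob, scale=0.05,
                    min_factor=0.8, max_factor=1.2):
    """Map a word's LM log-probability to a duration scaling factor.

    Words more probable than average get a factor < 1 (spoken faster),
    improbable words a factor > 1 (spoken slower). The mapping and
    constants are illustrative, not the paper's actual values.
    """
    factor = 1.0 - scale * (word_logprob - mean_logprob)
    return max(min_factor, min(max_factor, factor))

def scale_syllable_durations(syllable_durations_ms, factor):
    """Shorten or lengthen every syllable of a word by `factor`."""
    return [d * factor for d in syllable_durations_ms]

# Example: a highly probable word gets compressed (factor 0.85).
f = duration_factor(word_logprob=-2.0, mean_logprob=-5.0)
print(scale_syllable_durations([180.0, 220.0], f))
```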
international conference on acoustics, speech, and signal processing | 2001
Matthias Eichner; Matthias Wolff; Sebastian Ohnewald; Rüdiger Hoffmann
Speech synthesis systems based on concatenation of natural speech segments achieve high quality in terms of naturalness and intelligibility. However, in many applications such systems are not easy to deploy because of their large storage requirements. Speech synthesis systems based on HMMs could be an alternative to concatenative systems but do not yet achieve the quality needed for use in applications. In one of our research projects we investigate the possibility of combining speech synthesis and speech recognition into a unified system that uses the same databases and similar algorithms for synthesis and recognition. In this context we examine the suitability of stochastic Markov graphs (SMGs) instead of HMMs for improving the performance of such synthesis systems. The paper describes the training procedure we used for the SMGs, explains the synthesis process, and introduces an algorithm for state selection and state duration modeling. We focus particularly on issues that arise when using SMGs instead of HMMs.
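The paper's state selection and duration algorithms are not reproduced in this abstract; as a minimal sketch of the underlying idea, the following generates a state path through a Markov graph with an explicit duration drawn per visited state. The graph topology and the Poisson duration distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_state_path(transitions, mean_durations, start=0, end=None):
    """Walk a Markov graph, drawing an explicit duration per state.

    `transitions[s]` maps state s to {successor: probability}; unlike
    a plain left-to-right HMM, a Markov graph may branch and skip.
    Poisson durations are used purely for illustration.
    """
    path, state = [], start
    while state != end:
        frames = max(1, rng.poisson(mean_durations[state]))
        path.extend([state] * frames)
        succ = list(transitions[state].keys())
        probs = list(transitions[state].values())
        state = succ[rng.choice(len(succ), p=probs)]
    return path

# Toy graph: state 1 may be skipped, state 3 is the exit.
transitions = {0: {1: 0.7, 2: 0.3}, 1: {2: 1.0}, 2: {3: 1.0}}
print(sample_state_path(transitions, {0: 4, 1: 3, 2: 5}, end=3))
```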
COST 2102'07 Proceedings of the 2007 COST action 2102 international conference on Verbal and nonverbal communication behaviours | 2007
Rüdiger Hoffmann; Matthias Eichner; Matthias Wolff
During the last few years, a framework for the development of algorithms for speech analysis and synthesis was implemented. The algorithms are connected to common databases at the different levels of a hierarchical structure. This framework, called UASR (Unified Approach for Speech Synthesis and Recognition), and some related experiments and applications are described. Special focus is directed to the suitability of the system for processing nonverbal signals. This part is related to the analysis methods now addressed in the COST 2102 initiative. A potential application field in interaction research is discussed.
international conference on acoustics, speech, and signal processing | 2004
Constanze Tschöpe; Dieter Hentschel; Matthias Wolff; Matthias Eichner; Rüdiger Hoffmann
Non-speech acoustic signals are widely used as the input of systems for non-destructive testing. In this rapidly growing field, the signals are of increasing complexity, so that powerful models are required. Methods like DTW and HMMs, which are established in speech recognition, have been used successfully but are not sufficient in all cases. We propose the application of generalized structured Markov graphs (SMGs). We describe a task-independent structure learning technique that automatically adapts the models to the structure of the test signals. We demonstrate that our solution outperforms hand-tuned HMM structures in terms of class discrimination in two case studies using data from real applications.
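The structure learning technique itself is not detailed in the abstract; as a hedged illustration of automatic structure adaptation, here is a sketch that selects among candidate model structures by the Bayesian information criterion. This is a common, generic model-selection approach standing in for the paper's algorithm, not necessarily the method used there.

```python
import math

def bic(log_likelihood, n_params, n_observations):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_observations)

def select_structure(candidates, n_observations):
    """Pick the candidate structure with the lowest BIC.

    `candidates` is a list of (log_likelihood, n_params) pairs, one
    per trained candidate model; the penalty term keeps the search
    from always preferring the most complex graph.
    """
    scores = [bic(ll, k, n_observations) for ll, k in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)

# Example: three candidate graphs of increasing complexity.
print(select_structure([(-1200.0, 20), (-1100.0, 45), (-1095.0, 90)], 5000))
```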
international conference on acoustics, speech, and signal processing | 2007
Matthias Wolff; Ulrich Kordon; H. Husssein; Matthias Eichner; Ruediger Hoffmann; Constanze Tschöpe
This paper reports on a study applying an HMM-based labeler, along with a tailored feature extraction, to Korotkoff sounds. These sounds can be heard through a stethoscope during auscultatory blood pressure measurement, as commonly performed in medical practice. While this method works well when the patient is at rest, interfering noise from muscles and joints causes major problems when the subject is engaged in activities such as sports or fitness exercises. We propose a signal processing and classification method to overcome these difficulties and present first promising results.
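The tailored feature extraction is not specified in the abstract; the sketch below shows one plausible front end, band-pass filtering followed by short-time log energy, since Korotkoff sounds concentrate their energy at low frequencies. The 20-120 Hz pass band and the frame/hop sizes are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def korotkoff_features(signal, fs, lo_hz=20.0, hi_hz=120.0,
                       frame_ms=25.0, hop_ms=10.0):
    """Band-pass the signal and return log short-time energy per frame.

    The pass band and frame/hop sizes are assumptions of this sketch;
    the paper's actual feature extraction is not given in the abstract.
    """
    b, a = butter(4, [lo_hz / (fs / 2), hi_hz / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal)
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    energies = [
        np.log(np.sum(filtered[i:i + frame] ** 2) + 1e-10)
        for i in range(0, len(filtered) - frame, hop)
    ]
    return np.array(energies)
```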
international conference on information technology coding and computing | 2004
Steffen Werner; Matthias Wolff; Matthias Eichner; Rüdiger Hoffmann
We investigate the deployment possibilities of speech-enabled services in a Web-based e-learning environment. The integration of speech technology is realized with a client/server architecture: the speech synthesis, speech recognition, and speaker verification services are installed on a central SpeechServer. The client uses a Java applet (SpeechApplet), which is integrated into an HTML page. It takes the user's input (e.g. speech or text) and activates the corresponding service on the SpeechServer. The SpeechApplet is easy to integrate into existing Web pages and provides a simple JavaScript interface for communication between the Web page and the applet. In this paper we introduce the system, explain its different modules, and discuss first evaluation results for these technologies.
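The paper's client is a Java applet; purely to illustrate the client/server pattern it describes, here is a minimal Python client against a hypothetical HTTP endpoint. The server address, route, and JSON schema are invented for this sketch and are not taken from the paper.

```python
import json
import urllib.request

SPEECH_SERVER = "http://localhost:8080"  # hypothetical SpeechServer address

def request_service(service, payload):
    """Send a request to one of the central speech services.

    `service` would be "synthesis", "recognition", or "verification";
    the route and payload schema are assumptions of this sketch.
    """
    req = urllib.request.Request(
        f"{SPEECH_SERVER}/{service}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example: ask the synthesis service to speak a prompt.
# result = request_service("synthesis", {"text": "Welcome to the course."})
```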
international conference on acoustics, speech, and signal processing | 2004
Steffen Werner; Matthias Wolff; Matthias Eichner; Rüdiger Hoffmann
Integration of pronunciation modeling into speech synthesis makes synthetic speech more natural and colloquial. Pronunciation variation, as one observable effect in spontaneous speech, is a step towards spontaneous speech synthesis. In previous work (see Proc. ICASSP, vol.1, p.417-20, Orlando, FL, USA, 2002 and Proc. ICASSP, Hong Kong, PR China, Apr. 2003) we introduced different duration control methods in speech synthesis. These methods are based on the observation that words that are very likely to occur in a given context are pronounced faster and less accurately than improbable ones (see D. Jurafsky et al., Proc. ICASSP, vol.2, p.801-4, Salt Lake City, USA, 2001). Therefore we use the probability of a word in its context either to control the local speaking rate directly, or to select appropriate pronunciation variants in order to change the local speaking rate. Extending these methods by a pronunciation sequence model, we incorporate knowledge about how well two subsequent variants fit together. Using the proposed algorithm, we were able to further improve the natural and colloquial listening impression.
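As a sketch of the extension by a pronunciation sequence model, the following dynamic-programming search picks, for each word, the variant that maximizes a combined per-variant and variant-bigram score. The two scoring functions and the data layout are assumptions of this sketch.

```python
def select_variants(words, variant_logprob, bigram_logprob):
    """Viterbi search over pronunciation variants of a word sequence.

    `words` is a list of (word, variants) pairs. `variant_logprob`
    scores a single variant; `bigram_logprob(prev_v, v)` scores how
    well two subsequent variants fit together (the role of the
    pronunciation sequence model). Both functions are assumptions.
    """
    first_word, first_vars = words[0]
    best = {v: (variant_logprob(first_word, v), [v]) for v in first_vars}
    for word, variants in words[1:]:
        new_best = {}
        for v in variants:
            score, path = max(
                ((s + bigram_logprob(pv, v) + variant_logprob(word, v),
                  p + [v]) for pv, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[v] = (score, path)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```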
international conference on acoustics, speech, and signal processing | 2002
Matthias Eichner; Matthias Wolff; Rüdiger Hoffmann
Speech synthesis systems based on concatenation of segments derived from natural speech are very intelligible and achieve a high overall quality. Nevertheless, listeners often complain about wrong or missing temporal structure and timing in synthetic speech. We propose a new approach to duration control in speech synthesis that uses the probability of a word in its context to control the local speaking rate within the utterance. This idea is based on the observation that words that are very likely to occur in a given context are pronounced less accurately and faster than improbable ones. In this paper we introduce an algorithm that implements this duration control using a multigram language model and present first experimental results.
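A multigram language model treats a sentence as a concatenation of variable-length word units; below is a hedged sketch of the forward dynamic program that sums the probability over all segmentations. The unit inventory, its probabilities, and the maximum unit length are illustrative assumptions.

```python
def multigram_likelihood(words, unit_prob, max_len=3):
    """Total probability of a word sequence under a multigram model.

    `alpha[i]` accumulates the probability of all segmentations of
    the first i words into units of length 1..max_len. The
    `unit_prob` dictionary (unit tuple -> probability) and `max_len`
    are assumptions of this sketch.
    """
    n = len(words)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            unit = tuple(words[i - l:i])
            alpha[i] += alpha[i - l] * unit_prob.get(unit, 0.0)
    return alpha[n]

# Toy example with single-word and two-word units.
probs = {("good",): 0.3, ("morning",): 0.2, ("good", "morning"): 0.5}
print(multigram_likelihood(["good", "morning"], probs))  # 0.3*0.2 + 0.5
```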
international conference on acoustics, speech, and signal processing | 2003
Matthias Eichner; Steffen Werner; Matthias Wolff; Rüdiger Hoffmann
State-of-the-art speech synthesis systems achieve a high overall quality. However, the synthesized speech still lacks naturalness. To make speech synthesis more natural and colloquial, we are trying to integrate effects that are observable in spontaneous speech. In a previous paper we introduced a new approach to duration control in speech synthesis that uses the probability of a word in its context to control the local speaking rate within the utterance. This idea is based on the observation that words that are very likely to occur in a given context are pronounced faster than improbable ones. Since probable words are not only pronounced faster but also less accurately, we extend this approach by selecting appropriate pronunciation variants to realize the change in the local speaking rate.