Publications


Featured research published by Tobias Cincarek.


DAGM Conference on Pattern Recognition | 2005

Pronunciation feature extraction

Christian Hacker; Tobias Cincarek; Rainer Gruhn; Stefan Steidl; Elmar Nöth; Heinrich Niemann

Automatic pronunciation scoring makes novel applications for computer-assisted language learning possible. In this paper we concentrate on feature extraction. A relatively large feature vector with 28 sentence-level and 33 word-level features has been designed. At the word level, correctly pronounced and mispronounced words are classified; at the sentence level, utterances are rated with 5 discrete marks. The features are evaluated on two databases with non-native adults' and children's speech, respectively. Up to 72% class-wise averaged recognition rate is achieved for 2 classes; the result of the 5-class problem can be interpreted as 80% recognition rate.
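
A rough illustration of the kind of word-level features such a scorer can use is sketched below in Python. The likelihood-ratio confidence, the duration feature, and the threshold rule are illustrative stand-ins, not the paper's actual 28+33-dimensional feature set or its trained classifier.

```python
import numpy as np

def word_features(ll_forced, ll_free, n_frames, n_phones):
    """Illustrative word-level pronunciation features.

    ll_forced : acoustic log-likelihood of the word when forced-aligned
                to its canonical (expected) pronunciation
    ll_free   : log-likelihood of the best unconstrained phone sequence
    n_frames  : word duration in frames
    n_phones  : number of phones in the canonical pronunciation
    """
    # Likelihood-ratio confidence: a correctly pronounced word scores
    # nearly as well on the forced path as on the free phone loop.
    confidence = (ll_forced - ll_free) / max(n_frames, 1)
    avg_phone_dur = n_frames / max(n_phones, 1)   # duration feature
    avg_ll = ll_forced / max(n_frames, 1)         # frame-normalized score
    return np.array([confidence, avg_phone_dur, avg_ll])

def classify_word(features, threshold=-0.5):
    # Toy two-class decision (correct vs. mispronounced); the paper
    # trains a classifier on the full feature vector instead.
    return "mispronounced" if features[0] < threshold else "correct"
```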


International Conference on Robot Communication and Coordination | 2007

Voice activity detection applied to hands-free spoken dialogue robot based on decoding using acoustic and language model

Hiroyuki Sakai; Tobias Cincarek; Hiromichi Kawanami; Hiroshi Saruwatari; Kiyohiro Shikano; Akinobu Lee

Speech recognition and speech-based dialogue are means for realizing communication between humans and robots. In a conventional system setup, a headset or a directional microphone is used to collect speech with a high signal-to-noise ratio (SNR). However, the user must wear a microphone or has to approach the system closely for interaction. It is therefore preferable to develop a hands-free speech recognition system which enables the user to speak to the system from a distant point. To collect speech from distant speakers, a microphone array is usually employed. However, the SNR will degrade in a real environment because of the presence of various kinds of background noise besides the user's utterance. This will most often decrease speech recognition performance, and no reliable speech dialogue would be possible. Voice Activity Detection (VAD) is a method to detect the user utterance part in the input signal. If VAD fails, all following processing steps including speech recognition and dialogue will not work. Conventional VAD based on amplitude level and zero-crossing count is difficult to apply to hands-free speech recognition, because speech detection will most often fail due to low SNR. This paper proposes a VAD method based on an acoustic model (AM) for background noise and the speech recognition algorithm, applied to hands-free speech recognition. There will always be non-speech segments at the beginning and end of each user utterance. The proposed VAD approach compares the likelihood of phoneme and silence segments in the top recognition hypotheses during decoding. We implemented the proposed method for the open-source speech recognition engine Julius. Experimental results for various SNR conditions show that the proposed method attains a higher VAD accuracy and a higher recognition rate than conventional VAD.
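
The core idea can be sketched as follows, simplified to a frame-level comparison of speech- and noise-model likelihoods; the paper itself compares phoneme and silence likelihoods of the top hypotheses inside the Julius decoder, so the GMM parameters, margin, and hangover length here are assumptions for illustration.

```python
import numpy as np

def gmm_loglik(frames, means, variances, weights):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    frames: [T, D]; means/variances: [K, D]; weights: [K]."""
    comp = np.log(weights)[None, :] - 0.5 * np.sum(
        (frames[:, None, :] - means[None, :, :])**2 / variances[None, :, :]
        + np.log(2 * np.pi * variances[None, :, :]), axis=2)
    return np.logaddexp.reduce(comp, axis=1)

def vad_decision(frames, speech_gmm, noise_gmm, margin=0.0, hangover=10):
    """Mark a frame as speech when the speech model beats the noise model
    by `margin`; hangover smoothing keeps the speech flag up for a few
    frames afterwards so word endings are not clipped.
    speech_gmm/noise_gmm are (means, variances, weights) tuples."""
    score = gmm_loglik(frames, *speech_gmm) - gmm_loglik(frames, *noise_gmm)
    raw = score > margin
    decision = np.zeros_like(raw)
    run = 0
    for t, flag in enumerate(raw):
        run = hangover if flag else max(run - 1, 0)
        decision[t] = run > 0
    return decision
```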


Journal of Pattern Recognition Research | 2009

QMOS - A Robust Visualization Method for Speaker Dependencies with Different Microphones

Andreas K. Maier; Maria Schuster; Ulrich Eysholdt; Tino Haderlein; Tobias Cincarek; Stefan Steidl; Anton Batliner; Stefan Wenhardt; Elmar Nöth

There are several methods to create visualizations of speech data. All of them, however, lack the ability to remove microphone-dependent distortions. We examined the use of Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the COmprehensive Space Map of Objective Signal (COSMOS) method in this work. To solve the problem of lacking microphone independency of PCA, LDA, and COSMOS, we present two methods to reduce the influence of the recording conditions on the visualization. The first one is a rigid registration of maps created from identical speakers recorded under different conditions, i.e. different microphones and distances. The second method is an extension of the COSMOS method, which performs a non-rigid registration during the mapping procedure. As a measure for the quality of the visualization, we computed the mapping error which occurs during the dimension reduction and the grouping error as the average distance between the representations of the same speaker recorded by different microphones. The best linear method in leave-one-speaker-out evaluation is PCA plus rigid registration, with a mapping error of 47% and a grouping error of 18%. The proposed method, however, surpasses this even further with a mapping error of 24% and a grouping error which is close to zero.

Keywords: speech intelligibility, speech and voice disorders, speech evaluation, dimensionality reduction, Sammon mapping, QMOS, COSMOS, COmprehensive Space Map of Objective Signal
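
A small numpy sketch of the "PCA plus rigid registration" baseline, assuming the two maps contain the same speakers in the same order; it uses a plain orthogonal Procrustes solution and omits both the reflection check and the non-rigid COSMOS extension.

```python
import numpy as np

def pca_map(features, dim=2):
    """Project high-dimensional speaker features to a 2-D map via PCA."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def rigid_register(source, target):
    """Rotate/translate `source` onto `target` (orthogonal Procrustes).

    Both maps contain the same speakers recorded under two microphone
    conditions; registration removes the global shift and rotation
    that the recording condition introduces."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    u, _, vt = np.linalg.svd((source - mu_s).T @ (target - mu_t))
    return (source - mu_s) @ (u @ vt) + mu_t

def grouping_error(map_a, map_b):
    """Mean distance between the two representations of each speaker
    after registration (lower is better)."""
    return np.mean(np.linalg.norm(map_a - map_b, axis=1))
```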


IEICE Transactions on Information and Systems | 2008

Development, Long-Term Operation and Portability of a Real-Environment Speech-Oriented Guidance System

Tobias Cincarek; Hiromichi Kawanami; Ryuichi Nisimura; Akinobu Lee; Hiroshi Saruwatari; Kiyohiro Shikano

In this paper, the development, long-term operation and portability of a practical ASR application in a real environment are investigated. The target application is a speech-oriented guidance system installed at a local community center. The system has been exposed to ordinary people since November 2002. More than 300 hours, or more than 700,000 inputs, have been collected during four years. The outcome is a rare example of a large-scale real-environment speech database. A simulation experiment is carried out with this database to investigate how the system's performance improves during the first two years of operation. The purpose is to determine empirically the amount of real-environment data which has to be prepared to build a system with reasonable speech recognition performance and response accuracy. Furthermore, the relative importance of developing the main system components, i.e. the speech recognizer and the response generation module, is assessed. Although depending on the system's modeling capacities and domain complexity, experimental results show that overall performance stagnates after employing about 10k-15k utterances for training the acoustic model, 40k-50k utterances for training the language model and 40k-50k utterances for compiling the question and answer database. The Q&A database was most important for improving the system's response accuracy. Finally, the portability of the well-trained first system prototype to a different environment, a local subway station, is investigated. Since collection and preparation of large amounts of real data is impractical in general, only one month of data from the new environment is employed for system adaptation. While the speech recognition component of the first prototype has a high degree of portability, the response accuracy is lower than in the first environment. The main reason is a domain difference between the two systems, since they are installed in different environments. This implies that it is imperative to take the behavior of users under real conditions into account to build a system with high user satisfaction.


IEEE Automatic Speech Recognition and Understanding Workshop | 2005

Selective EM training of acoustic models based on sufficient statistics of single utterances

Tobias Cincarek; Tomoki Toda; Hiroshi Saruwatari; Kiyohiro Shikano

In this paper, a new algorithm for selective training of acoustic models is proposed. The algorithm is formulated for an HMM-based model with Gaussian mixture densities, but works in principle for any statistical model which has sufficient statistics. Since there are too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The algorithm is based on deleting single utterances from a data pool temporarily, or on alternating between successive deletion and addition of utterances. The optimization criterion is the likelihood of the new model parameters given some development data, which can be calculated in a short amount of time based on sufficient statistics. The method is applied to automatically obtain task-dependent acoustic models for infant and elderly speech by selecting utterances from a data pool which are acoustically close to the development data. The proposed method is computationally practical and also addresses the issue of reducing the high costs arising from the development of applications which make use of speech recognition technology.
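
A toy version of the deletion heuristic, shrunk to a single diagonal Gaussian in place of an HMM with Gaussian mixture densities; "retraining" after removing an utterance is only a subtraction of its sufficient statistics, which is what makes the search affordable.

```python
import numpy as np

def utterance_stats(frames):
    """Zeroth/first/second-order sufficient statistics of one utterance."""
    return float(len(frames)), frames.sum(axis=0), (frames**2).sum(axis=0)

def model_from_stats(n, s1, s2):
    """Maximum-likelihood mean/variance from accumulated statistics."""
    mean = s1 / n
    return mean, np.maximum(s2 / n - mean**2, 1e-6)

def dev_loglik(dev, mean, var):
    diff = dev - mean
    return -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var))

def selective_training(pool, dev, n_sweeps=3):
    """Temporarily delete each utterance; keep the deletion if the model
    retrained on the remainder scores the development data higher."""
    stats = [utterance_stats(u) for u in pool]
    selected = set(range(len(pool)))
    for _ in range(n_sweeps):
        changed = False
        for i in sorted(selected):
            if len(selected) == 1:
                break
            total = [sum(stats[j][k] for j in selected) for k in range(3)]
            without = [total[k] - stats[i][k] for k in range(3)]
            if dev_loglik(dev, *model_from_stats(*without)) > \
               dev_loglik(dev, *model_from_stats(*total)):
                selected.remove(i)
                changed = True
        if not changed:
            break
    return [pool[i] for i in sorted(selected)]
```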


IEICE Transactions on Information and Systems | 2006

Utterance-Based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models

Tobias Cincarek; Tomoki Toda; Hiroshi Saruwatari; Kiyohiro Shikano

To obtain a robust acoustic model for a certain speech recognition task, a large amount of speech data is necessary. However, the preparation of speech data, including recording and transcription, is very costly and time-consuming. Although there are attempts to build generic acoustic models which are portable among different applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances which are acoustically close to the development data are selected from existing speech data resources. Since there are in general too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The proposed algorithm deletes single utterances temporarily or alternates between successive deletion and addition of multiple utterances. In order to make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily based on sufficient statistics, without the need for explicit reconstruction of model parameters. The algorithm is applied to obtain infant- and elderly-dependent acoustic models with only very little development data available. There is an improvement in word accuracy of up to 9% in comparison to conventional EM training without selection. Furthermore, the approach was also better than MLLR and MAP adaptation with the development data.
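
The claim that the likelihood can be computed quickly from sufficient statistics can be made concrete with the same single-Gaussian stand-in used above: once the development data's own statistics are accumulated, scoring a candidate model needs no pass over individual frames.

```python
import numpy as np

def dev_stats(dev_frames):
    """One-time accumulation over the development data ([T, D])."""
    return (float(len(dev_frames)), dev_frames.sum(axis=0),
            (dev_frames**2).sum(axis=0))

def loglik_from_stats(n, s1, s2, mean, var):
    """Gaussian log-likelihood of the whole development set, evaluated
    from its sufficient statistics (n, s1, s2) alone:

      sum_t log N(x_t; mu, var)
        = -0.5 * ( n*D*log(2*pi) + n*sum_d log(var_d)
                   + sum_d (s2_d - 2*mu_d*s1_d + n*mu_d^2) / var_d )

    Scoring a candidate model is O(D), independent of the development
    set length, which keeps repeated selection sweeps cheap."""
    d = len(mean)
    quad = np.sum((s2 - 2 * mean * s1 + n * mean**2) / var)
    return -0.5 * (n * d * np.log(2 * np.pi)
                   + n * np.sum(np.log(var)) + quad)
```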


International Conference on Acoustics, Speech, and Signal Processing | 2009

Hands-free speech recognition challenge for real-world speech dialogue systems

Hiroshi Saruwatari; Hiromichi Kawanami; Shota Takeuchi; Yu Takahashi; Tobias Cincarek; Kiyohiro Shikano

In this paper, we describe and review our recent development of a hands-free speech dialogue system used for railway station guidance. In an application at a real railway station, robustness against reverberation and noise is the most essential issue for the dialogue system. To address this problem, we introduce two key techniques in our proposed hands-free system: (a) speech dialogue system construction with real speech database collection and language/acoustic model improvement, and (b) microphone array preprocessing using a blind spatial subtraction array, which can solve the reverberation-naiveness problem inherent in conventional microphone arrays. The experimental assessment of the proposed dialogue system reveals that our system can provide a recognition accuracy of more than 80% under realistic railway-station conditions.
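
The spectral subtraction stage of the blind spatial subtraction array might look roughly like the sketch below; the beamformer and the ICA-based blind noise estimator are assumed to exist upstream, and the over-subtraction factor and flooring constant are illustrative values, not the paper's settings.

```python
import numpy as np

def spectral_subtraction(primary_stft, noise_stft, alpha=1.2, floor=0.1):
    """Power-domain spectral subtraction stage of a spatial
    subtraction array.

    primary_stft : STFT of the target-enhancing beamformer output
    noise_stft   : STFT of the blindly estimated noise path
                   (in BSSA this comes from an ICA-based estimator)
    """
    power = np.abs(primary_stft)**2 - alpha * np.abs(noise_stft)**2
    # Flooring avoids negative power and limits musical noise.
    power = np.maximum(power, floor * np.abs(primary_stft)**2)
    # Reuse the primary phase for resynthesis.
    return np.sqrt(power) * np.exp(1j * np.angle(primary_stft))
```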


IEICE Transactions on Information and Systems | 2008

Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

Tobias Cincarek; Tomoki Toda; Hiroshi Saruwatari; Kiyohiro Shikano

Development of an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the costs are due to human labeling of newly collected speech data to construct the acoustic model for speech recognition. Employment of existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on various factors such as possible users, their speaking style and the acoustic environment. Therefore, this paper proposes a combination of unsupervised learning and selective training to reduce the development costs. The employment of unsupervised learning alone is problematic due to the task-dependency of speech recognition and because automatic transcription of speech is error-prone. A theoretically well-defined approach to automatic selection of high-quality and task-specific speech data from an unlabeled data pool is presented. Only those unlabeled data which increase the model likelihood given the labeled data are employed for unsupervised training. The effectiveness of the proposed method is investigated with a simulation experiment to construct adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database which contains real-environment data collected over two years is available for the development simulation. It is shown experimentally that the employment of selective training alleviates the problems of unsupervised learning, i.e. it is possible to select speech utterances of a certain speaker group while discarding noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several selected combinations of data collection and human transcription periods. It is found empirically that the proposed method is especially effective if only relatively few of the collected data can be labeled and transcribed by humans.
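
Reusing model_from_stats, dev_stats and loglik_from_stats from the sketches above, the acceptance test for an automatically transcribed utterance might look like this; the function names are illustrative, not the paper's.

```python
def accept_unlabeled(total_stats, cand_stats, dev):
    """Accept an automatically transcribed utterance only if merging its
    sufficient statistics raises the likelihood of the trusted labeled/
    development data; noisy inputs and misrecognized utterances tend to
    lower it and are therefore discarded."""
    n, s1, s2 = dev_stats(dev)
    merged = tuple(t + c for t, c in zip(total_stats, cand_stats))
    before = loglik_from_stats(n, s1, s2, *model_from_stats(*total_stats))
    after = loglik_from_stats(n, s1, s2, *model_from_stats(*merged))
    return after > before
```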


International Conference on Acoustics, Speech, and Signal Processing | 2007

Insights Gained from Development and Long-Term Operation of a Real-Environment Speech-Oriented Guidance System

Tobias Cincarek; Ryuichi Nisimura; Akinobu Lee; Kiyohiro Shikano

This paper presents insights gained from operating a public speech-oriented guidance system. A real-environment speech database (300 hours) collected with the system over four years is described and analyzed regarding usage frequency, content and diversity. Since the first two years of the data are completely transcribed, simulation of system development and evaluation of system performance over time are possible. The database is employed for acoustic and language modeling as well as construction of a question and answer database. Since the system input is not text but speech, the database also enables research on open-domain speech-based information access. Apart from that, research on unsupervised acoustic modeling, language modeling and system portability can be carried out. A performance evaluation of the system in an early stage as well as a late stage, when using two years of real-environment data for constructing all system components, shows the relative importance of developing each system component. The system's response accuracy is 83% for adults and 68% for children.


IEEE Automatic Speech Recognition and Understanding Workshop | 2007

Development and portability of ASR and Q&A modules for real-environment speech-oriented guidance systems

Tobias Cincarek; Hiromichi Kawanami; Hiroshi Saruwatari; Kiyohiro Shikano

In this paper, we investigate the development and portability of ASR and Q&A modules of speech-oriented guidance systems for two different real environments. An initial prototype system has been constructed for a local community center using two years of human-labeled data collected by the system. Collection of real user data is required because the ASR task and Q&A domain of a guidance system are defined by the target environment and potential users. However, since human preparation of data is always costly, most often only a relatively small amount of real data will be available for system adaptation in practice. Therefore, the portability of the initial prototype system is investigated for a different environment, a local subway station. The purpose is to identify reusable system parts. The ASR module is found to be highly portable across the two environments. However, the portability of the Q&A module was only medium. An objective analysis made clear that this is mainly due to the environment-dependent domain differences between the two systems. This implies that it will always be important to take the behavior of actual users under real conditions into account to build a system with high user satisfaction.

Collaboration


Dive into Tobias Cincarek's collaborations.

Top Co-Authors

Kiyohiro Shikano (Nara Institute of Science and Technology)
Hiromichi Kawanami (Nara Institute of Science and Technology)
Shota Takeuchi (Nara Institute of Science and Technology)
Akinobu Lee (Nagoya Institute of Technology)
Christian Hacker (University of Erlangen-Nuremberg)
Elmar Nöth (University of Erlangen-Nuremberg)
Hiroyuki Sakai (Nara Institute of Science and Technology)