Publication


Featured research published by Tatsuya Kawahara.


International Conference on Acoustics, Speech, and Signal Processing | 2000

A new phonetic tied-mixture model for efficient decoding

Akinobu Lee; Tatsuya Kawahara; Kazuya Takeda; Kiyohiro Shikano

A phonetic tied-mixture (PTM) model for efficient large-vocabulary continuous speech recognition is presented. It is synthesized from context-independent phone models with 64 mixture components per state by assigning different mixture weights according to the shared states of triphones. Mixtures are then re-estimated for optimization. The model achieves a word error rate of 7.0% on a 20,000-word newspaper dictation task, which is comparable to the best figure obtained by triphone models of much higher resolution. Compared with conventional PTMs that share Gaussians across all states, the proposed model is easily trained and reliably estimated. Furthermore, the model enables the decoder to perform efficient Gaussian pruning. Computing only two out of 64 components is found to cause no loss of accuracy. Several pruning methods are proposed and compared, and the best one reduces the computation to about 20%.
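The top-k approximation mentioned in this abstract can be illustrated with a minimal sketch. It assumes diagonal-covariance Gaussians and a codebook shared by several states that differ only in their mixture weights; the function names (`codebook_topk`, `state_loglik`) are hypothetical. Note that this sketch scores every component in full before keeping the best k, so it shows only the approximation itself, not the pruning methods the paper proposes to avoid that computation.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def codebook_topk(x, codebook, k):
    """Score each (mean, var) Gaussian in a shared codebook, keep the k best.

    Returns a list of (log_density, component_index), best first.
    """
    scored = [(log_gauss_diag(x, m, v), i) for i, (m, v) in enumerate(codebook)]
    scored.sort(reverse=True)
    return scored[:k]

def state_loglik(topk, weights):
    """Approximate a state's log-likelihood from the kept components only,
    using that state's own mixture weights."""
    return logsumexp([math.log(weights[i]) + ll for ll, i in topk])
```

Because the kept components are computed once per frame and codebook, every state that shares the codebook can reuse them and only the weighted sum differs per state.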


IEEE Transactions on Speech and Audio Processing | 1998

Flexible speech understanding based on combined key-phrase detection and verification

Tatsuya Kawahara; Chin-Hui Lee; Biing-Hwang Juang

We propose a novel speech understanding strategy based on combined detection and verification of semantically tagged key-phrases in spontaneous spoken utterances. Key-phrases are defined in a top-down manner so as to constitute semantic slots. Their detection directly leads to robust understanding. A phrase network realizes both a wide coverage and a reasonable constraint for detection. A subword-based verifier is then incorporated to reduce false alarms in detection and attach confidence measures of the detected phrases. This set of phrase confidence measures, when incorporated in a spoken dialogue system, forms a basis for designing intelligent speech interfaces that accept only verified key-phrases and reprompt users to clarify unspecified or unrecognized portions. Several forms of confidence measures based on subword-level tests are investigated. The proposed approach was tested on field data collected from real-world trial applications. The combined detection and verification strategy drastically improves the accuracy in handling out-of-grammar utterances over the conventional decoding approaches while maintaining the performance for in-grammar utterances.


IEEE Transactions on Speech and Audio Processing | 2004

Language model and speaking rate adaptation for spontaneous presentation speech recognition

Hiroaki Nanjo; Tatsuya Kawahara

The paper addresses methods for adapting to the language model and speaking rate (SR) of individual speakers, which are two major problems in automatic transcription of spontaneous presentation speech. To cope with a large variation in expression and pronunciation of words depending on the speaker, firstly, we investigate the effect of statistical and context-dependent pronunciation modeling. Secondly, we present unsupervised methods of language model adaptation to a specific speaker and a topic by 1) selecting similar texts based on the word perplexity and TF-IDF measure and 2) making direct use of the initial recognition result for generating an enhanced model. We confirm that all proposed adaptation methods and their combinations reduce the perplexity and word error rate. We also present a decoding strategy adapted to the SR. In spontaneous speech, SR is generally fast and may vary a lot. We also observe different error tendencies for portions of presentations where speech is fast or slow. Therefore, we propose an SR-dependent decoding strategy that applies the most appropriate acoustic analysis, phone models, and decoding parameters according to the SR. Several methods are investigated and their selective application leads to improved accuracy. The combined effect of the two proposed adaptation methods is also confirmed in transcription of real academic presentations.


International Conference on Computational Linguistics | 2000

Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output

Kazunori Komatani; Tatsuya Kawahara

We present a method to realize flexible mixed-initiative dialogue, in which the system can make effective confirmation and guidance using concept-level confidence measures (CMs) derived from speech recognizer output in order to handle speech recognition errors. We define two concept-level CMs, one on content words and one on semantic attributes, using 10-best outputs of the speech recognizer and parsing with phrase-level grammars. The content-word CM is useful for selecting plausible interpretations. Less confident interpretations are passed to a confirmation process. The strategy improved the interpretation accuracy by 11.5%. Moreover, the semantic-attribute CM is used to estimate the user's intention and generate system-initiative guidance even when a successful interpretation is not obtained.
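One common way to derive a word-level confidence from an N-best list is to sum the normalized scores of the hypotheses that contain the word. The sketch below follows that generic recipe; it is an assumption for illustration and not necessarily the exact definition used in the paper. The function names, the scaling factor `alpha`, and the `threshold` value are all hypothetical.

```python
import math

def content_word_confidence(nbest, alpha=1.0):
    """nbest: list of (log_score, content_words) pairs, one per hypothesis.

    Returns a dict mapping each content word to a confidence in [0, 1]:
    the share of normalized hypothesis weight that contains the word.
    """
    peak = max(score for score, _ in nbest)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(alpha * (score - peak)) for score, _ in nbest]
    total = sum(weights)
    conf = {}
    for weight, (_, words) in zip(weights, nbest):
        for word in set(words):
            conf[word] = conf.get(word, 0.0) + weight / total
    return conf

def split_by_confidence(conf, threshold=0.8):
    """Accept words at or above the threshold; the rest go to confirmation."""
    accepted = sorted(w for w, c in conf.items() if c >= threshold)
    to_confirm = sorted(w for w, c in conf.items() if c < threshold)
    return accepted, to_confirm
```

With two equally scored hypotheses that agree on one word and differ on another, the shared word receives full confidence and each differing word receives half, so only the differing words would be sent to confirmation.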


International Conference on Acoustics, Speech, and Signal Processing | 2001

Gaussian mixture selection using context-independent HMM

Akinobu Lee; Tatsuya Kawahara; Kiyohiro Shikano

We present a method to efficiently select Gaussian mixtures for fast acoustic likelihood computation. It makes use of context-independent models for selection and back-off of corresponding triphone models. Specifically, for the k-best phone models in the preliminary evaluation, triphone models of higher resolution are applied, and the others are assigned likelihoods from the monophone models. This selection scheme assigns more reliable back-off likelihoods to the unselected states than the conventional Gaussian selection based on a VQ codebook. It can also incorporate efficient Gaussian pruning at the preliminary evaluation, which offsets the increased size of the pre-selection model. Experimental results show that the proposed method achieves performance comparable to the standard Gaussian selection, and performs much better under aggressive pruning conditions. Together with the phonetic tied-mixture modeling, acoustic matching cost is reduced to almost 14% with little loss of accuracy.
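The selection and back-off scheme described here can be sketched at the phone level as follows. This is a simplified illustration under stated assumptions: the paper works with HMM states and Gaussian mixtures, whereas this sketch takes precomputed monophone scores and a caller-supplied triphone scoring function. The names `select_and_score`, `triphone_center`, and `triphone_scorer` are hypothetical.

```python
def select_and_score(mono_scores, triphone_center, triphone_scorer, k):
    """Score triphones for one frame with monophone pre-selection.

    mono_scores: dict phone -> log-likelihood from the context-independent model.
    triphone_center: dict triphone name -> its center phone.
    triphone_scorer: callable(triphone name) -> log-likelihood (the costly step).
    k: number of best monophones whose triphones get the full computation.
    """
    # Preliminary evaluation: rank phones by their monophone score.
    ranked = sorted(mono_scores, key=mono_scores.get, reverse=True)
    selected = set(ranked[:k])

    scores = {}
    for triphone, phone in triphone_center.items():
        if phone in selected:
            # Promising phone: apply the higher-resolution triphone model.
            scores[triphone] = triphone_scorer(triphone)
        else:
            # Back off to the monophone likelihood already computed.
            scores[triphone] = mono_scores[phone]
    return scores
```

The back-off value is a real model likelihood for the frame rather than a fixed floor, which is the property the abstract contrasts with VQ-codebook-based selection.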


International Conference on Spoken Language Processing | 1996

Key-phrase detection and verification for flexible speech understanding

Tatsuya Kawahara; Chin-Hui Lee; Biing-Hwang Juang

A novel framework for robust speech understanding is presented. It is based on a detection and verification strategy. It extracts the semantically significant parts and rejects the irrelevant parts rather than decoding whole utterances. There are two key features in the strategy. Firstly, the discriminative verifier is integrated to suppress false alarms. It uses anti-subword models specifically trained to verify the recognition results. The second feature is the use of a key-phrase network as the detection unit. It embeds a stochastic constraint of keyword and key-phrase connections to improve the coverage and detection rates. The automatic generation of the key-phrase network structure is also addressed. This top-down variable-length language model can be trained with a small corpus and ported to different tasks. This property, coupled with the vocabulary-independent detector and verifier, enhances the portability of the framework.


International Conference on Acoustics, Speech, and Signal Processing | 1997

Task adaptation using MAP estimation in N-gram language modeling

Hirokazu Masataki; Yoshinori Sagisaka; Kazuya Hisaki; Tatsuya Kawahara

This paper describes a method of task adaptation in N-gram language modeling that accurately estimates N-gram statistics from a small amount of target-task data. Assuming a task-independent N-gram to be a-priori knowledge, the N-gram is adapted to a target task by MAP (maximum a-posteriori probability) estimation. Experimental results showed that the perplexities of the task-adapted models were 15% (trigram) and 24% (bigram) lower than those of the task-independent model, and that the perplexity reduction from adaptation reached a maximum of 39% when the amount of text data in the adapted task was very small.
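A textbook form of MAP adaptation treats the task-independent probability as a prior worth `tau` pseudo-observations and combines it with the target-task counts. The sketch below uses that form for bigrams; it is an assumption for illustration, and the paper's exact estimator may differ. The function name, the `prior_prob` callable, and the `tau` parameter are hypothetical.

```python
def map_bigram_prob(word, history, task_counts, prior_prob, tau=10.0):
    """MAP-style estimate of P(word | history).

    task_counts: dict (history, word) -> count observed in the target task.
    prior_prob: callable(word, history) -> task-independent probability.
    tau: prior weight, i.e. how many pseudo-observations the prior is worth.
    """
    history_total = sum(
        count for (h, _), count in task_counts.items() if h == history
    )
    word_count = task_counts.get((history, word), 0)
    # With no task data for this history, the estimate equals the prior;
    # as task counts grow, the estimate approaches the relative frequency.
    return (word_count + tau * prior_prob(word, history)) / (history_total + tau)
```

If the prior sums to one over the vocabulary for a given history, the adapted estimates also sum to one, since the numerators add up to the history count plus `tau`.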


International Conference on Acoustics, Speech, and Signal Processing | 2003

Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion

Masafumi Nishida; Tatsuya Kawahara

The paper addresses unsupervised speaker indexing for discussion audio archives. In discussions, the speaker changes frequently, thus the duration of utterances is very short and its variation is large, which causes significant problems in applying conventional methods such as model adaptation and variance-BIC (Bayesian information criterion) methods. We propose a flexible framework that selects an optimal speaker model (GMM or VQ) based on the BIC according to the duration of utterances. When the speech segment is short, the simple and robust VQ-based method is expected to be chosen, while GMM can be reliably trained for long segments. For a discussion archive having a total duration of 10 hours, it is demonstrated that the proposed method achieves higher indexing performance than that of conventional methods.
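The model selection step can be sketched with the standard BIC formula, which trades log-likelihood against a penalty that grows with the number of parameters and the logarithm of the data size. This is a generic sketch, assuming the log-likelihood and parameter count of each candidate model are already available; how the paper computes them for VQ and GMM models is not reproduced here. The function names and `penalty_weight` are hypothetical.

```python
import math

def bic(loglik, n_params, n_frames, penalty_weight=1.0):
    """BIC score: higher is better under this sign convention."""
    return loglik - 0.5 * penalty_weight * n_params * math.log(n_frames)

def select_model(candidates, n_frames, penalty_weight=1.0):
    """Pick the candidate with the highest BIC.

    candidates: dict model name -> (log-likelihood, number of free parameters).
    n_frames: number of feature frames in the speech segment.
    """
    def score(name):
        loglik, n_params = candidates[name]
        return bic(loglik, n_params, n_frames, penalty_weight)
    return max(candidates, key=score)
```

For a short segment the likelihood gain of a larger model rarely outweighs its parameter penalty, so the simpler model tends to win, while on long segments the likelihood term dominates. This matches the behavior the abstract expects.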


User Modeling and User-adapted Interaction | 2005

User Modeling in Spoken Dialogue Systems to Generate Flexible Guidance

Kazunori Komatani; Shinichi Ueno; Tatsuya Kawahara; Hiroshi G. Okuno

We address the issue of appropriate user modeling to generate cooperative responses to users in spoken dialogue systems. Unlike previous studies that have focused on a user’s knowledge, we propose more generalized modeling. We specifically set up three dimensions for user models: the skill level in use of the system, the knowledge level about the target domain, and the degree of urgency. Moreover, the models are automatically derived by decision tree learning using actual dialogue data collected by the system. We obtained reasonable accuracy in classification for all dimensions. Dialogue strategies based on user modeling were implemented on the Kyoto City Bus Information System that was developed at our laboratory. Experimental evaluations revealed that the cooperative responses adapted to each subject type served as good guides for novices without increasing dialogue duration for skilled users.


International Conference on Acoustics, Speech, and Signal Processing | 2004

Real-time word confidence scoring using local posterior probabilities on tree trellis search

Akinobu Lee; Kiyohiro Shikano; Tatsuya Kawahara

Confidence scoring based on word posterior probability is usually performed as a post-process of speech recognition decoding, and also needs a large number of word hypotheses to achieve sufficient confidence quality. We propose a simple way of computing the word confidence using estimated posterior probability while decoding. At the word expansion of stack decoding search, the local sentence likelihoods that contain heuristic scores of the unreached segment are directly used to compute the posterior probabilities. Experimental results showed that, although the likelihoods are not optimal, we can provide slightly better confidence measures compared with N-best lists, while the computation is faster than the 100-best method because no N-best decoding is required.

Collaboration


Dive into Tatsuya Kawahara's collaborations.

Top Co-Authors

Akinobu Lee

Nagoya Institute of Technology

Kiyohiro Shikano

Nara Institute of Science and Technology
