Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Katsutoshi Ohtsuki is active.

Publication


Featured research published by Katsutoshi Ohtsuki.


Communications of the ACM | 2000

Japanese broadcast news transcription and information extraction

Sadaoki Furui; Katsutoshi Ohtsuki; Zhipeng Zhang

Inspired by the activities within the DARPA research community, we have been developing a large-vocabulary continuous-speech recognition (LVCSR) system for Japanese broadcast-news speech transcription [4]. This is part of a joint research project with NHK broadcasting whose goal is the closed-captioning of TV programs. While some of the problems we have investigated are Japanese-specific, others are language-independent.

The broadcast-news manuscripts used for constructing our language models were taken from NHK news broadcasts between July 1992 and May 1996, and comprised roughly 500,000 sentences and 22 million words. To calculate word n-gram language models, we segmented the manuscripts into words using a morphological analyzer, since Japanese sentences are written without spaces between words. A word-frequency list was derived from the manuscripts, and the 20,000 most frequently used words were selected as vocabulary words. This 20,000-word vocabulary covered approximately 98% of the words in the manuscripts. We calculated bigrams and trigrams and estimated unseen n-grams using Katz's back-off smoothing method.

The feature vector consisted of 16 cepstral coefficients, normalized logarithmic power, and their delta features (derivatives), for a total of 34 parameters per vector. Cepstral coefficients were normalized by the cepstral mean subtraction (CMS) method. The acoustic models were gender-dependent, shared-state triphone hidden Markov models (HMMs) designed using tree-based clustering. They were trained on phonetically balanced sentences and dialogues read by 53 male speakers and 56 female speakers. The total number of training utterances was 13,270 for males and 13,367 for females, and the total length of the training data was approximately 20 hours for each gender. The total number of HMM states was approximately 2,000 for each gender, with four Gaussian mixture components per state.

News speech data from TV broadcasts in July 1996 were divided into a clean part and a noisy part, which were evaluated separately. The clean part consisted of utterances with no background noise; the noisy part consisted of utterances with background noise, including spontaneous speech such as reports by correspondents. We extracted 50 male utterances and 50 female utterances for each part; each set included utterances by five or six speakers. All utterances were manually segmented into sentences. Due to space limitations, we report only the results for the clean part here. Reading-dependent language modeling. …
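
As a rough illustration of the vocabulary-selection step described above (not the authors' actual tooling), the following Python sketch builds a word-frequency list from pre-segmented manuscripts, keeps the most frequent words, and measures how much of the corpus the resulting vocabulary covers. The function name and toy corpus are illustrative assumptions.

```python
from collections import Counter

def build_vocabulary(segmented_sentences, vocab_size=20000):
    """Select the most frequent words and report corpus coverage.

    segmented_sentences: iterable of word lists (Japanese text is assumed
    to be already segmented by a morphological analyzer).
    """
    counts = Counter()
    for words in segmented_sentences:
        counts.update(words)

    vocabulary = {w for w, _ in counts.most_common(vocab_size)}
    total_tokens = sum(counts.values())
    covered_tokens = sum(c for w, c in counts.items() if w in vocabulary)
    coverage = covered_tokens / total_tokens  # the paper reports ~98% for its 20k vocabulary

    return vocabulary, coverage

# Toy usage (real input would be the segmented NHK manuscripts):
toy_corpus = [["ニュース", "を", "伝え", "ます"], ["ニュース", "です"]]
vocab, coverage = build_vocabulary(toy_corpus, vocab_size=3)
print(len(vocab), f"{coverage:.1%}")
```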


international conference on acoustics, speech, and signal processing | 2000

On-line incremental speaker adaptation with automatic speaker change detection

Zhipeng Zhang; Sadaoki Furui; Katsutoshi Ohtsuki

To improve the performance of speech recognition systems when speakers change frequently and each speaker utters a series of several sentences, a new unsupervised, on-line and incremental speaker adaptation technique combined with automatic detection of speaker changes is proposed. Speaker changes are detected by comparing likelihoods computed with speaker-independent and speaker-adaptive Gaussian mixture models (GMMs). Both the phone HMMs and the GMMs are adapted by MLLR transformation. In a broadcast news transcription task, this method reduces the word error rate by 10.0%. Compared with the conventional method that uses HMMs for speaker change detection, the GMM-based method requires significantly less computation at the cost of only a slightly lower word recognition rate.
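
A minimal sketch of the likelihood-comparison idea, using scikit-learn Gaussian mixture models as stand-ins for the speaker-independent and speaker-adapted GMMs described above; the decision threshold and the feature front-end are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def speaker_change_detected(utterance_features, si_gmm, sa_gmm, threshold=0.0):
    """Flag a speaker change when the speaker-independent (SI) GMM explains
    the new utterance better than the GMM adapted to the current speaker (SA).

    utterance_features: (num_frames, dim) acoustic features of one utterance.
    si_gmm, sa_gmm: fitted sklearn GaussianMixture models.
    """
    ll_si = si_gmm.score(utterance_features)  # mean log-likelihood per frame
    ll_sa = sa_gmm.score(utterance_features)
    return (ll_si - ll_sa) > threshold

# Toy usage: fit an SI model on pooled data and an SA model on one speaker's data.
rng = np.random.default_rng(0)
pooled = rng.normal(size=(2000, 13))
speaker_a = rng.normal(loc=0.5, size=(500, 13))
si_gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(pooled)
sa_gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(speaker_a)

new_utterance = rng.normal(size=(300, 13))  # features of the next utterance
print(speaker_change_detected(new_utterance, si_gmm, sa_gmm))
```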


international conference on acoustics, speech, and signal processing | 2005

Unsupervised vocabulary expansion for automatic transcription of broadcast news

Katsutoshi Ohtsuki; Nobuaki Hiroshima; Masahiro Oku; Akihiro Imamura

We present an unsupervised vocabulary adaptation method for large vocabulary continuous speech recognition based on relevant word extraction. This method addresses the out-of-vocabulary (OOV) problem, which is one of the most challenging problems in current automatic speech recognition (ASR) systems. Words relevant to the content of the input speech are extracted from a vocabulary database, based on speech recognition results obtained in the first recognition pass using a reference vocabulary. The relevance between words is calculated using concept vectors, which are trained on word co-occurrence statistics. An expanded vocabulary that includes fewer OOV words is built by adding the extracted words to the reference vocabulary and is used for the second recognition pass. Experimental results for broadcast news speech show that our method achieves a 30% reduction in the OOV rate and also improves speech recognition accuracy.
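
A hedged sketch of the relevance computation: concept vectors built from word co-occurrence counts, with cosine similarity used to pull words related to the first-pass recognition result out of a larger vocabulary database. The paper's exact concept-vector training is not reproduced here; sentence-level co-occurrence, L2 normalization, and the function names are assumptions.

```python
import numpy as np

def train_concept_vectors(segmented_sentences, vocabulary):
    """Build simple concept vectors as rows of a word co-occurrence matrix
    (co-occurrence counted within a sentence)."""
    index = {w: i for i, w in enumerate(vocabulary)}
    cooc = np.zeros((len(vocabulary), len(vocabulary)))
    for words in segmented_sentences:
        ids = [index[w] for w in words if w in index]
        for i in ids:
            for j in ids:
                if i != j:
                    cooc[i, j] += 1.0
    # L2-normalize so that dot products become cosine similarities.
    norms = np.linalg.norm(cooc, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    return index, cooc / norms

def expand_vocabulary(first_pass_words, index, vectors, reference_vocab, top_n=100):
    """Return the database words most relevant to the words recognized in the
    first pass, excluding words already in the reference vocabulary."""
    ids = [index[w] for w in first_pass_words if w in index]
    if not ids:
        return []
    query = vectors[ids].mean(axis=0)   # centroid of the first-pass words
    scores = vectors @ query            # cosine similarity to each database word
    ranked = sorted(index, key=lambda w: -scores[index[w]])
    return [w for w in ranked if w not in reference_vocab][:top_n]
```

The expanded vocabulary for the second recognition pass would then be the reference vocabulary plus the returned candidates.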


Speech Communication | 2002

On-line incremental speaker adaptation for broadcast news transcription

Zhipeng Zhang; Sadaoki Furui; Katsutoshi Ohtsuki

This paper describes a new unsupervised, on-line and incremental speaker adaptation technique that improves the performance of speech recognition systems when there are frequent changes in speaker identity and each speaker utters a series of several sentences. Speaker changes are detected using speaker-independent (SI) and speaker-adaptive (SA) Gaussian mixture models (GMMs), and both the phone hidden Markov models (HMMs) and the GMMs are adapted by maximum likelihood linear regression (MLLR) transformation. Using this method, the word error rate on a broadcast news transcription task was reduced by 10.0% relative to the results obtained with the SI models.
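
For the MLLR adaptation step, the sketch below gives the standard closed-form estimation of a single global mean transform for diagonal-covariance Gaussians; it is a generic textbook formulation, not the authors' exact implementation, and the per-frame component posteriors are assumed to come from a GMM E-step or HMM forward-backward pass.

```python
import numpy as np

def estimate_global_mllr_mean_transform(obs, means, variances, resp):
    """Estimate a global MLLR mean transform W (d x (d+1)) such that the
    adapted mean of component m is W @ [1, mu_m].

    obs:       (T, d) adaptation frames
    means:     (M, d) component means (diagonal covariances)
    variances: (M, d) component variances
    resp:      (T, M) component posteriors (occupancies) for each frame
    """
    T, d = obs.shape
    xi = np.hstack([np.ones((means.shape[0], 1)), means])  # extended means, (M, d+1)
    gamma = resp.sum(axis=0)                                # total occupancy per component
    obs_stats = resp.T @ obs                                # sum_t gamma_m(t) * o(t), (M, d)

    W = np.zeros((d, d + 1))
    for i in range(d):                                      # one row of W per feature dimension
        inv_var = 1.0 / variances[:, i]
        G = (xi * (gamma * inv_var)[:, None]).T @ xi
        k = (obs_stats[:, i] * inv_var) @ xi
        W[i] = np.linalg.solve(G, k)
    return W

def adapt_means(means, W):
    """Apply the estimated transform to all component means."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T
```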


international conference on acoustics speech and signal processing | 1998

Topic extraction with multiple topic-words in broadcast-news speech

Katsutoshi Ohtsuki; T. Matsuoka; Shoichi Matsunaga; Sadaoki Furui

This paper reports on topic extraction in Japanese broadcast-news speech. We studied, using continuous speech recognition, the extraction of several topic-words from broadcast news. A combination of multiple topic-words represents the content of the news; this is a more detailed and more flexible approach than using a single word or a single category. A topic extraction model gives the degree of relevance between each topic-word and each word in an article, and the topic-words with the highest total relevance score over all words in the article are extracted. We trained the topic extraction model on five years of newspapers, using the frequency of topic-words taken from headlines and of words in the articles. The degree of relevance between topic-words and words in articles is calculated using statistical measures, i.e., mutual information or the χ²-value. In topic extraction experiments on recognized broadcast-news speech, we extracted five topic-words from the 10-best hypotheses using a χ²-based model and found that 76.6% of them agreed with the topic-words chosen by human subjects.
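
A hedged sketch of the χ²-based relevance measure: for each (topic-word, word) pair, the association is computed from a 2×2 contingency table over training articles, with the topic-word taken from the headline and the word from the article body. The counting scheme and any smoothing used in the paper are not reproduced.

```python
def chi_square_relevance(n_both, n_topic, n_word, n_articles):
    """Chi-square association between a headline topic-word and a body word,
    computed from a 2x2 contingency table over training articles.

    n_both:     articles whose headline contains the topic-word AND whose body contains the word
    n_topic:    articles whose headline contains the topic-word
    n_word:     articles whose body contains the word
    n_articles: total number of training articles
    """
    a = n_both
    b = n_topic - n_both
    c = n_word - n_both
    d = n_articles - n_topic - n_word + n_both
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_articles * (a * d - b * c) ** 2 / denom
```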


ieee automatic speech recognition and understanding workshop | 1997

Topic extraction based on continuous speech recognition in broadcast-news speech

Katsutoshi Ohtsuki; Shoichi Matsunaga; T. Matsuoka; Sadaoki Furui

The paper reports on topic extraction in Japanese broadcast news speech. We studied, using continuous speech recognition, the extraction of several topic words from broadcast news. A combination of multiple topic words represents the content of the news; this is more detailed and more flexible than a single word or a single category. A topic extraction model gives the degree of relevance between each topic word and each word in an article, and the topic words with the highest total relevance score over all words in the article are extracted. We trained the topic extraction model on five years of newspapers, using the frequency of topic words taken from headlines and of words in the articles. The degree of relevance between topic words and words in articles is calculated using statistical measures, i.e., mutual information or the χ² value. In topic extraction experiments on recognized broadcast news speech, we extracted five topic words using a χ²-based model and found that 75% of them agreed with topic words chosen by human subjects.
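
Complementing the χ² sketch above, the following shows the other association measure mentioned (mutual information estimated from the same article-level counts) and the extraction step itself: every candidate topic word is scored by summing its relevance to all words in the recognized article, and the highest-scoring candidates are returned. The shape of the relevance table is an assumption for illustration.

```python
import math
from collections import defaultdict

def mutual_information(n_both, n_topic, n_word, n_articles):
    """Pointwise mutual information between a headline topic word and a body
    word, estimated from article-level counts."""
    if n_both == 0:
        return float("-inf")
    p_both = n_both / n_articles
    p_topic = n_topic / n_articles
    p_word = n_word / n_articles
    return math.log2(p_both / (p_topic * p_word))

def extract_topic_words(recognized_words, relevance, num_topics=5):
    """Sum each candidate topic word's relevance to every word in the
    recognized article and return the highest-scoring candidates.

    relevance: dict mapping (topic_word, word) -> association score.
    """
    totals = defaultdict(float)
    for (topic_word, word), score in relevance.items():
        if word in recognized_words:
            totals[topic_word] += score
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:num_topics]
```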


international conference on spoken language processing | 1996

Japanese large-vocabulary continuous-speech recognition using a business-newspaper corpus

Tatsuo Matsuoka; Katsutoshi Ohtsuki; Takeshi Mori; Sadaoki Furui; Katsuhiko Shirai

This paper studies Japanese large-vocabulary continuous-speech recognition (LVCSR) for a Japanese business newspaper. To enable word N-grams to be used, sentences were first segmented into words (morphemes) using a morphological analyzer. About five years of newspaper articles were used to train the N-gram language models. To evaluate our recognition system, we recorded speech data for sentences from another set of articles and conducted LVCSR experiments on this speech corpus. For a 7k vocabulary, the word error rate was 82.8% with no grammar and context-independent acoustic models; it improved to 20.0% when both bigram language models and context-dependent acoustic models were used.
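
Because Japanese text has no spaces, morphological segmentation precedes n-gram counting. The analyzer used in the paper is not named in this excerpt; the sketch below uses the MeCab tokenizer (via the mecab-python3 binding) purely as a stand-in, followed by plain bigram counting as the raw statistics a back-off language model would be built from.

```python
from collections import Counter

import MeCab  # assumption: mecab-python3 and a dictionary are installed

def segment(text):
    """Split a Japanese sentence into words (morphemes) using MeCab's
    wakati-gaki (space-separated) output mode."""
    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).split()

def count_bigrams(sentences):
    """Count word bigrams with sentence-boundary markers."""
    bigrams = Counter()
    for sentence in sentences:
        words = ["<s>"] + segment(sentence) + ["</s>"]
        bigrams.update(zip(words, words[1:]))
    return bigrams

print(count_bigrams(["日経平均株価は続伸した。"]).most_common(5))
```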


international conference on acoustics speech and signal processing | 1999

Message-driven speech recognition and topic-word extraction

Katsutoshi Ohtsuki; Sadaoki Furui; Atsushi Iwasaki; Naoyuki Sakurai

This paper proposes a new formulation for speech recognition/understanding systems in which the a posteriori probability of the message the speaker intends to convey, given an observed acoustic sequence, is maximized. This is an extension of the current criterion, which maximizes the probability of a word sequence. Among the various possible representations, we employ a co-occurrence score of words measured by mutual information as the conditional probability of a word sequence occurring in a given message. The word sequence hypotheses obtained with bigram and trigram language models are rescored using the co-occurrence score. Experimental results show that word accuracy is improved by this method. Topic-words that represent the content of a speech signal are then extracted from the speech recognition results based on the significance score of each word. When five topic-words are extracted for each broadcast-news article, 82.8% of them are correct on average. This paper also proposes a verbalization-dependent language model, which is useful for Japanese dictation systems.
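
A minimal sketch of the rescoring idea: each N-best hypothesis gets its n-gram language-model score combined with an average pairwise word co-occurrence (mutual information) score, and the best-scoring hypothesis is kept. The interpolation weight, the toy MI table, and the exact combination rule are assumptions rather than the paper's formulation.

```python
from itertools import combinations

def cooccurrence_score(words, pair_mi):
    """Average pairwise mutual-information score over all word pairs in a
    hypothesis; pair_mi maps an (unordered) word pair to its MI score."""
    pairs = list(combinations(set(words), 2))
    if not pairs:
        return 0.0
    return sum(pair_mi.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def rescore_nbest(hypotheses, pair_mi, weight=1.0):
    """Rescore N-best hypotheses (word list, n-gram log score) with the
    co-occurrence score and return the best one."""
    def total(hyp):
        words, lm_score = hyp
        return lm_score + weight * cooccurrence_score(words, pair_mi)
    return max(hypotheses, key=total)

# Toy usage with a hypothetical MI table: the co-occurrence score prefers the
# hypothesis whose words fit the message, despite a slightly worse LM score.
pair_mi = {frozenset({"景気", "回復"}): 2.3, frozenset({"景気", "快復"}): 0.1}
nbest = [(["景気", "が", "快復"], -12.0), (["景気", "が", "回復"], -12.4)]
print(rescore_nbest(nbest, pair_mi))
```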


conference of the international speech communication association | 1999

Recent advances in Japanese broadcast news transcription.

Katsutoshi Ohtsuki; Sadaoki Furui; Naoyuki Sakurai; Atsushi Iwasaki; Zhipeng Zhang


conference of the international speech communication association | 1997

Toward automatic transcription of Japanese broadcast news.

Tatsuo Matsuoka; Yuichi Taguchi; Katsutoshi Ohtsuki; Sadaoki Furui; Katsuhiko Shirai

Collaboration


Dive into Katsutoshi Ohtsuki's collaborations.

Top Co-Authors

Sadaoki Furui (Tokyo Institute of Technology)
Tatsuo Matsuoka (National Institute of Advanced Industrial Science and Technology)
Zhipeng Zhang (Tokyo Institute of Technology)
Atsushi Iwasaki (Tokyo Institute of Technology)
Yoshihiko Hayashi (Nippon Telegraph and Telephone)
Naoyuki Sakurai (Tokyo Institute of Technology)
K. Takagi (Tokyo Institute of Technology)
Katsuji Bessho (Nippon Telegraph and Telephone)