Network

Latest external collaboration at the country level.

Hotspot

Research topics where Masaharu Katoh is active.

Publication


Featured research published by Masaharu Katoh.


International Conference on Audio, Language and Image Processing | 2008

An investigation on speaker vector-based speaker identification under noisy conditions

Yuki Goto; Tatsuya Akatsu; Masaharu Katoh; Tetsuo Kosaka; Masaki Kohda

This paper presents a speaker identification method based on speaker vectors under noisy conditions. The aim of this work is to improve speaker identification performance in noise. The identification system is based on anchor models: the location of each speaker is represented by a speaker vector consisting of the likelihoods between a target utterance and the anchor models. Since no acoustic model of the target speaker is needed, speaker identification can be performed with very short reference speech (5.5 s on average). To achieve the improvement, the structure of the anchor models was investigated. Evaluations were performed on 8- and 30-speaker identification tasks in Japanese. The results showed that a speaker identification rate of 70.98% was obtained by using phonetic-class structured GMMs (pcs_GMMs) as anchor models under noisy conditions.
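
The anchor-model scheme described above can be sketched compactly: score the utterance against every anchor GMM, stack the log-likelihoods into a speaker vector, and pick the reference speaker with the nearest vector. The following is a minimal illustration in Python, assuming scikit-learn GaussianMixture anchors, Euclidean distance, and hypothetical helper names (build_speaker_vector, identify_speaker); it is not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_speaker_vector(features, anchor_gmms):
    """Speaker vector: average log-likelihood of the utterance under each anchor GMM."""
    # features: (n_frames, n_dims) acoustic features (e.g., MFCCs) of one utterance
    return np.array([gmm.score(features) for gmm in anchor_gmms])

def identify_speaker(test_features, reference_vectors, anchor_gmms):
    """Pick the reference speaker whose vector is closest (Euclidean) to the test vector."""
    v = build_speaker_vector(test_features, anchor_gmms)
    distances = {spk: np.linalg.norm(v - ref) for spk, ref in reference_vectors.items()}
    return min(distances, key=distances.get)

# Toy usage: random data stands in for trained anchor GMMs and real speech features.
rng = np.random.default_rng(0)
anchors = [GaussianMixture(n_components=2, random_state=0).fit(rng.normal(size=(200, 12)))
           for _ in range(4)]
references = {f"spk{i}": build_speaker_vector(rng.normal(loc=i, size=(50, 12)), anchors)
              for i in range(3)}
test = rng.normal(loc=1, size=(50, 12))
print(identify_speaker(test, references, anchors))
```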


IEICE Transactions on Information and Systems | 2005

Robust Speech Recognition Using Discrete-Mixture HMMs

Tetsuo Kosaka; Masaharu Katoh; Masaki Kohda

This paper introduces new methods of robust speech recognition using discrete-mixture HMMs (DMHMMs). The aim of this work is to develop robust speech recognition for adverse conditions that contain both stationary and non-stationary noise. In particular, we focus on the issue of impulsive noise, which is a major problem in practical speech recognition systems. Two strategies are considered. In the first, adverse conditions are represented by the acoustic model itself; this requires a large amount of training data and accurate acoustic models to cover a variety of acoustic environments, and it is suitable for recognition in stationary or slowly varying noise. The second treats the corrupted frames directly, reducing their adverse effect by a compensation method. Since impulsive noise has a wide variety of features and is difficult to model, the second strategy is employed. To realize these strategies, we propose two methods based on the DMHMM framework, which is one type of discrete HMM (DHMM). First, a MAP-based estimation method for DMHMM parameters is proposed to improve trainability. Second, the observation probabilities of DMHMMs are floored at a threshold to reduce the adverse effect of outlier values. Observation probabilities of impulsive noise tend to be much smaller than those of normal speech, so flooring the observation probability limits the damage caused by impulsive noise. Experimental evaluations on Japanese LVCSR for read newspaper speech showed that the proposed method achieved an average error rate reduction of 48.5% in impulsive noise conditions. In adverse conditions containing both stationary and impulsive noise, the proposed method achieved an average error rate reduction of 28.1%.
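
As a rough illustration of the flooring idea in the second method (not the paper's exact formulation), the snippet below clips per-frame log observation probabilities at an assumed floor value so that a single impulsive-noise frame cannot dominate the accumulated path score.

```python
import numpy as np

def floored_log_obs_prob(log_probs, floor=-20.0):
    """Clip per-frame log observation probabilities at a floor.

    Frames hit by impulsive noise yield extremely small observation
    probabilities; flooring limits how much one outlier frame can
    penalize the whole path score.
    """
    return np.maximum(log_probs, floor)

# Example: the third frame is hit by a noise burst and gets floored to -20.0.
frame_log_probs = np.array([-4.2, -3.8, -55.0, -4.0])
print(floored_log_obs_prob(frame_log_probs))
print(frame_log_probs.sum(), floored_log_obs_prob(frame_log_probs).sum())
```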


Systems and Computers in Japan | 2002

Construction and evaluation of language models based on stochastic context-free grammar for speech recognition

Chiori Hori; Masaharu Katoh; Akinori Ito; Masaki Kohda

This paper deals with the use of a stochastic context-free grammar (SCFG) for large vocabulary continuous speech recognition; in particular, an SCFG with phrase-level dependency rules is built. Unlike n-gram models, the SCFG can describe not only local constraints but also global constraints pertaining to the sentence as a whole, making possible language models with great expressive power. However, estimation of the SCFG parameters requires the inside-outside algorithm, whose computational cost is proportional to the third power of both the number of nonterminal symbols and the input string length. Because of this difficulty with extensive text corpora, the SCFG has hardly been applied as a language model for very large vocabulary continuous speech recognition. The proposed phrase-level dependency SCFG allows a significant reduction of the computational load. In experiments with the EDR corpus, the proposed method proved effective. In experiments with the Mainichi corpus, a large-scale phrase-level dependency SCFG was built for a very large vocabulary continuous speech recognition system. Speech recognition tests with a vocabulary of about 5000 words showed that the proposed method alone did not match the trigram model in performance; however, when it was combined with a trigram model, the error rate was reduced by 14% compared to the trigram model alone.
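
The computational cost mentioned above comes from the inside pass of the inside-outside algorithm. The sketch below computes inside probabilities for a toy SCFG in Chomsky normal form, making the triple loop over spans, split points, and rules visible (cubic in both sentence length and, via the rule table, the nonterminal set). It is a generic textbook inside algorithm with an invented toy grammar, not the authors' phrase-level dependency SCFG.

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Inside (CKY-style) probability of a sentence under a CNF SCFG.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    beta = defaultdict(float)  # beta[(i, j, A)] = P(A derives words[i:j])
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, A)] += p
    for span in range(2, n + 1):          # span length
        for i in range(n - span + 1):     # span start
            j = i + span
            for k in range(i + 1, j):     # split point
                for (A, B, C), p in binary.items():
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta[(0, n, start)]

# Tiny toy grammar: S -> NP VP, NP -> "dogs", VP -> "bark"
lexical = {("NP", "dogs"): 1.0, ("VP", "bark"): 1.0}
binary = {("S", "NP", "VP"): 1.0}
print(inside_probability(["dogs", "bark"], lexical, binary))  # 1.0
```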


Archive | 2007

Discrete-Mixture HMMs-based Approach for Noisy Speech Recognition

Tetsuo Kosaka; Masaharu Katoh; Masaki Kohda

It is well known that the application of hidden Markov models (HMMs) led to a dramatic increase in the performance of automatic speech recognition in the 1980s and onwards. In particular, large vocabulary continuous speech recognition (LVCSR) could be realized by using a recognition unit such as phones. A variety of speech characteristics can be modelled effectively by HMMs. The HMM represents the transition of statistical characteristics by the state sequence of a Markov chain, and each state has either a discrete output probability distribution or a continuous output probability distribution. In the 1980s, discrete HMMs were mainly used as acoustic models for speech recognition. The SPHINX speech recognition system was developed by K.-F. Lee in the late 1980s (Lee & Hon, 1988); it was a speaker-independent, continuous speech recognition system based on discrete HMMs, and it obtained a word accuracy of 93% on the 997-word resource management task with a bigram language model. Subsequent comparative investigations between discrete and continuous HMMs concluded that continuous-mixture HMMs outperformed discrete HMMs, and almost all recent speech recognition systems therefore use continuous-mixture HMMs (CHMMs) as acoustic models. The parameters of CHMMs can be estimated efficiently under the assumption of normal distributions. Meanwhile, discrete hidden Markov models (DHMMs) based on vector quantization (VQ) suffer from quantization distortion. However, CHMMs may be unsuited to recognizing noisy speech because the normality assumption does not hold, whereas DMHMMs can represent more complicated distribution shapes and are expected to be useful for noisy speech. This chapter introduces new methods of noise-robust speech recognition using discrete-mixture HMMs (DMHMMs) based on maximum a posteriori (MAP) estimation. The aim of this work is to develop robust speech recognition for adverse conditions that contain both stationary and non-stationary noise. In particular, we focus on the issue of impulsive noise, which is a major problem in practical speech recognition systems. The DMHMM is one type of DHMM framework; it was originally proposed to reduce computation costs in the decoding process (Takahashi et al., 1997).
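
To make the discrete-versus-continuous distinction concrete, the sketch below contrasts the two ways a state can score a feature vector: a discrete (VQ-based) HMM quantizes the vector to its nearest codeword and looks up a probability, while a continuous-mixture HMM evaluates a Gaussian mixture density. The parameters are toy values, and this is a generic textbook illustration, not the chapter's DMHMM formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def discrete_obs_prob(x, codebook, state_table):
    """Discrete HMM emission: vector-quantize x, then look up the codeword probability."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))  # nearest codeword (VQ)
    return state_table[idx]

def continuous_obs_prob(x, weights, means, covs):
    """Continuous-mixture HMM emission: weighted sum of Gaussian densities."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

x = np.array([0.2, -0.1])
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # toy VQ codebook
state_table = np.array([0.6, 0.3, 0.1])                      # P(codeword | state)
print(discrete_obs_prob(x, codebook, state_table))           # 0.6 (nearest codeword is the first)

weights = [0.7, 0.3]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(continuous_obs_prob(x, weights, means, covs))
```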


International Journal of Computer Processing of Languages | 2009

Dictation of Japanese Speech Based on Kana and Kanji Character String

Akinori Ito; Hiroaki Kinno; Masaharu Katoh; Tetsuo Kosaka; Masaki Kohda

In this paper, a character-based Japanese dictation method is proposed. The method is based on the kana and kanji string language model proposed by Ito et al. First, sentences in the training corpus are split into character-based units (CBUs). Then, strings of CBUs (CBUSes) are chosen from the CBU corpus based on a statistical criterion. We examined three criteria for CBUS selection: frequency-based selection, mutual-information-based selection, and their combination. The experimental results showed that the combined method gave the best result (7.19% and 8.75% CBU error rates for the 20k- and 60k-word vocabulary conditions, respectively), which was better than the ordinary word-based method (7.61% and 9.15% CBU error rates, respectively). In addition, we carried out a recognition experiment on the Corpus of Spontaneous Japanese to confirm that the proposed method is effective not only for read speech but also for spontaneous speech. Here the frequency-based method gave the best result (29.82%), which is better than the word-based recognition result (32.80%).
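
The selection criteria above can be illustrated with a toy pass over a character corpus: count adjacent character pairs, compute each pair's frequency and pointwise mutual information, and keep pairs that pass both thresholds as candidate multi-character units. The thresholds, helper name, and tiny corpus below are illustrative assumptions, not the paper's exact selection procedure.

```python
import math
from collections import Counter

def select_char_units(sentences, min_count=2, min_pmi=1.0):
    """Pick adjacent character pairs by frequency and pointwise mutual information."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    selected = []
    for (a, b), c in bigrams.items():
        pmi = math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        if c >= min_count and pmi >= min_pmi:  # combined criterion: frequency AND PMI
            selected.append((a + b, c, round(pmi, 2)))
    return selected

# Toy "corpus" of kana/kanji strings.
corpus = ["東京都に行く", "東京都は広い", "京都に行く"]
print(select_char_units(corpus))
```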


Journal of the Acoustical Society of America | 2006

Noisy speech recognition based on codebook normalization of discrete‐mixture hidden Markov models

Tetsuo Kosaka; Masaharu Katoh; Masaki Kohda

This paper presents a new method of robust speech recognition under noisy conditions based on discrete-mixture HMMs (DMHMMs). DMHMMs were proposed originally to reduce calculation costs. Recently, we applied DMHMMs to noisy speech recognition and found that they were effective for modeling noisy speech [Kosaka et al., Proc. of ICA04 (2004), Vol. II, pp. 1691–1694]. For further improvement of noisy speech recognition, we propose a novel normalization method for DMHMM codebooks. The codebook normalization method is based on histogram equalization (HEQ), which is commonly applied for feature-space normalization. In this study, however, the DMHMM codebooks themselves are normalized, so the method can be regarded as normalization in model space rather than in feature space. Model-space normalization has some inherent merits: a transformation function can be prepared for each acoustic model, and it is not necessary to normalize input parameters frame by frame. The propos...
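
Histogram equalization maps values so that their empirical distribution matches a reference distribution. The sketch below is a generic one-dimensional HEQ transform via quantile matching, applied to toy values rather than real DMHMM codebooks; it illustrates the idea of the normalization, not the paper's exact procedure.

```python
import numpy as np

def histogram_equalize(values, reference):
    """Map each value to the reference-distribution quantile at the same empirical CDF position."""
    ranks = np.argsort(np.argsort(values))            # rank of each value within `values`
    cdf = (ranks + 0.5) / len(values)                 # empirical CDF positions in (0, 1)
    return np.quantile(np.asarray(reference), cdf)    # matching reference quantiles

# Toy example: "noisy" codebook values mapped onto a clean reference distribution.
rng = np.random.default_rng(0)
noisy_codebook = rng.normal(loc=2.0, scale=3.0, size=8)     # shifted/scaled values
clean_reference = rng.normal(loc=0.0, scale=1.0, size=1000)
print(np.round(histogram_equalize(noisy_codebook, clean_reference), 3))
```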


Conference of the International Speech Communication Association | 1997

N-gram language model adaptation using small corpus for spoken dialog recognition.

Akinori Ito; Hideyuki Saitoh; Masaharu Katoh; Masaki Kohda


Conference of the International Speech Communication Association | 2000

Language modeling by stochastic dependency grammar for Japanese speech recognition.

Akinori Ito; Chiori Hori; Masaharu Katoh; Masaki Kohda


Systems and Computers in Japan | 2001

Erratum: Language modeling by stochastic dependency grammar for Japanese speech recognition

Akinori Ito; Chiori Hori; Masaharu Katoh; Masaki Kohda


Conference of the International Speech Communication Association | 2010

Speaker adaptation based on system combination using speaker-class models.

Tetsuo Kosaka; Takashi Ito; Masaharu Katoh; Masaki Kohda

Collaboration


Dive into Masaharu Katoh's collaboration.

Top Co-Authors

Chiori Hori (Tokyo Institute of Technology)