
Publication


Featured research publications by Yasuhiro Minami.


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition

Takaaki Hori; Chiori Hori; Yasuhiro Minami; Atsushi Nakamura

This paper proposes a novel one-pass search algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous-speech recognition. In the standard search method with on-the-fly composition, two or more WFSTs are composed during decoding, and a Viterbi search is performed over the composed search space. With this new method, a Viterbi search is performed over only the first of the two WFSTs; the second WFST is used solely to rescore the hypotheses generated during the search. Since this rescoring is very efficient, the total amount of computation required by the new method is almost the same as when using only the first WFST. In a 65k-word-vocabulary spontaneous lecture speech transcription task, our proposed method significantly outperformed the standard search method. Furthermore, our method was faster than decoding with a single fully composed and optimized WFST, while using only 38% of the memory required by the single-WFST approach. Finally, we achieved high-accuracy one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.
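The rescoring idea can be illustrated with a toy best-first search: the search is driven entirely by a small first model, while a second model only adjusts hypothesis scores as they are generated, so the explored search space never grows. Everything below (the LATTICE graph, the second_model_cost function, the costs) is hypothetical, a minimal sketch rather than the authors' actual decoder.

```python
import heapq

# Hypothetical toy lattice: state -> list of (next_state, word, first_model_cost)
LATTICE = {
    0: [(1, "a", 1.0), (1, "b", 2.0)],
    1: [(2, "c", 1.5), (2, "d", 0.5)],
    2: [],
}

def second_model_cost(words):
    """Stand-in for the second WFST: a cheap rescoring function."""
    return 0.5 * sum(1 for w in words if w in ("b", "c"))

def decode(start=0, final=2):
    # Best-first search over the first model, rescoring hypotheses on the fly.
    heap = [(0.0, 0.0, start, [])]  # (total_cost, first_model_cost, state, words)
    while heap:
        total, first_cost, state, words = heapq.heappop(heap)
        if state == final:
            return words, total
        for nxt, word, cost in LATTICE[state]:
            new_words = words + [word]
            new_first = first_cost + cost
            heapq.heappush(
                heap,
                (new_first + second_model_cost(new_words),
                 new_first, nxt, new_words))

best_words, best_cost = decode()
```

Because the second model is only consulted to adjust scores, the number of search states is exactly that of the first model alone.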


IEEE Transactions on Speech and Audio Processing | 2004

Variational Bayesian estimation and clustering for speech recognition

Shinji Watanabe; Yasuhiro Minami; Atsushi Nakamura; Naonori Ueda

In this paper, we propose variational Bayesian estimation and clustering for speech recognition (VBEC), which is based on the variational Bayesian (VB) approach. VBEC is a total Bayesian framework: all speech recognition procedures (acoustic modeling and speech classification) are based on VB posterior distributions, unlike the maximum likelihood (ML) approach based on ML parameters. The total Bayesian framework offers two major advantages over the ML approach for mitigating over-training effects: it can select an appropriate model structure without any condition on data set size, and it can classify categories robustly using a predictive posterior distribution. Using these advantages, VBEC 1) allows the automatic construction of acoustic models along two separate dimensions, namely, clustering triphone hidden Markov model states and determining the number of Gaussians, and 2) enables robust speech classification, based on Bayesian predictive classification using VB posterior distributions. The capabilities of the VBEC functions were confirmed in large-vocabulary continuous speech recognition experiments on read and spontaneous speech tasks. The experiments confirmed that VBEC automatically constructed accurate acoustic models and robustly classified speech; that is, it mitigated over-training effects and achieved high word accuracies.
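One of the two Bayesian advantages, robust classification via the predictive posterior, can be sketched in one dimension: for a Gaussian with unknown mean, known variance, and a conjugate Gaussian prior, the posterior predictive distribution inflates the variance, so scores degrade more gracefully than the ML plug-in estimate when training data are scarce. This is a generic illustration of Bayesian predictive classification, not the VBEC algorithm itself; the prior parameters m0 and s0sq are hypothetical.

```python
import math

def plugin_loglik(x, data, sigma2):
    """Log-likelihood under the ML plug-in Gaussian (point-estimated mean)."""
    mu = sum(data) / len(data)
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)

def predictive_loglik(x, data, sigma2, m0=0.0, s0sq=1.0):
    """Log-likelihood under the Bayesian posterior predictive distribution."""
    n = len(data)
    snsq = 1.0 / (1.0 / s0sq + n / sigma2)        # posterior variance of the mean
    mn = snsq * (m0 / s0sq + sum(data) / sigma2)  # posterior mean
    var = sigma2 + snsq                            # predictive variance is inflated
    return -0.5 * (math.log(2 * math.pi * var) + (x - mn) ** 2 / var)
```

With a single training sample, an outlying test point is scored less harshly by the predictive density than by the over-confident plug-in density, which is the robustness the abstract refers to.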


International Conference on Acoustics, Speech, and Signal Processing | 1995

A maximum likelihood procedure for a universal adaptation method based on HMM composition

Yasuhiro Minami; Sadaoki Furui

This paper proposes an adaptation method for universal noise (additive noise and multiplicative distortion) based on the HMM composition (compensation) technique. Although the original HMM composition can be applied only to additive noise, our new method can estimate multiplicative distortion by maximizing the likelihood value. The signal-to-noise ratio is automatically estimated as part of the estimation of multiplicative distortion. Phoneme recognition experiments show that this method improves recognition accuracy for noisy and distorted speech.
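In log-power terms, composing a clean-speech model with an additive-noise model amounts to a log-add of their exponentiated means. The sketch below shows only this scalar approximation, not the full HMM composition over states and mixtures; the function name and values are illustrative.

```python
import math

def composed_mean(speech_logpow, noise_logpow):
    """Log-add approximation: mean log-power of speech plus additive noise."""
    return math.log(math.exp(speech_logpow) + math.exp(noise_logpow))
```

When the noise term is far below the speech term, the composed mean reduces to the clean-speech mean, as expected.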


International Conference on Acoustics, Speech, and Signal Processing | 1993

Phoneme HMMs constrained by frame correlations

Satoshi Takahashi; Tatsuo Matsuoka; Yasuhiro Minami; Kiyohiro Shikano

Phoneme HMMs (hidden Markov models) that use correlations between two frames are proposed. The proposed technique constrains the output probability distributions of speaker-independent HMMs so that they are suitable for the input speaker. The speaker-dependent BC (bigram-constrained) HMMs and speaker-independent BC-HMMs are generated from conventional speaker-independent HMMs by combining the VQ (vector quantization) code bigram (discrete case and tied-mixture case) or the conditional Gaussian density function (continuous case). The new models were evaluated by 23-phoneme recognition in continuous speech. For the speaker-dependent BC-HMMs, which use a speaker-dependent bigram created from 50 additional sentences of the test speaker, the best recognition accuracy of 74.8% was obtained by the tied-mixture BC-HMMs. For the speaker-independent BC-HMMs, the best recognition accuracy of 67.5% was obtained by the continuous BC-HMMs.
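The bigram constraint can be sketched as a reweighting of a discrete HMM's output distribution by the VQ-code bigram conditioned on the previous frame's code. The function below is a hypothetical, simplified reading of that idea (with a small floor for unseen bigrams), not the paper's exact formulation.

```python
def constrained_output_probs(hmm_probs, bigram, prev_code):
    """Reweight a discrete HMM's output distribution by a VQ-code bigram.

    hmm_probs: dict code -> P(code | state)
    bigram:    dict (prev_code, code) -> P(code | prev_code)
    """
    scores = {c: p * bigram.get((prev_code, c), 1e-6)
              for c, p in hmm_probs.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}
```

Codes that frequently follow the previous frame's code are boosted, which is how the frame correlation constrains the otherwise frame-independent output distribution.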


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2009

Analysis of Listening-Oriented Dialogue for Building Listening Agents

Toyomi Meguro; Ryuichiro Higashinaka; Kohji Dohsaka; Yasuhiro Minami; Hideki Isozaki

Our aim is to build listening agents that can attentively listen to users and satisfy their desire to speak and be heard. This paper investigates the characteristics of such listening-oriented dialogues so that the listening process can be achieved by automated dialogue systems. We collected both listening-oriented dialogues and casual conversations and analyzed them by comparing the frequencies of dialogue acts as well as the dialogue flows, using hidden Markov models (HMMs). The analysis revealed that listening-oriented dialogues and casual conversations have characteristically different dialogue flows, and that to be good listeners, listening agents should self-disclose before asking questions and utter more questions and acknowledgments than occur in casual conversation.
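The flow comparison rests on transition statistics over dialogue-act sequences. A minimal first-order (Markov-chain) version, rather than the paper's HMMs, can be sketched as follows; the act labels are made up for illustration.

```python
from collections import Counter, defaultdict

def transition_probs(dialogues):
    """Estimate P(next_act | act) from sequences of dialogue-act labels."""
    counts = defaultdict(Counter)
    for acts in dialogues:
        for a, b in zip(acts, acts[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}
```

Comparing the matrices estimated from listening-oriented dialogues and from casual conversations would surface flow differences such as self-disclosure preceding questions.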


International Workshop on Spoken Dialogue Systems Technology | 2010

Issues in predicting user satisfaction transitions in dialogues: individual differences, evaluation criteria, and prediction models

Ryuichiro Higashinaka; Yasuhiro Minami; Kohji Dohsaka; Toyomi Meguro

This paper addresses three important issues in the automatic prediction of user satisfaction transitions in dialogues. The first issue concerns individual differences in user satisfaction ratings and how they affect the possibility of creating a user-independent prediction model. The second issue concerns how to determine appropriate evaluation criteria for predicting user satisfaction transitions. The third issue concerns how to train suitable prediction models. We present our findings on these issues on the basis of experimental results using dialogue data from two domains.


International Conference on Acoustics, Speech, and Signal Processing | 1996

Adaptation method based on HMM composition and EM algorithm

Yasuhiro Minami; Sadaoki Furui

A method for simultaneously adapting HMMs to additive noise and multiplicative distortion is proposed. The method first creates a noise HMM for the additive noise, then composes HMMs for noisy and distorted speech from this noise HMM and the speech HMMs, so that the composed HMMs become functions of the signal-to-noise (S/N) ratio and the multiplicative distortion. The S/N ratio and multiplicative distortion are estimated by maximizing the likelihood of the input speech given the HMMs. To achieve this, we propose a new method that divides the maximization into estimation of the S/N ratio and estimation of the cepstrum bias. The S/N ratio is estimated using the parallel model method, and the cepstrum bias is estimated using the EM algorithm. The method is evaluated in phoneme recognition and connected-digit recognition experiments, and the convergence guarantee of the algorithm is also discussed.
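The cepstrum-bias step can be sketched in one dimension: given a fixed Gaussian-mixture model, EM alternates component posteriors (E-step) with a closed-form bias update (M-step). The model constants below are hypothetical, and the sketch ignores the S/N-ratio half of the method.

```python
import math

# Hypothetical 1-D "cepstral" model: a two-component Gaussian mixture
MEANS = [-1.0, 1.0]
VAR = 0.25
WEIGHTS = [0.5, 0.5]

def estimate_bias(data, iters=20):
    """EM estimate of an additive bias b maximizing the mixture likelihood."""
    b = 0.0
    for _ in range(iters):
        num = den = 0.0
        for x in data:
            # E-step: component posteriors under the current bias
            resp = [w * math.exp(-(x - m - b) ** 2 / (2 * VAR))
                    for w, m in zip(WEIGHTS, MEANS)]
            z = sum(resp)
            for r, m in zip(resp, MEANS):
                num += (r / z) * (x - m)
                den += r / z
        # M-step: closed-form bias update
        b = num / den
    return b
```

For data generated by shifting the component means by a constant, the estimate converges to that shift, which is the multiplicative distortion expressed as a cepstral bias.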


Human Language Technology | 1994

A large-vocabulary continuous speech recognition algorithm and its application to a multi-modal telephone directory assistance system

Yasuhiro Minami; Kiyohiro Shikano; Osamu Yoshioka; Satoshi Takahashi; Tomokazu Yamada; Sadaoki Furui



International Conference on Acoustics, Speech, and Signal Processing | 2003

Language model adaptation using WFST-based speaking-style translation

Takaaki Hori; Daniel Willett; Yasuhiro Minami

This paper describes a new approach to language model adaptation for speech recognition based on the statistical framework of speech translation. The main idea is to compose a weighted finite-state transducer (WFST) that translates sentence styles from in-domain to out-of-domain. This makes it possible to integrate language models of different speaking styles or dialects, and even of different vocabularies. The WFST is built by combining in-domain and out-of-domain models through the translation, where each model and the translation itself are expressed as WFSTs. We apply this technique to building language models for spontaneous speech recognition using large written-style corpora. We conducted experiments on a 20k-word Japanese spontaneous speech recognition task. With a small in-domain corpus, a 2.9% absolute improvement in word error rate is achieved over the in-domain model.
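At the core is WFST composition: an arc of the style-translation transducer pairs with an arc of the out-of-domain language model whenever the first arc's output symbol matches the second arc's input symbol, and path weights add (tropical semiring). The toy below ignores epsilon transitions and optimization steps; all symbols and weights are invented.

```python
# Arc = (src, input, output, weight, dst); acyclic toy WFSTs.
def compose(t1, t2):
    """Pair arcs whose output (t1) matches input (t2); weights add."""
    return [((s1, s2), i1, o2, w1 + w2, (d1, d2))
            for (s1, i1, o1, w1, d1) in t1
            for (s2, i2, o2, w2, d2) in t2
            if o1 == i2]

style = [(0, "casual", "formal", 0.5, 1)]   # speaking-style translation arc
lm = [(0, "formal", "formal", 0.25, 1)]     # out-of-domain LM arc
arcs = compose(style, lm)
```

The composed machine reads in-domain (casual) words while scoring them with the out-of-domain (formal) model, which is exactly the integration the abstract describes.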


International Conference on Acoustics, Speech, and Signal Processing | 2002

A recognition method with parametric trajectory synthesized using direct relations between static and dynamic feature vector time series

Yasuhiro Minami; Erik McDermott; Atsushi Nakamura; Shigeru Katagiri

Parametric trajectory models have been proposed to exploit the time dependency between feature frames. However, parametric trajectory modeling methods are unable to take advantage of efficient HMM training and recognition methods. We have proposed a new speech recognition technique that generates a speech trajectory using an HMM-based speech synthesis method. This method generates an acoustic trajectory by maximizing the likelihood of the trajectory while taking into account the relations between the cepstrum, delta cepstrum, and delta-delta cepstrum. In this paper, we extend our method to a general formulation that includes a variance training procedure. Speaker-independent speech recognition experiments show that the proposed method is effective for speech recognition.
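The trajectory generation step reduces to a linear system: stacking the static and delta windows into a matrix W, the most likely static trajectory c under Gaussian means mu with precisions P solves (W'PW)c = W'P mu. The dimensions, window coefficients, and model statistics below are hypothetical, a minimal sketch of this maximum-likelihood parameter generation idea.

```python
import numpy as np

T = 3
# W maps the static trajectory c (length T) to stacked [statics; deltas]
W_static = np.eye(T)
W_delta = np.zeros((T, T))
for t in range(1, T - 1):
    W_delta[t, t - 1], W_delta[t, t + 1] = -0.5, 0.5
W = np.vstack([W_static, W_delta])

# Hypothetical HMM means (first 3: statics, last 3: deltas) and precisions
mu = np.array([0.0, 1.0, 0.0, 0.0, 0.5, 0.0])
P = np.diag(1.0 / np.array([1.0, 1.0, 1.0, 0.1, 0.1, 0.1]))

# ML trajectory: argmax_c N(Wc; mu, P^-1)  =>  (W'PW) c = W'P mu
A = W.T @ P @ W
b = W.T @ P @ mu
c = np.linalg.solve(A, b)
```

Because the delta constraint couples adjacent frames, the solved trajectory is smoother than the static means alone, which is what makes the generated trajectory usable for recognition.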

Collaboration


Yasuhiro Minami's top co-authors and their affiliations.

Kohji Dohsaka (Nippon Telegraph and Telephone)
Toyomi Meguro (Nippon Telegraph and Telephone)
Kiyohiro Shikano (Nara Institute of Science and Technology)
Atsushi Nakamura (Nippon Telegraph and Telephone)
Eisaku Maeda (Nippon Telegraph and Telephone)
Sadaoki Furui (Tokyo Institute of Technology)
Tatsuo Matsuoka (National Institute of Advanced Industrial Science and Technology)
Naonori Ueda (Nippon Telegraph and Telephone)
Shinji Watanabe (Mitsubishi Electric Research Laboratories)