
Publications


Featured research published by Kiyohiro Shikano.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1989

Phoneme recognition using time-delay neural networks

Alex Waibel; Toshiyuki Hanazawa; Geoffrey E. Hinton; Kiyohiro Shikano; Kevin J. Lang

The authors present a time-delay neural network (TDNN) approach to phoneme recognition that is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of their position in time, so that they are not blurred by temporal shifts in the input. As the recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMMs) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieved a recognition rate of 98.5% correct, while the rate obtained by the best of the HMMs was only 93.7%.
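As a rough illustration of the time-delay idea, the sketch below implements a single TDNN-style layer in NumPy: one set of window weights is slid across the input frames, so a learned feature detector responds to a pattern wherever it occurs in time. The layer sizes and activation are illustrative choices, not the configuration reported in the paper.

```python
import numpy as np

def tdnn_layer(x, w, b):
    """One time-delay layer: the same window weights w are applied at
    every time step, so a learned feature detector fires regardless of
    where the pattern occurs in time.

    x: (T, d_in) input frames; w: (k, d_in, d_out); b: (d_out,).
    Returns (T - k + 1, d_out) activations."""
    k, _, d_out = w.shape
    T = x.shape[0]
    out = np.empty((T - k + 1, d_out))
    for t in range(T - k + 1):
        window = x[t:t + k]                        # k consecutive frames
        out[t] = np.tanh(np.einsum('ki,kio->o', window, w) + b)
    return out

# Illustrative sizes only: 16 spectral coefficients per frame, a
# 3-frame delay window, 8 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((15, 16))
h = tdnn_layer(x, 0.1 * rng.standard_normal((3, 16, 8)), np.zeros(8))
print(h.shape)  # (13, 8)
```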


International Conference on Acoustics, Speech, and Signal Processing | 2000

A new phonetic tied-mixture model for efficient decoding

Akinobu Lee; Tatsuya Kawahara; Kazuya Takeda; Kiyohiro Shikano

A phonetic tied-mixture (PTM) model for efficient large-vocabulary continuous speech recognition is presented. It is synthesized from context-independent phone models with 64 mixture components per state by assigning different mixture weights according to the shared states of triphones; the mixtures are then re-estimated for optimization. The model achieves a word error rate of 7.0% on a 20,000-word newspaper dictation task, comparable to the best figure obtained by triphone models of much higher resolution. Compared with conventional PTMs that share Gaussians across all states, the proposed model is easily trained and reliably estimated. Furthermore, the model enables the decoder to perform efficient Gaussian pruning: computing only two out of the 64 components causes no loss of accuracy. Several pruning methods are proposed and compared, and the best reduces the computation to about 20%.
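The following is a minimal sketch of how a PTM state likelihood with Gaussian pruning might be computed, assuming diagonal-covariance Gaussians; the function and variable names and array shapes are illustrative, not taken from the paper.

```python
import numpy as np

def ptm_state_loglik(o, means, inv_vars, log_norms, log_weights, top=2):
    """Log-likelihood of one observation under one PTM state.

    All triphone states derived from a phone share the same 64-Gaussian
    codebook (means, inv_vars, log_norms); only log_weights differ per
    state.  Summing just the `top` best-scoring components approximates
    the full mixture, i.e. the Gaussian pruning described above."""
    diff = o - means                                     # (64, dim)
    log_dens = -0.5 * np.sum(diff * diff * inv_vars, axis=1) + log_norms
    scores = log_weights + log_dens
    best = np.sort(scores)[-top:]                        # pruned components
    m = best.max()
    return m + np.log(np.sum(np.exp(best - m)))          # log-sum-exp
```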


International Conference on Acoustics, Speech, and Signal Processing | 2001

Gaussian mixture selection using context-independent HMM

Akinobu Lee; Tatsuya Kawahara; Kiyohiro Shikano

We present a method to efficiently select Gaussian mixtures for fast acoustic likelihood computation. It uses context-independent models for the selection and back-off of the corresponding triphone models: for the k-best phone models in a preliminary evaluation, triphone models of higher resolution are applied, while the others are assigned likelihoods from the monophone models. This selection scheme assigns more reliable back-off likelihoods to the unselected states than conventional Gaussian selection based on a VQ codebook. It can also incorporate efficient Gaussian pruning in the preliminary evaluation, which offsets the increased size of the pre-selection model. Experimental results show that the proposed method achieves performance comparable to standard Gaussian selection and performs much better under aggressive pruning conditions. Together with phonetic tied-mixture modeling, the acoustic matching cost is reduced to about 14% with little loss of accuracy.
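A schematic sketch of the selection-and-back-off scheme is shown below. The scoring dictionaries are hypothetical stand-ins for the monophone and triphone state evaluators; the point is the control flow: cheap monophone scores first, full triphone evaluation only for the k-best phones, monophone back-off for the rest.

```python
def select_and_score(frame, mono_score, tri_score, tri_to_phone, k=8):
    """Per-frame scoring with monophone-based mixture selection.

    mono_score / tri_score map model names to likelihood functions
    (hypothetical stand-ins for HMM state evaluators)."""
    mono = {p: f(frame) for p, f in mono_score.items()}         # cheap pass
    kbest = set(sorted(mono, key=mono.get, reverse=True)[:k])   # k-best phones
    scores = {}
    for tri, f in tri_score.items():
        phone = tri_to_phone[tri]
        # Full triphone evaluation only for selected phones; the rest
        # back off to the (more reliable) monophone likelihood.
        scores[tri] = f(frame) if phone in kbest else mono[phone]
    return scores
```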


International Conference on Spoken Language Processing | 1996

Robust speech recognition with speaker localization by a microphone array

Takeshi Yamada; Satoshi Nakamura; Kiyohiro Shikano

This paper proposes robust speech recognition with speaker localization by an arrayed microphone (SLAM) to realize a hands-free speech interface in noisy environments. To localize the speaker direction accurately in low-SNR conditions, a speaker localization algorithm based on extracting pitch harmonics is introduced. To evaluate the performance of the proposed system, speech recognition experiments were carried out both in computer simulations and in real environments. The results show that the proposed system attains much higher speech recognition performance than a single microphone in both settings.
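The paper's localization algorithm is built on pitch-harmonic extraction; as a simpler stand-in, the sketch below estimates a speaker bearing from a two-microphone time difference of arrival using GCC-PHAT cross-correlation. All parameters are illustrative, and the pitch-harmonic weighting that gives the paper its low-SNR robustness is omitted.

```python
import numpy as np

def tdoa_bearing(x_left, x_right, fs, mic_dist, c=343.0):
    """Estimate a speaker bearing (degrees) from two microphone signals
    via GCC-PHAT: the lag maximizing the weighted cross-correlation is
    the time difference of arrival."""
    X, Y = np.fft.rfft(x_left), np.fft.rfft(x_right)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(cross)
    lag_max = int(fs * mic_dist / c)               # physically possible lags
    lags = np.r_[np.arange(lag_max + 1), np.arange(-lag_max, 0)]
    vals = np.r_[cc[:lag_max + 1], cc[-lag_max:]]
    tau = lags[np.argmax(vals)] / fs
    return np.degrees(np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0)))
```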


International Conference on Acoustics, Speech, and Signal Processing | 1998

Efficient representation of short-time phase based on group delay

Hideki Banno; Jinlin Lu; Satoshi Nakamura; Kiyohiro Shikano; Hideki Kawahara

An efficient representation of the short-time phase characteristics of speech sounds is proposed, based on findings that suggest the perceptual importance of phase. Subjective tests indicated that speech synthesized by the proposed method is indistinguishable from the original speech at a moderate degree of data compression. The proposed representation uses the lower-order coefficients of the inverse Fourier transform of the group delay of speech. It also avoids the voiced/unvoiced decision, an indispensable part of conventional speech coding algorithms. These features make the method potentially very useful in applications such as speech morphing.
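A minimal sketch of the representation as the abstract describes it, assuming a single windowed speech frame: compute the group delay as the negative frequency derivative of the unwrapped short-time phase, take its inverse Fourier transform, and keep only the low-order coefficients. The truncation order here is an arbitrary choice, not the paper's.

```python
import numpy as np

def group_delay_coeffs(frame, n_keep=16):
    """Compress the short-time phase of one windowed frame: group delay
    is the negative derivative of the unwrapped phase over frequency;
    truncating the inverse Fourier transform of the group delay keeps
    only its smooth, low-order structure."""
    spec = np.fft.rfft(frame)
    phase = np.unwrap(np.angle(spec))
    gd = -np.diff(phase)                  # group delay on the frequency grid
    coeffs = np.fft.irfft(gd)             # expand like a cepstrum
    return coeffs[:n_keep]                # truncation = data compression
```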


International Conference on Acoustics, Speech, and Signal Processing | 2001

Unsupervised speaker adaptation based on sufficient HMM statistics of selected speakers

Shinichi Yoshizawa; Akira Baba; Kanako Matsunami; Yuichiro Mera; Miichi Yamada; Kiyohiro Shikano

We describe an efficient method for unsupervised speaker adaptation. The method is based on (1) selecting a subset of speakers who are acoustically close to the test speaker, and (2) calculating adapted model parameters from previously stored sufficient HMM statistics of the selected speakers' data. Only a small amount of unsupervised data from the test speaker is required, and because the sufficient HMM statistics are precomputed, adaptation is quick. Compared with a pre-clustering method, the proposed method obtains a better speaker cluster because the clustering is determined online from the test speaker's data. Experimental results show that the proposed method attains a larger improvement over the speaker-independent model than MLLR, while using only one unsupervised sentence utterance where MLLR usually requires more than ten supervised ones.
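The sketch below illustrates the two steps under some assumed data structures: each training speaker has a GMM scoring function for ranking, plus pre-stored per-state occupancy counts and first-order sums (the sufficient statistics) from which adapted means can be re-estimated without touching the raw training data. All names are hypothetical.

```python
import numpy as np

def adapt_means(test_utt, speaker_gmms, stats, n_select=20):
    """Select speakers whose GMMs score the unsupervised test utterance
    best, then re-estimate state means from their pre-stored sufficient
    statistics: occupancy counts stats[s]['gamma'] of shape (n_states,)
    and first-order sums stats[s]['gamma_x'] of shape (n_states, dim)."""
    ranked = sorted(speaker_gmms, key=lambda s: speaker_gmms[s](test_utt),
                    reverse=True)
    selected = ranked[:n_select]                          # acoustically close
    gamma = sum(stats[s]['gamma'] for s in selected)
    gamma_x = sum(stats[s]['gamma_x'] for s in selected)
    return gamma_x / gamma[:, None]                       # adapted means
```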


IEEE International Conference on Automatic Face and Gesture Recognition | 1998

Lip movement synthesis from speech based on hidden Markov models

Eli Yamamoto; Satoshi Nakamura; Kiyohiro Shikano

Speech intelligibility can be improved by adding lip and facial images to the speech signal, so lip-image synthesis plays an important role in realizing a natural, human-like face for computer agents. Synthesized lip movements can also compensate for missing auditory information for hearing-impaired people. We propose a novel lip-movement synthesis method based on a mapping from input speech using hidden Markov models (HMMs), and compare it with a conventional method using vector quantization (VQ). In the experiments, the error and the time-differential error between synthesized and original lip-movement images are used for evaluation. The results show that the error of the HMM-based method is 8.7% smaller than that of the VQ-based method, and that the HMM-based method reduces the time-differential error by 32% relative to VQ. The results also show that the errors are mostly caused by the phonemes /h/ and /Q/. Since the lip shapes of these phonemes depend strongly on the succeeding phoneme, context-dependent synthesis is applied to the HMM-based method to reduce the error. The improved HMM-based method reduces the error by 10.5% (and the differential error by 11%) compared with the original HMM-based method.
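At its core, the HMM-based mapping can be pictured as the short sketch below: decode the speech into an HMM state sequence and emit the lip parameters attached to each state. The decoder and mapping table are hypothetical placeholders; the contrast with VQ is that the state sequence carries temporal context rather than mapping each frame independently.

```python
import numpy as np

def synthesize_lips(speech_frames, viterbi_decode, state_to_lip):
    """Map speech to lip parameters through an HMM state sequence.
    viterbi_decode and state_to_lip are hypothetical placeholders for
    the trained audio HMM decoder and the per-state lip parameters."""
    states = viterbi_decode(speech_frames)       # best state per frame
    return np.array([state_to_lip[s] for s in states])
```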


International Conference on Acoustics, Speech, and Signal Processing | 1998

Robust speech recognition in car environments

Makoto Shozakai; Satoshi Nakamura; Kiyohiro Shikano

A user-friendly speech interface in a car cabin is strongly needed for safety reasons. This paper describes a robust speech recognition method that copes with both additive noise and multiplicative distortions. Known additive noise, whose source signal is available, is canceled by NLMS-VAD (normalized least mean squares with frame-wise voice activity detection); unknown additive noise, whose source signal is not available, is suppressed with CSS (continuous spectral subtraction). Furthermore, various multiplicative distortions are simultaneously compensated with E-CMN (exact cepstrum mean normalization), a speaker-dependent and environment-dependent CMN applied separately to speech and non-speech. Evaluation results of the proposed method in car cabin environments are described.
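For the known-noise case, a minimal NLMS sketch (without the paper's frame-wise VAD gating of the adaptation) might look as follows; the tap count and step size are illustrative.

```python
import numpy as np

def nlms_cancel(d, x, n_taps=64, mu=0.5, eps=1e-8):
    """Cancel a known additive noise: x is the available noise source
    signal, d the microphone signal.  An adaptive FIR filter estimates
    the noise path, and the residual e is the cleaned signal.  (The
    paper additionally freezes adaptation during speech via VAD.)"""
    w = np.zeros(n_taps)
    e = np.zeros(len(d))
    for n in range(n_taps, len(d)):
        x_win = x[n - n_taps:n][::-1]            # newest sample first
        y = w @ x_win                            # estimated noise component
        e[n] = d[n] - y                          # cleaned output sample
        w += mu * e[n] * x_win / (x_win @ x_win + eps)  # normalized update
    return e
```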


Human Language Technology | 1994

A large-vocabulary continuous speech recognition algorithm and its application to a multi-modal telephone directory assistance system

Yasuhiro Minami; Kiyohiro Shikano; Osamu Yoshioka; Satoshi Takahashi; Tomokazu Yamada; Sadaoki Furui



International Conference on Acoustics, Speech, and Signal Processing | 1996

Noise and room acoustics distorted speech recognition by HMM composition

Satoshi Nakamura; Tetsuya Takiguchi; Kiyohiro Shikano

This paper presents a robust speech recognition method based on HMM composition for speech distorted by noise and room acoustics. The method enables an improved user interface in which the user is not encumbered by microphone equipment. The proposed HMM composition naturally extends the HMM composition method for additive noise to the convolutional distortion of room acoustics. It is carried out in two steps: (1) composition of the HMMs of speech and of the acoustic transfer function in the cepstrum domain, and (2) composition of the distorted-speech and noise HMMs in the linear spectral domain. Speaker-dependent and speaker-independent word recognition experiments were carried out on a speech database contaminated by additive noise and convolutional room-acoustics distortion, including evaluations on unknown sound source positions. The results clarify the effectiveness of the proposed method.
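A minimal sketch of the two composition steps for Gaussian mean vectors, assuming precomputed DCT matrices between the log-spectral and cepstral domains; the handling of variances and state alignment is omitted, and the matrix names are hypothetical.

```python
import numpy as np

def compose_mean(mu_speech_cep, mu_h_cep, mu_noise_lin, C, C_inv):
    """Compose one Gaussian mean in two steps.

    Step 1 (cepstrum domain): convolution with the room's transfer
    function is additive in the cepstrum, so just add its mean.
    Step 2 (linear spectral domain): additive noise adds in the linear
    spectrum, so map there, sum, and map back."""
    distorted_cep = mu_speech_cep + mu_h_cep         # step 1
    distorted_lin = np.exp(C_inv @ distorted_cep)    # cepstrum -> linear
    composed_lin = distorted_lin + mu_noise_lin      # step 2
    return C @ np.log(composed_lin)                  # linear -> cepstrum
```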

Collaboration


Dive into Kiyohiro Shikano's collaboration.

Top Co-Authors

Satoshi Nakamura
Nara Institute of Science and Technology

Akinobu Lee
Nagoya Institute of Technology

Yasuhiro Minami
Nippon Telegraph and Telephone