Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hidefumi Sawai is active.

Publication


Featured research published by Hidefumi Sawai.


International Conference on Acoustics, Speech, and Signal Processing | 1989

Consonant recognition by modular construction of large phonemic time-delay neural networks

Alex Waibel; Hidefumi Sawai; Kiyohiro Shikano

It is shown that neural networks for speech recognition can be constructed in a modular fashion by exploiting the hidden structure of previously trained phonetic subcategory networks. The performance of the resulting larger phonetic nets was found to be as good as that of the subcomponent nets by themselves. This approach avoids the excessive learning times that would be necessary to train larger networks and allows for incremental learning. Large time-delay neural networks constructed incrementally by applying these modular training techniques achieved a recognition performance of 96.0% for all consonants and 94.7% for all phonemes.
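As a rough illustration of the time-delay idea underlying these networks (a hypothetical sketch rather than the authors' implementation; the layer shape, the `delay` value, and the `tanh` nonlinearity are assumptions), each output unit sees a short window of consecutive input frames and the same weights are reused at every time shift, which is what gives the net its shift invariance:

```python
import numpy as np

def tdnn_layer(x, W, b, delay=2):
    """One time-delay layer: the output at frame t is computed from input
    frames t .. t+delay, and the same weights W, b are shared across all
    time positions.
    x: (T, F) input frames, W: (F * (delay + 1), H) weights, b: (H,) bias."""
    T, F = x.shape
    out = []
    for t in range(T - delay):
        window = x[t:t + delay + 1].reshape(-1)  # concatenate the delayed frames
        out.append(np.tanh(window @ W + b))      # squashing nonlinearity
    return np.stack(out)                         # (T - delay, H) activations
```

Stacking such layers and integrating the top-layer activations over time gives one score per phoneme class, which is the form of output the subcategory nets report.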


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1989

Modularity and scaling in large phonemic neural networks

Alex Waibel; Hidefumi Sawai; Kiyohiro Shikano

The authors train several small time-delay neural networks aimed at all phonemic subcategories (nasals, fricatives, etc.) and report excellent fine phonemic discrimination performance in all cases. Exploiting the hidden structure of these small phonemic subcategory networks, they propose several techniques that make it possible to grow larger nets in an incremental and modular fashion without loss in recognition performance and without the need for excessive training time or additional data. The techniques include class discriminatory learning, connectionist glue, selective/partial learning, and all-net fine tuning. A set of experiments shows that stop-consonant networks (BDGPTK) constructed from subcomponent BDG- and PTK-nets achieved up to 98.6% correct recognition, compared to 98.3% and 98.7% correct for the BDG- and PTK-nets. Similarly, an incrementally trained network aimed at all consonants achieved recognition scores of about 96% correct. These results are comparable to the performance of the subcomponent networks and significantly better than that of several alternative speech recognition methods.
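One way to picture the "connectionist glue" technique named above (a minimal sketch under assumed shapes; `bdg_net`, `ptk_net`, `glue_W`, and `glue_b` are hypothetical names, not the authors' code):

```python
import numpy as np

def glued_hidden_layer(x, bdg_net, ptk_net, glue_W, glue_b):
    """Combine the hidden activations of two pretrained subcategory nets
    (kept frozen at first) with a few freshly initialised 'glue' units;
    only the glue units and the new combined output layer then need
    training, before an optional all-net fine-tuning pass."""
    h_bdg = bdg_net(x)                      # hidden features of the BDG subnet
    h_ptk = ptk_net(x)                      # hidden features of the PTK subnet
    h_glue = np.tanh(x @ glue_W + glue_b)   # extra units trained on the joint task
    return np.concatenate([h_bdg, h_ptk, h_glue], axis=-1)
```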


International Conference on Acoustics, Speech, and Signal Processing | 1990

Integrated training for spotting Japanese phonemes using large phonemic time-delay neural networks

M. Miyatake; Hidefumi Sawai; Y. Minami; Kiyohiro Shikano

A description of integrated training methods for time-delay neural networks (TDNNs) for spotting all Japanese phonemes is presented. The time-shift invariance of the TDNN is confirmed using 2620 test words, with 95.8% of the phonemes correctly spotted. These experiments show that the spotting performance of the TDNN is high, while characteristic tendencies toward insertion and deletion errors are identified. To reduce these spotting errors, integrated training methods using various training token positions are proposed. These methods allow the TDNN to spot phonemes correctly at a rate of 98.0% and also make it possible to realize large-vocabulary, vocabulary-independent speech recognition. To verify this, large-vocabulary speech recognition with 5240 common Japanese words was performed using a predictive LR parser. Recognition rates of 92.6% and 97.6% were obtained for the first and second choices, respectively.


International Conference on Acoustics, Speech, and Signal Processing | 1989

Spotting Japanese CV-syllables and phonemes using the time-delay neural networks

Hidefumi Sawai; Alex Waibel; M. Miyatake; Kiyohiro Shikano

The authors present techniques for spotting Japanese CV syllables and phonemes in input speech based on TDNNs. They constructed a TDNN which can discriminate a single CV syllable or phoneme group. In Japanese there are only about one hundred syllables, or fewer than 30 phonemes, which makes it feasible to prepare and train the TDNN to spot all possible syllables or phonemes, extracted as training tokens from training words. Syllable and phoneme spotting experiments show excellent results, including a syllable spotting rate of better than 96.7% correct. These spotting techniques prove to be a significant step toward continuous speech recognition.
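A minimal sketch of spotting by scanning, assuming a trained scoring function `tdnn_score` that returns one activation per class for a fixed-length window; the window length and threshold here are arbitrary placeholders rather than values from the paper:

```python
def spot(frames, tdnn_score, window=15, threshold=0.5):
    """Slide a fixed-length window over the utterance and record every
    position where a class activation exceeds the threshold, yielding
    candidate (frame, label, score) spotting events."""
    hits = []
    for t in range(len(frames) - window + 1):
        scores = tdnn_score(frames[t:t + window])  # per-syllable/phoneme outputs
        for label, s in scores.items():
            if s > threshold:
                hits.append((t, label, s))
    return hits
```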


International Conference on Acoustics, Speech, and Signal Processing | 1991

Frequency-time-shift-invariant time-delay neural networks for robust continuous speech recognition

Hidefumi Sawai

The authors propose neural network (NN) architectures for robust speaker-independent, continuous speech recognition. One architecture is the frequency-time-shift-invariant time-delay neural network (FTDNN). Another architecture is based on windowing each layer of the NN with local time-frequency windows. This architecture makes it possible for the NN to capture global features from the upper layers as well as precise local features from the lower layers. Recognition experiments on easily confused phonemes were performed using /b/, /d/, /g/, /m/, /n/, and /N/ (syllabic nasal) phoneme tokens to verify robustness to variations of speech. Performance results for the different architectures are presented.
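The frequency-time-shift-invariance idea can be sketched as follows (not the FTDNN as published; the patch sizes and the `tanh` nonlinearity are assumptions): the same small weight block is applied to every time-frequency patch of the spectrogram, so the learned features tolerate shifts along both axes.

```python
import numpy as np

def ft_shift_invariant_layer(spec, W, b, t_win=3, f_win=5):
    """Apply one shared weight block to every (time, frequency) patch.
    spec: (T, F) log-spectrogram, W: (t_win * f_win, H) weights, b: (H,) bias."""
    T, F = spec.shape
    H = b.shape[0]
    out = np.zeros((T - t_win + 1, F - f_win + 1, H))
    for t in range(T - t_win + 1):
        for f in range(F - f_win + 1):
            patch = spec[t:t + t_win, f:f + f_win].reshape(-1)
            out[t, f] = np.tanh(patch @ W + b)   # same weights at every shift
    return out
```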


International Conference on Acoustics, Speech, and Signal Processing | 1991

TDNN-LR continuous speech recognition system using adaptive incremental TDNN training

Hidefumi Sawai

An investigation of speech recognition and language processing is described. The speech recognition part consists of large phonemic time-delay neural networks (TDNNs) which can automatically spot all 24 Japanese phonemes by simply scanning the input speech. The language processing part is made up of a predictive LR parser which predicts subsequent phonemes based on the currently proposed phonemes. This TDNN-LR recognition system provides large-vocabulary, continuous speech recognition. Recognition experiments on ATR's conference registration task were performed using the TDNN-LR method. Speaker-dependent phrase recognition rates of 65.1% for the first choice and 88.8% within the top five choices were attained. The effectiveness of adaptive incremental training using a small number of training tokens extracted from continuous speech was also confirmed for the TDNN-LR system.


Systems and Computers in Japan | 1991

Large-vocabulary spoken word recognition using time-delay neural network phoneme spotting and predictive LR-parsing

Yasuhiro Minami; Hidefumi Sawai; Masanori Miyatake

This paper proposes a large-vocabulary speech recognition system using a phoneme spotting method based on a time-delay neural network (TDNN) and a predictive LR parser. This is the first attempt to recognize large-vocabulary speech using neural networks. The prediction of phonemes in words is performed by the predictive LR parser. Time alignment between the phonemes predicted by the LR parser and the phoneme spotting results from the TDNN is realized using dynamic time warping (DTW). Speaker-dependent recognition of a 5240-word vocabulary, using 2620 test words uttered by a male announcer, resulted in a rate of 92.6 percent for the top choice and rates of 97.6 and 99.1 percent for the second and fifth choices, respectively.
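The DTW step can be sketched roughly as follows (a simplified monotonic alignment under assumed conventions; `spotting` and `phoneme_seq` are hypothetical inputs, with higher spotting scores taken as better):

```python
import numpy as np

def dtw_score(spotting, phoneme_seq):
    """Align a predicted phoneme sequence against frame-wise spotting scores.
    spotting: (T, P) array of TDNN outputs; phoneme_seq: list of phoneme
    indices predicted by the parser. At each frame the path either stays on
    the current phoneme or advances to the next one; returns the best total."""
    T = spotting.shape[0]
    N = len(phoneme_seq)
    D = np.full((T + 1, N + 1), -np.inf)   # accumulated scores
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            local = spotting[t - 1, phoneme_seq[n - 1]]
            D[t, n] = local + max(D[t - 1, n], D[t - 1, n - 1])
    return D[T, N]
```

In the system described above, this alignment score would rank the word or phrase hypotheses proposed by the parser; here it is reduced to a single best-path score for clarity.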


Systems and Computers in Japan | 1992

Performance comparison of neural network architectures for speaker-independent phoneme recognition

Satoru Nakamura; Hidefumi Sawai

We applied several types of time-delay neural networks (TDNNs), generally used for speaker-dependent and multispeaker speech recognition, to speaker-independent speech recognition and compared their performance. Six or 12 speakers were used to train each network, and recognition experiments for the voiced stops /b, d, g/ were performed in an open-speaker mode. The best recognition rates were 91.3 percent and 93.6 percent using six and 12 training speakers, respectively. We found that constructing modular networks, such as a modular TDNN in which each subnetwork corresponds to one speaker, is effective in reducing the number of training iterations needed, and gives slightly better performance than a single TDNN of comparable network capacity. This is because the modular networks make effective use of their limited capacity. On the other hand, a single TDNN with an increased number of hidden units showed a recognition rate comparable to that of the modular TDNN.


Systems and Computers in Japan | 1990

Spotting Phonemes and Syllables for Continuous Speech Recognition Using Time-Delay Neural Networks

Hidefumi Sawai; Masanori Miyatake; Alex Waibel; Kiyohiro Shikano

Phoneme or syllable spotting, if reliably achieved, provides a good solution to the spoken-word and continuous speech recognition problem. We would like to extend the encouraging performance of TDNNs to word and continuous speech recognition. We show techniques for spotting Japanese phonemes and CV syllables in input speech based on TDNNs. We constructed a TDNN which can discriminate a single phoneme or CV syllable. Phoneme and syllable spotting experiments show excellent results, including phoneme and syllable spotting rates of 92 percent and 96.7 percent, respectively. These spotting techniques prove to be a step toward continuous speech recognition.


The Journal of The Institute of Image Information and Television Engineers | 1989

Neural networks applied to speech processing.

Shinichi Tamura; Hidefumi Sawai; Masami Nakamura; Kiyohiro Shikano

The application of neural networks in the speech field has become active. This survey describes learning algorithms for multilayer neural networks and the possibility of applying them to nonlinear signal processing, and introduces application examples in speech recognition, speech synthesis, word-sequence prediction, noise suppression, and information compression.

Collaboration


Dive into Hidefumi Sawai's collaborations.

Top Co-Authors


Kiyohiro Shikano

Nara Institute of Science and Technology

Top Co-Authors


Alex Waibel

Karlsruhe Institute of Technology
