Shoei Sato
Waseda University
Publications
Featured research published by Shoei Sato.
International Conference on Acoustics, Speech, and Signal Processing | 2000
Toru Imai; Akio Kobayashi; Shoei Sato; Hideki Tanaka; Akio Ando
This paper describes a 2-pass decoder that progressively outputs the latest available results, used for real-time closed captioning of Japanese broadcast news. The decoder practically eliminates the disadvantage of multiple-pass decoders, which delay a decision until the end of a sentence. During the first pass of the search, the proposed decoder periodically executes a second pass that rescores the partial N-best word sequences obtained up to that time. If the rescored best word sequence has words in common with the previous one, that part is regarded as likely to be correct and is committed as part of the final result. This method is not theoretically optimal, but it responds quickly with a negligible increase in word errors. In a recognition experiment on Japanese broadcast news, the decoder worked with an average decision delay of 554 msec for each word and degraded word accuracy by only 0.22%.
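As a rough illustration of the commit rule described in this abstract, the sketch below (hypothetical Python, not the authors' code) emits exactly those words on which the current and previous partial rescoring passes agree; the function and example word sequences are invented for illustration.

def commit_stable_words(prev_best, curr_best, already_committed):
    """Return the newly committable words: the common prefix of the two
    rescored hypotheses beyond what has already been emitted."""
    stable = []
    for prev_word, curr_word in zip(prev_best, curr_best):
        if prev_word != curr_word:
            break
        stable.append(curr_word)
    return stable[already_committed:]

# Two successive partial rescoring results (invented word sequences):
pass_1 = ["kyou", "no", "nyuusu", "wa"]
pass_2 = ["kyou", "no", "nyuusu", "desu", "ga"]
print(commit_stable_words(pass_1, pass_2, already_committed=0))
# -> ['kyou', 'no', 'nyuusu']  (sent to the caption stream immediately)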
IEICE Transactions on Information and Systems | 2007
Toru Imai; Shoei Sato; Shinichi Homma; Kazuo Onoe; Akio Kobayashi
This paper describes a new method to detect speech segments online while identifying gender attributes for efficient dual gender-dependent speech recognition and broadcast news captioning. The proposed online speech detection performs dual-gender phoneme recognition and detects a start-point and an end-point based on the ratio between the cumulative phoneme likelihood and the cumulative non-speech likelihood, with a very small delay from the audio input. While detecting the speech segments, the phoneme recognizer also identifies gender attributes with high discrimination in order to guide the subsequent dual-gender continuous speech recognizer efficiently. As soon as the start-point is detected, the continuous speech recognizer with parallel gender-dependent acoustic models starts its search and allows search transitions between the male and female models within a speech segment based on the gender attributes. Speech recognition experiments on conversational commentaries and field reporting from Japanese broadcast news showed that the proposed speech detection method was effective in reducing the false rejection rate from 4.6% to 0.53%, and also reduced recognition errors in comparison with a conventional method using adaptive energy thresholds. It was also effective in identifying the gender attributes, with a correct rate of 99.7% of words. With the new speech detection and gender identification, the proposed dual-gender speech recognition significantly reduced the word error rate, by 11.2% relative to a conventional gender-independent system, while keeping the computational cost feasible for real-time operation.
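The start-point rule can be illustrated with a small sketch: assuming frame-wise log-likelihoods from hypothetical speech and non-speech models, the cumulative log-likelihood ratio triggers a start-point once it exceeds a threshold (the threshold and toy values below are assumptions, not the paper's settings).

import numpy as np

def detect_start_point(speech_loglik, nonspeech_loglik, threshold=5.0):
    """Return the first frame index at which the cumulative log-likelihood
    ratio favours speech by more than `threshold`, or None if it never does."""
    ratio = np.cumsum(np.asarray(speech_loglik) - np.asarray(nonspeech_loglik))
    hits = np.where(ratio > threshold)[0]
    return int(hits[0]) if hits.size else None

# Toy frame-wise log-likelihoods (invented values):
speech_ll = [-2.0, -1.5, -0.5, -0.3, -0.2]
nonspeech_ll = [-1.0, -1.2, -2.5, -3.0, -3.5]
print(detect_start_point(speech_ll, nonspeech_ll))  # -> 4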
International Conference on Acoustics, Speech, and Signal Processing | 2012
Takahiro Oku; Shoei Sato; Akio Kobayashi; Shinichi Homma; Toru Imai
Low-latency speaker diarization is desirable for online speaker adaptation in real-time speech recognition. Especially in spontaneous conversations, several speakers tend to speak alternately and continuously without any silence between utterances. We therefore propose a speaker diarization method that detects speaker-change points and determines the speaker with a fixed low latency on the basis of the Bayesian information criterion (BIC), using acoustic features classified into multiple phoneme classes. To improve the accuracy of speaker diarization under the low-latency condition, the speaker decision is made continuously at each phoneme boundary. In an experiment on conversational broadcast news programs, our diarization method reduced the speaker diarization error rate by a relative 20.0% compared to the conventional BIC with a single phoneme class. Online speaker adaptation applied in a speech recognition experiment reduced the word error rate at speaker-change points by a relative 7.8%.
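For orientation, the sketch below shows the standard delta-BIC speaker-change test that the method builds on, using a single pooled feature set; the per-phoneme-class computation and the low-latency decision schedule from the paper are omitted, and all values are toy assumptions.

import numpy as np

def delta_bic(features, split, lam=1.0):
    """features: (N, d) acoustic feature matrix; split: candidate boundary.
    Positive delta-BIC means two speaker models fit the window better than one."""
    x, x1, x2 = features, features[:split], features[split:]
    n, d = x.shape

    def logdet_cov(v):
        cov = np.cov(v, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * len(x1) * logdet_cov(x1)
            - 0.5 * len(x2) * logdet_cov(x2)
            - lam * penalty)

# Toy window: two segments drawn from clearly different Gaussians.
rng = np.random.default_rng(0)
window = np.vstack([rng.normal(0, 1, (200, 12)), rng.normal(3, 1, (200, 12))])
print(delta_bic(window, split=200) > 0)  # True -> speaker change detected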
IEICE Transactions on Information and Systems | 2007
Akio Kobayashi; Kazuo Onoe; Shinichi Homma; Shoei Sato; Toru Imai
This paper describes a new criterion for speech recognition that uses an integrated confidence measure to minimize the word error rate (WER). Conventional criteria for WER minimization obtain the expected WER of a sentence hypothesis merely by comparing it with other hypotheses in an n-best list. The proposed criterion estimates the expected WER by using an integrated confidence measure with word posterior probabilities for a given acoustic input. The integrated confidence measure, which is implemented as a classifier based on maximum entropy (ME) modeling or support vector machines (SVMs), is used to acquire probabilities reflecting whether the word hypotheses are correct. The classifier combines a variety of confidence measures and can deal with a temporal sequence of them to attain a more reliable confidence. Our proposed criterion for minimizing WER achieved a WER of 9.8%, a 3.9% reduction relative to conventional n-best rescoring methods, in transcribing Japanese broadcast news under various conditions such as noisy field and spontaneous speech.
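A heavily simplified sketch of the idea: given an n-best list and per-word confidences approximating the probability that each word is correct (here a plain dictionary standing in for the ME/SVM classifier), the hypothesis with the smallest expected number of word errors is selected. All names and values below are invented for illustration.

def expected_errors(hypothesis, confidence):
    """Expected number of word errors: sum of (1 - P(correct)) over the words."""
    return sum(1.0 - confidence.get(word, 0.5) for word in hypothesis)

def min_expected_wer(nbest, confidence):
    """Pick the n-best hypothesis with the smallest expected number of errors."""
    return min(nbest, key=lambda hyp: expected_errors(hyp, confidence))

# Toy n-best list with invented per-word confidences:
nbest = [["tokyo", "de", "kaigi"], ["tokyo", "de", "kaigi", "ga"]]
conf = {"tokyo": 0.95, "de": 0.9, "kaigi": 0.8, "ga": 0.4}
print(min_expected_wer(nbest, conf))  # -> ['tokyo', 'de', 'kaigi']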
IEICE Transactions on Information and Systems | 2008
Kazuo Onoe; Shoei Sato; Shinichi Homma; Akio Kobayashi; Toru Imai; Tohru Takagi
The extraction of acoustic features for robust speech recognition is very important for improving performance in realistic environments. The bi-spectrum, based on the Fourier transform of the third-order cumulants, expresses the non-Gaussianity and the phase information of the speech signal, showing the dependency between frequency components. In this letter, we propose a method of extracting short-time bi-spectral acoustic features by averaging the features within a single frame. When merged with conventional Mel-frequency cepstral coefficients (MFCC), which are based on the power spectrum, by principal component analysis (PCA), the proposed features yielded a 6.9% relative reduction in word error rate in Japanese broadcast news transcription experiments.
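The core quantity can be sketched as follows: for one windowed frame, the bi-spectrum slice B(f1, f2) = X(f1) X(f2) X*(f1 + f2) is computed from the FFT. The sub-band averaging and the PCA fusion with MFCC described in the abstract are omitted, and the frame below is synthetic.

import numpy as np

def frame_bispectrum(frame, n_fft=256):
    """Single-frame bi-spectrum estimate B(f1, f2) = X(f1) X(f2) conj(X(f1+f2))."""
    x = np.fft.rfft(np.hanning(len(frame)) * frame, n_fft)
    k = np.arange(len(x))
    f1, f2 = np.meshgrid(k, k, indexing="ij")
    valid = (f1 + f2) < len(x)            # keep frequency pairs inside the band
    b = np.zeros((len(x), len(x)), dtype=complex)
    b[valid] = x[f1[valid]] * x[f2[valid]] * np.conj(x[f1[valid] + f2[valid]])
    return b

# Synthetic frame: a sinusoid with a little noise (invented, for illustration).
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 0.05 * np.arange(200)) + 0.1 * rng.standard_normal(200)
print(np.abs(frame_bispectrum(frame)).max())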
Journal of Information Technology Research | 2014
Hiroyuki Segi; Kazuo Onoe; Shoei Sato; Akio Kobayashi; Akio Ando
Tied-mixture HMMs have been proposed as an acoustic model for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it has been unclear which acoustic model is superior to the other given the same training data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for HMMs, have not been compared. Therefore, in this paper, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of both HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that the training method using the LBG algorithm achieves a 90% reduction in training time compared to the training method using the EM algorithm, without degradation of recognition accuracy.
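Since the paper contrasts LBG- and EM-based training, a minimal sketch of the LBG codebook step (binary splitting followed by k-means refinement) may help; Gaussian-mixture and HMM details are omitted, and the data are toy values.

import numpy as np

def lbg(data, n_codewords, eps=1e-3, n_iters=20):
    """LBG: start from the global mean, repeatedly split every codeword into a
    perturbed pair, and refine the doubled codebook with k-means iterations."""
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iters):
            dists = np.linalg.norm(data[:, None, :] - codebook[None], axis=-1)
            assign = dists.argmin(axis=1)
            for j in range(len(codebook)):
                members = data[assign == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
    return codebook

# Toy 2-D "features" forming two clusters (invented data):
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) + rng.choice([-3.0, 3.0], size=(500, 1))
print(lbg(data, n_codewords=4).round(2))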
IEICE Transactions on Information and Systems | 2008
Shoei Sato; Akio Kobayashi; Kazuo Onoe; Shinichi Homma; Toru Imai; Tohru Takagi; Tetsunori Kobayashi
We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments or speaking styles; such a robust stream is expected to show discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends the dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real time from the mutual information between an input stream and the active HMM states in the search space, without an additional likelihood calculation. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters by taking into account the human auditory system's ability to extract amplitude and frequency modulations. Accordingly, features representing energy, amplitude drift, and resonant frequency drift are integrated. These features are expected to provide complementary clues for speech recognition. Speech recognition experiments on field reports and spontaneous commentaries from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% in field reports and 4.7% in spontaneous commentaries, relative to the best result obtained from a single stream.
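A simplified, entropy-based version of the frame-wise weighting can be sketched as follows: a stream whose active-state posteriors are sharply peaked (low normalized entropy) receives a larger weight. The exact mutual-information formulation of the paper is not reproduced here, and all values are toy assumptions.

import numpy as np

def stream_weights(active_state_logliks):
    """One array of active-state log-likelihoods per stream for a single frame;
    returns normalized frame-wise stream weights."""
    weights = []
    for loglik in active_state_logliks:
        post = np.exp(loglik - loglik.max())
        post /= post.sum()
        entropy = -np.sum(post * np.log(post + 1e-12))
        max_entropy = np.log(len(post))          # entropy of a uniform active set
        weights.append(1.0 - entropy / max_entropy)
    w = np.asarray(weights)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))

# Toy frame: stream 0 is sharply peaked (discriminative), stream 1 is flat.
s0 = np.array([-1.0, -8.0, -9.0, -9.5])
s1 = np.array([-3.0, -3.1, -3.2, -3.0])
print(stream_weights([s0, s1]).round(2))  # stream 0 receives the larger weight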
IEICE Transactions on Information and Systems | 2006
Shoei Sato; Kazuo Onoe; Akio Kobayashi; Toru Imai
This paper proposes a new method of compensating acoustic scores in the Viterbi search for robust speech recognition. The method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional techniques of subtraction and adaptation. It uses the likelihoods of the noise models in two ways. One is to calculate a confidence factor for each input frame by comparing the likelihoods of the speech models and the noise models; the weight of the acoustic score for a noisy frame is then reduced according to the value of the confidence factor for compensation. The other is to use the likelihood of the noise model as an alternative to that of a silence model when given noisy input. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of keywords by 17.9%, which is expected to lead to an improvement in metadata extraction accuracy.
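The frame-wise confidence weighting can be illustrated with a toy sketch; the sigmoid mapping from the speech-vs-noise likelihood margin to a confidence factor is an assumption made for illustration, not the paper's exact formula.

import numpy as np

def frame_confidence(speech_loglik, noise_loglik, scale=0.2):
    """Map the speech-vs-noise log-likelihood margin to a [0, 1] confidence
    factor with a sigmoid (an invented mapping for illustration)."""
    return 1.0 / (1.0 + np.exp(-scale * (speech_loglik - noise_loglik)))

def compensated_acoustic_score(acoustic_logscore, confidence):
    """Low confidence compresses the (negative) acoustic log-score toward zero,
    so the decoder leans more on the language score for that frame."""
    return confidence * acoustic_logscore

# Toy frames: one clean, one noisy (invented log-likelihoods).
for speech_ll, noise_ll in [(-10.0, -25.0), (-20.0, -12.0)]:
    c = frame_confidence(speech_ll, noise_ll)
    print(round(c, 2), round(compensated_acoustic_score(-15.0, c), 2))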
International Conference on Computers Helping People with Special Needs | 2018
Manon Ichiki; Toshihiro Shimizu; Atsushi Imai; Tohru Takagi; Mamoru Iwabuchi; Kiyoshi Kurihara; Taro Miyazaki; Tadashi Kumano; Hiroyuki Kaneko; Shoei Sato; Nobumasa Seiyama; Yuko Yamanouchi; Hideki Sumiyoshi
We are conducting research on “automated audio description (AAD),” which automatically generates audio descriptions from real-time competition data so that visually impaired people can enjoy live sports programs. However, AAD overlaps with the live television commentary voice, making both difficult to hear. In this paper, we first show that the game situation is conveyed effectively when visually impaired persons listen to the AAD alone. We then report the results of experiments on the following items for solving the overlap issue: (1) there is a difference in optimum volume level between the live commentary and AAD; (2) ease of listening differs depending on the characteristics of the text-to-speech synthesizer used for AAD; and (3) playing back AAD through a speaker placed separately from the TV speaker makes both voices easier to listen to. These results suggest that, depending on the presentation method, AAD can be made easy to listen to even when it overlaps the live television commentary.
International Conference on Acoustics, Speech, and Signal Processing | 2016
Akio Kobayashi; Kazuo Onoe; Manon Ichiki; Shoei Sato
This paper compares unsupervised sequence training techniques for deep neural networks (DNNs) used in broadcast transcription. Recent progress in the digital archiving of broadcast content has made it easier to access large amounts of speech data. Such archived data are helpful for acoustic/language modeling in live-broadcast captioning based on automatic speech recognition (ASR). In Japanese broadcasts, however, archived programs such as sports news do not always have closed captions, which are typically used as references. Thus, unsupervised adaptation techniques are needed for performance improvements even when a DNN is used as the acoustic model. In this paper, we compared three unsupervised sequence adaptation techniques: maximum a posteriori (MAP), entropy minimization, and Bayes risk minimization. Experimental results for transcribing sports news programs showed that the best ASR performance was obtained with Bayes risk minimization, which reflects information about expected errors, while comparable results were obtained with MAP, the simplest form of unsupervised sequence adaptation.
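Of the three criteria, entropy minimization is the easiest to sketch: the adaptation objective is the average entropy of the DNN's frame-wise state posteriors on unlabeled data, which the sketch below merely evaluates (the adaptation updates themselves, and the MAP and Bayes-risk criteria, are omitted; the toy posteriors are invented).

import numpy as np

def posterior_entropy(posteriors):
    """posteriors: (T, S) array of per-frame HMM-state posteriors from the DNN;
    returns the average frame entropy, the quantity that entropy-minimization
    adaptation drives down on unlabeled data."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

# Toy posteriors over 3 states: confident frames vs. uncertain frames.
sharp = np.array([[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]])
vague = np.array([[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]])
print(posterior_entropy(sharp) < posterior_entropy(vague))  # -> True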