Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Takahiro Shinozaki is active.

Publications


Featured research published by Takahiro Shinozaki.


Journal of the Acoustical Society of America | 2004

Noise‐robust speech recognition using multi‐band spectral features

Yoshitaka Nishimura; Takahiro Shinozaki; Koji Iwano; Sadaoki Furui

In most state-of-the-art automatic speech recognition (ASR) systems, speech is converted into a time function of the MFCC (Mel-Frequency Cepstrum Coefficient) vector. However, a problem with the MFCC is that noise effects spread over all the coefficients even when the noise is limited to a narrow frequency band. If a spectral feature is used directly, this problem can be avoided, and robustness against noise can therefore be expected to increase. Although various studies on spectral-domain features have been conducted, improvements in recognition performance have been reported only under limited noise conditions. This paper proposes a novel multi-band ASR method using a new log-spectral-domain feature. To increase robustness, the log-spectrum features are normalized by applying three processes: subtracting the mean log-energy for each frame, emphasizing spectral peaks, and subtracting the log-spectral mean averaged over an utterance. Spectral component likelihood values ...
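The three normalization steps can be sketched as follows (a minimal NumPy sketch; the function name and the peak-emphasis operation are illustrative assumptions, not the paper's exact processing):

```python
import numpy as np

def normalize_log_spectrum(log_spec):
    """log_spec: (frames, bands) array of log filter-bank energies."""
    # 1) Subtract the mean log-energy of each frame.
    x = log_spec - log_spec.mean(axis=1, keepdims=True)
    # 2) Emphasize spectral peaks (hypothetical operation: keep only
    #    components above the per-frame mean; the paper's exact method differs).
    x = np.maximum(x, 0.0)
    # 3) Subtract the log-spectral mean averaged over the utterance.
    x = x - x.mean(axis=0, keepdims=True)
    return x
```

Step 3 guarantees that every band has zero mean over the utterance, which removes stationary channel and noise offsets.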


International Conference on Acoustics, Speech, and Signal Processing | 2002

Analysis on individual differences in automatic transcription of spontaneous presentations

Takahiro Shinozaki; Sadaoki Furui

This paper reports an analysis of individual differences in spontaneous presentation speech recognition performance. Ten minutes from each presentation given by 50 male speakers, 500 minutes in total, were automatically recognized for the analysis. Correlation and regression analyses were applied to the word recognition accuracy and various speaker attributes. A restricted set of speaker attributes comprising the speaking rate, the out-of-vocabulary rate, and the repair rate was found to be the most significant in producing individual differences in word accuracy. Unsupervised MLLR speaker adaptation improved word accuracy but did not change the structure of the individual differences. Approximately half of the variance in word accuracy was explained by a regression model using this limited set of three attributes.
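The kind of analysis described, fitting word accuracy against a few speaker attributes and measuring the explained variance, can be sketched with synthetic data (all numbers below are invented for illustration, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three attributes found most significant:
# speaking rate, out-of-vocabulary rate, and repair rate (values invented).
n = 50
X = rng.normal(size=(n, 3))
# Hypothetical word accuracy driven partly by the attributes plus noise.
accuracy = 80.0 - 3.0 * X[:, 0] - 2.0 * X[:, 1] - 1.5 * X[:, 2] \
           + rng.normal(scale=2.0, size=n)

# Least-squares regression with an intercept term.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, accuracy, rcond=None)
pred = A @ coef

# Coefficient of determination: fraction of accuracy variance explained.
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - accuracy.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

An R² near 0.5 would correspond to the paper's finding that roughly half the variance is explained by the three attributes.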


Computer Speech & Language | 2008

Cross-validation and aggregated EM training for robust parameter estimation

Takahiro Shinozaki; Mari Ostendorf

A new maximum likelihood training algorithm is proposed that compensates for weaknesses of the EM algorithm by using cross-validation likelihood in the expectation step to avoid overtraining. By using a set of sufficient statistics associated with a partitioning of the training data, as in parallel EM, the algorithm has the same order of computational requirements as the original EM algorithm. Another variation uses an approximation of bagging to reduce variance in the E-step but at a somewhat higher cost. Analyses using GMMs with artificial data show the proposed algorithms are more robust to overtraining than the conventional EM algorithm. Large vocabulary recognition experiments on Mandarin broadcast news data show that the methods make better use of more parameters and give lower recognition error rates than standard EM training.
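The core idea, computing E-step posteriors for each data partition from the sufficient statistics of the other partitions, can be sketched for a toy 1-D GMM (function names, initialization, and fold assignment here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Component densities: (n,) data against (n_comp,) parameters -> (n, n_comp)."""
    return np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step_stats(xk, w, mu, var):
    """Posteriors -> zeroth/first/second-order sufficient statistics for one fold."""
    r = w * gauss_pdf(xk, mu, var)
    r /= r.sum(axis=1, keepdims=True)
    return r.sum(axis=0), r.T @ xk, r.T @ (xk ** 2)

def params_from_stats(N, S1, S2):
    w = N / N.sum()
    mu = S1 / N
    var = np.maximum(S2 / N - mu ** 2, 1e-6)
    return w, mu, var

def cv_em_gmm(x, n_comp=2, n_folds=5, n_iter=25):
    """Toy 1-D CV-EM: the E-step for fold k uses parameters estimated
    from the sufficient statistics of all the other folds."""
    folds = np.arange(len(x)) % n_folds
    w = np.full(n_comp, 1.0 / n_comp)
    mu = np.quantile(x, (np.arange(n_comp) + 0.5) / n_comp)  # spread-out init
    var = np.full(n_comp, x.var())
    # Initial per-fold statistics from the common starting model.
    stats = [e_step_stats(x[folds == k], w, mu, var) for k in range(n_folds)]
    for _ in range(n_iter):
        new_stats = []
        for k in range(n_folds):
            # Aggregate statistics of the other folds (parallel-EM style).
            N, S1, S2 = (sum(s[i] for j, s in enumerate(stats) if j != k)
                         for i in range(3))
            wk, muk, vark = params_from_stats(N, S1, S2)
            new_stats.append(e_step_stats(x[folds == k], wk, muk, vark))
        stats = new_stats
        # M-step over all folds gives this iteration's model.
        N, S1, S2 = (sum(s[i] for s in stats) for i in range(3))
        w, mu, var = params_from_stats(N, S1, S2)
    return w, mu, var
```

Because only per-fold sufficient statistics are stored, the cost stays of the same order as parallel EM, as the abstract notes.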


International Conference on Acoustics, Speech, and Signal Processing | 2006

HMM State Clustering Based on Efficient Cross-Validation

Takahiro Shinozaki

Decision tree state clustering is explored using a cross-validation likelihood criterion. Cross-validation likelihood is more reliable than the conventional likelihood and can be efficiently computed using sufficient statistics. It results in a better tying structure and provides a termination criterion that does not rely on empirical thresholds. Large vocabulary recognition experiments on conversational telephone speech show that, for large numbers of tied states, the cross-validation method gives more robust results.
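The efficiency trick, evaluating held-out likelihoods purely from per-fold sufficient statistics, can be illustrated for a single 1-D Gaussian (a simplified sketch; the paper applies this to HMM state output distributions):

```python
import numpy as np

def cv_loglik(N, S1, S2):
    """K-fold CV log-likelihood of a pooled 1-D Gaussian, computed only from
    per-fold sufficient statistics: counts N, sums S1, sums of squares S2."""
    total = 0.0
    for k in range(len(N)):
        # Gaussian estimated from all folds except k.
        n = N.sum() - N[k]
        mu = (S1.sum() - S1[k]) / n
        var = (S2.sum() - S2[k]) / n - mu ** 2
        # Held-out log-likelihood of fold k, expanded so that only
        # fold k's sufficient statistics are needed (no raw data).
        total += (-0.5 * N[k] * np.log(2 * np.pi * var)
                  - 0.5 * (S2[k] - 2 * mu * S1[k] + N[k] * mu ** 2) / var)
    return total
```

Comparing the CV likelihood of a pooled cluster against the sum over candidate sub-clusters yields a split/stop decision that needs no empirical threshold.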


International Conference on Acoustics, Speech, and Signal Processing | 2003

Unsupervised class-based language model adaptation for spontaneous speech recognition

Tadasuke Yokoyama; Takahiro Shinozaki; Koji Iwano; Sadaoki Furui

This paper proposes an unsupervised, batch-type, class-based language model adaptation method for spontaneous speech recognition. The word classes are automatically determined by maximizing the average mutual information between classes on a training set. A class-based language model is built from recognition hypotheses obtained with a general word-based language model, and is linearly interpolated with the general language model. All input utterances are then re-recognized using the adapted language model. The proposed method was applied to the recognition of spontaneous presentations and was found to improve recognition accuracy for all presentations. The best condition used 100 word classes, yielding an absolute improvement of 2.3% in word accuracy averaged over all speakers.
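The interpolation of the class-based model with the general word-based model can be sketched as follows (all probabilities and identifiers below are invented toy values, not from the paper):

```python
def interp_prob(word, history, p_word, p_class, word2class, lam=0.5):
    """Linearly interpolate a word bigram LM with a class bigram LM:
    P(w|h) = lam * P_word(w|h) + (1-lam) * P(c(w)|c(h)) * P(w|c(w))."""
    cw, ch = word2class[word], word2class[history]
    return (lam * p_word.get((history, word), 0.0)
            + (1 - lam) * p_class["trans"].get((ch, cw), 0.0)
              * p_class["emit"].get((cw, word), 0.0))
```

In the batch-type scheme, the class model's counts come from the first-pass recognition hypotheses, and all utterances are decoded again with the interpolated model.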


International Conference on Acoustics, Speech, and Signal Processing | 2001

Ubiquitous speech processing

Sadaoki Furui; Koji Iwano; Chiori Hori; Takahiro Shinozaki; Yohei Saito; Satoshi Tamura

In the ubiquitous (pervasive) computing era, it is expected that everybody will access information services anytime anywhere, and these services are expected to augment various human intelligent activities. Speech recognition technology can play an important role in this era by providing: (a) conversational systems for accessing information services and (b) systems for transcribing, understanding and summarizing ubiquitous speech documents such as meetings, lectures, presentations and voicemails. In the former systems, robust conversation using wireless handheld/hands-free devices in the real mobile computing environment will be crucial as will multimodal speech recognition technology. To create the latter systems, the ability to understand and summarize speech documents is one of the key requirements. The paper presents technological perspectives and introduces several research activities being conducted from these standpoints in our research group.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Structure discovery of deep neural network based on evolutionary algorithms

Takahiro Shinozaki; Shinji Watanabe

Deep neural networks (DNNs) are constructed from highly complicated configurations, including the network structure and several tuning parameters (the number of hidden units and the learning rate in each layer), which greatly affect the performance of speech processing applications. Reaching optimal performance in such systems requires deep understanding of and expertise in DNNs, which limits DNN system development to skilled experts. To overcome this problem, this paper proposes an efficient optimization strategy for DNN structure and parameters using evolutionary algorithms. The proposed approach parametrizes the DNN structure as a directed acyclic graph, representing it by a simple binary vector. A genetic algorithm and the covariance matrix adaptation evolution strategy (CMA-ES) efficiently optimize performance jointly with respect to this binary vector and the other tuning parameters. Experiments on phoneme recognition and spoken digit detection tasks show the effectiveness of the proposed approach, which discovers appropriate DNN structures automatically.
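The binary-vector search can be illustrated with a minimal genetic algorithm (the fitness function, population size, and genetic operators here are illustrative choices; the paper additionally uses CMA-ES for the continuous tuning parameters):

```python
import numpy as np

def genetic_search(fitness, n_bits, pop=20, gens=30, seed=0):
    """Minimal GA over binary vectors, as one might use to search DNN
    connection structures; `fitness` is any user-supplied score to maximize."""
    rng = np.random.default_rng(seed)
    P = rng.integers(0, 2, size=(pop, n_bits))
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in P])
        order = np.argsort(-scores)          # higher fitness first
        parents = P[order[: pop // 2]]       # truncation selection (elitist)
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_bits)    # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_bits) < 1.0 / n_bits   # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        P = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in P])
    return P[np.argmax(scores)]
```

In the structure-search setting, each bit would enable or disable an edge of the candidate DAG, and the fitness would be a held-out recognition score.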


International Conference on Acoustics, Speech, and Signal Processing | 2007

Cross-Validation EM Training for Robust Parameter Estimation

Takahiro Shinozaki; Mari Ostendorf

A new maximum likelihood training algorithm is proposed that compensates for weaknesses of the EM algorithm by using cross-validation likelihood in the expectation step to avoid overtraining. By using a set of sufficient statistics associated with a partitioning of the training data, as in parallel EM, the algorithm has the same order of computational requirements as the original EM algorithm. Analyses using a GMM with artificial data show the proposed algorithm is more robust to overtraining than the conventional EM algorithm. Large vocabulary recognition experiments on Mandarin broadcast news data show that the method makes better use of more parameters and gives lower recognition error rates than standard EM training.


International Conference on Acoustics, Speech, and Signal Processing | 2008

GMM and HMM training by aggregated EM algorithm with increased ensemble sizes for robust parameter estimation

Takahiro Shinozaki; Tatsuya Kawahara

To compensate for the vulnerability of the expectation-maximization (EM) algorithm to over-training and to improve model performance on new data, we have recently proposed the aggregated EM (Ag-EM) algorithm, which introduces a bagging-like approach into the framework of the EM algorithm, and have shown that it gives improvements similar to those of cross-validation EM (CV-EM) over conventional EM. However, a limitation of those experiments was that the number of models used in the aggregation operation, i.e., the ensemble size, was fixed to a small value. Here, we investigate the relationship between the ensemble size and performance, together with a theoretical discussion of the order of the computational cost. The algorithm is first analyzed using simulated data and then applied to large vocabulary speech recognition on oral presentations. Both experiments show that Ag-EM outperforms CV-EM when larger ensemble sizes are used.
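One way to picture the aggregation (a loose sketch of the idea only, not the paper's exact algorithm): average the E-step component posteriors over an ensemble of GMMs, e.g. ones trained on bootstrap resamples, so the accumulated statistics have lower variance as the ensemble size grows:

```python
import numpy as np

def aggregated_posteriors(x, models):
    """Average component posteriors of 1-D GMMs over an ensemble.
    `models` is a list of (weights, means, variances) triples."""
    acc = None
    for w, mu, var in models:
        p = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
             / np.sqrt(2 * np.pi * var))
        p /= p.sum(axis=1, keepdims=True)  # per-sample posteriors of this model
        acc = p if acc is None else acc + p
    return acc / len(models)
```

The trade-off the abstract discusses follows directly: a larger ensemble smooths the posteriors further but multiplies the E-step cost by the ensemble size.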


IEEE Journal of Selected Topics in Signal Processing | 2010

Gaussian Mixture Optimization Based on Efficient Cross-Validation

Takahiro Shinozaki; Sadaoki Furui; Tatsuya Kawahara

A Gaussian mixture optimization method is developed by using the cross-validation (CV) likelihood as an objective function instead of the conventional training-set likelihood. The optimization reduces the number of mixture components by selecting and merging pairs of Gaussians step by step according to the objective function, so as to remove redundant components and improve the generality of the model. The CV likelihood is more effective for avoiding over-fitting than the conventional likelihood, and it provides a termination criterion that does not rely on empirical thresholds. While the idea is simple, a naive implementation has an infeasible computational cost. To make the optimization practical, an efficient evaluation algorithm using sufficient statistics is proposed. In addition, aggregated CV (AgCV) is developed to further improve the generalization performance of CV. Large-vocabulary speech recognition experiments on oral presentations show that the proposed methods improve speech recognition performance with automatically determined model complexity. AgCV-based optimization is computationally more expensive than the CV-based method but gives better recognition performance.
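The basic merge operation, moment matching of a weighted Gaussian pair, can be sketched in one dimension (the CV-likelihood criterion that selects which pair to merge, and when to stop, is a separate component):

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-matched merge of two weighted 1-D Gaussians: the merged
    component preserves the pair's total weight, mean, and variance."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # Second moment of the pair minus the squared merged mean.
    var = (w1 * (var1 + mu1 ** 2) + w2 * (var2 + mu2 ** 2)) / w - mu ** 2
    return w, mu, var
```

Stepwise reduction repeatedly applies this merge to the pair whose removal least degrades the CV objective, stopping when no merge improves it.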

Collaboration


Dive into Takahiro Shinozaki's collaborations.

Top Co-Authors


Sadaoki Furui

University of Washington


Shinji Watanabe

Mitsubishi Electric Research Laboratories


Fuming Fang

Tokyo Institute of Technology


Makoto Oka

Tokyo Institute of Technology
