Publication


Featured research published by Yotaro Kubo.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Real-time one-pass decoding with recurrent neural network language model for speech recognition

Takaaki Hori; Yotaro Kubo; Atsushi Nakamura

This paper proposes an efficient one-pass decoding method for real-time speech recognition employing a recurrent neural network language model (RNNLM). An RNNLM is an effective language model that yields a large gain in recognition accuracy when it is combined with a standard n-gram model. However, since every word probability distribution given by an RNNLM depends on the entire history from the beginning of the utterance, the search space in Viterbi decoding grows exponentially with the length of the recognition hypotheses, making computation prohibitively expensive. Therefore, an RNNLM is usually used for N-best rescoring or approximated by a back-off n-gram model. In this paper, we present another approach that enables one-pass Viterbi decoding with an RNNLM without approximation, where the RNNLM is represented as a prefix tree of possible word sequences, and only the part needed for decoding is generated on the fly and used to rescore each hypothesis with an on-the-fly composition technique we previously proposed. Experimental results on the MIT lecture transcription task show that the proposed method enables one-pass decoding with small overhead for the RNNLM and achieves slightly higher accuracy than 1000-best rescoring. Furthermore, it reduces the latency after the end of each utterance by a factor of 10 compared with two-pass decoding.
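
A minimal sketch of the core idea, not the authors' implementation: every partial hypothesis in the beam carries its own RNN language-model state, and the language model is queried on the fly only for prefixes that survive pruning. The vocabulary, layer sizes, and the toy lattice below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 8, 16                                  # toy vocabulary and hidden sizes (assumed)
E = rng.normal(scale=0.1, size=(V, H))        # word embeddings
W = rng.normal(scale=0.1, size=(H, H))        # recurrent weights
U = rng.normal(scale=0.1, size=(H, V))        # output projection

def rnnlm_step(state, word):
    """Advance the toy RNN LM by one word; return (new_state, log P(next word | history))."""
    state = np.tanh(E[word] + state @ W)
    logits = state @ U
    return state, logits - np.logaddexp.reduce(logits)

def one_pass_decode(word_lattice, beam=4):
    """Beam search over a toy lattice: each hypothesis carries its own LM state,
    so the RNNLM is expanded on the fly only for prefixes that survive pruning."""
    state0, logp0 = rnnlm_step(np.zeros(H), 0)          # consume a start-of-sentence token
    hyps = [([], state0, logp0, 0.0)]                   # (words, LM state, next-word logp, score)
    for frame_arcs in word_lattice:                     # frame_arcs: list of (word, acoustic score)
        expanded = []
        for words, state, logp, score in hyps:
            for word, acoustic in frame_arcs:
                new_state, new_logp = rnnlm_step(state, word)
                expanded.append((words + [word], new_state, new_logp,
                                 score + acoustic + logp[word]))
        hyps = sorted(expanded, key=lambda h: -h[3])[:beam]   # prune to the beam
    return hyps[0][0], hyps[0][3]

lattice = [[(1, -1.0), (2, -1.2)], [(3, -0.5), (4, -0.9)], [(5, -0.3)]]
print(one_pass_decode(lattice))
```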


Computer Speech & Language | 2013

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Marc Delcroix; Keisuke Kinoshita; Tomohiro Nakatani; Shoko Araki; Atsunori Ogawa; Takaaki Hori; Shinji Watanabe; Masakiyo Fujimoto; Takuya Yoshioka; Takanobu Oba; Yotaro Kubo; Mehrez Souden; Seong-Jun Hahm; Atsushi Nakamura

Research on noise robust speech recognition has mainly focused on relatively stationary noise, which may differ from the noise conditions found in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise, that is, spatial (directional), spectral, and temporal information. This is realized with a model-based speech enhancement pre-processor that consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variances of the Gaussians in the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of the speech and substantially improving the keyword recognition accuracy.
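
As a rough illustration of the stage ordering described above, and nothing more, the sketch below chains a stand-in multi-channel stage, a stand-in single-channel stage, and a stand-in recognizer; none of these placeholder components implement the paper's actual algorithms.

```python
import numpy as np

def spatial_spectral_separation(channels):
    """Placeholder for the multi-channel speech-noise separation stage:
    here just an average across microphones (a zero-delay delay-and-sum)."""
    return np.mean(channels, axis=0)

def temporal_enhancement(x, alpha=0.98):
    """Placeholder for the single-channel stage that uses long-term temporal
    structure: here a first-order smoother whose slow component is removed."""
    smoothed = np.empty_like(x)
    acc = 0.0
    for i, sample in enumerate(x):
        acc = alpha * acc + (1.0 - alpha) * sample
        smoothed[i] = sample - acc
    return smoothed

def adapted_recognizer(x):
    """Placeholder for the adapted acoustic model: reports a trivial statistic."""
    return float(np.mean(x ** 2))

channels = np.random.default_rng(1).normal(size=(4, 16000))   # 4 mics, 1 s at 16 kHz
enhanced = temporal_enhancement(spatial_spectral_separation(channels))
print(adapted_recognizer(enhanced))
```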


International Conference on Acoustics, Speech, and Signal Processing | 2013

Large vocabulary continuous speech recognition based on WFST structured classifiers and deep bottleneck features

Yotaro Kubo; Takaaki Hori; Atsushi Nakamura

Recently, structured classification approaches have attracted attention as a way to achieve unified modeling of the acoustic and linguistic aspects of speech recognizers. With these approaches, a unified representation is achieved by directly optimizing a score function that measures the correspondence between the input and output of the system. Since structured classifiers typically employ a linear score function, extracting expressive features from the input and output of the system is very important. On the other hand, the effectiveness of deep neural networks (DNNs) has been verified by several experiments, and it has been suggested that the outputs of hidden layers in DNNs are essential speech features that purely express phonetic information. In this paper, we propose a method for structured classification with DNN features. The proposed method expands conventional DNN-based acoustic models so that they optimize the weight terms of the arcs in a decoding WFST, which is constructed with the on-the-fly composition method. Since DNN-based features can be considered an enhancement of the input representation, the enhancement of the output representation based on the WFST arcs is expected to complement the DNN-based features. The proposed method achieved an 8% relative error reduction even compared with a strong DNN-based acoustic model.
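
To make the notion of deep bottleneck features concrete, here is a toy forward pass in which a narrow hidden layer's activations serve as the per-frame feature vector; the layer sizes are assumed and the weights are random, whereas a real system would first train the network on phonetic targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy layer sizes: spectral frame -> wide hidden -> narrow bottleneck -> phone classes.
D_IN, D_HID, D_BN, D_OUT = 40, 128, 26, 10
W1 = rng.normal(scale=0.1, size=(D_IN, D_HID))
W2 = rng.normal(scale=0.1, size=(D_HID, D_BN))
W3 = rng.normal(scale=0.1, size=(D_BN, D_OUT))

def bottleneck_features(frame):
    """Forward pass of a toy bottleneck network; the narrow layer's activations
    are taken as the per-frame feature vector fed to the structured classifier."""
    hidden = np.tanh(frame @ W1)
    bottleneck = np.tanh(hidden @ W2)        # the compact feature representation
    logits = bottleneck @ W3                 # phone scores would come from here
    return bottleneck, logits

features, logits = bottleneck_features(rng.normal(size=D_IN))
print(features.shape, logits.shape)          # (26,) features per frame, (10,) class scores
```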


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Structural Classification Methods Based on Weighted Finite-State Transducers for Automatic Speech Recognition

Yotaro Kubo; Shinji Watanabe; Takaaki Hori; Atsushi Nakamura

The potential of structural classification methods for automatic speech recognition (ASR) has been attracting the speech community, since such methods can realize unified modeling of the acoustic and linguistic aspects of recognizers. However, structural classification approaches involve a well-known tradeoff between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features considered when calculating the likelihood of each hypothesis must be restricted to the same form as conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible, since most decoding techniques only require that their likelihood functions be factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
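
A hedged sketch of one of the two classifiers compared in the paper, the structured perceptron, over hypotheses represented as sequences of arcs with sparse features; the toy data and feature indexing are invented, but the update rule (move the weights toward the oracle hypothesis and away from the current best) is the standard one.

```python
import numpy as np

def hyp_features(arcs, n_feats):
    """Sum sparse per-arc features; the score factorizes over decoder arcs,
    which is what lets a one-pass decoder evaluate it frame by frame."""
    phi = np.zeros(n_feats)
    for arc in arcs:
        for idx, val in arc:            # arc = [(feature_index, value), ...]
            phi[idx] += val
    return phi

def structured_perceptron(samples, n_feats, epochs=10):
    """Toy structured-perceptron training: each sample is (candidate_hyps, oracle_index),
    where every hypothesis is a list of arcs carrying sparse features."""
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for hyps, oracle in samples:
            scores = [w @ hyp_features(h, n_feats) for h in hyps]
            best = int(np.argmax(scores))
            if best != oracle:          # standard perceptron update toward the oracle
                w += hyp_features(hyps[oracle], n_feats) - hyp_features(hyps[best], n_feats)
    return w

# Two toy utterances, each with two competing arc sequences (an N-best stand-in).
samples = [
    ([[[(0, 1.0)], [(1, 1.0)]], [[(2, 1.0)], [(3, 1.0)]]], 0),
    ([[[(2, 1.0)]], [[(0, 1.0)], [(4, 1.0)]]], 1),
]
print(structured_perceptron(samples, n_feats=5))
```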


International Conference on Acoustics, Speech, and Signal Processing | 2011

Feature selection for log-linear acoustic models

Simon Wiesler; Alexander Richard; Yotaro Kubo; Ralf Schlüter; Hermann Ney

Log-linear acoustic models have been shown to be competitive with Gaussian mixture models in speech recognition. Their high training time can be reduced by feature selection. We compare a simple univariate feature selection algorithm with ReliefF, an efficient multivariate algorithm. An alternative to feature selection is ℓ1-regularized training, which leads to sparse models. We observe that this gives no speedup when sparse features are used, hence feature selection methods are preferable. For dense features, ℓ1-regularization can reduce training and recognition time. We generalize the well-known Rprop algorithm to the optimization of ℓ1-regularized functions. Experiments on the Wall Street Journal corpus show that a large number of sparse features can be discarded without loss of performance. Strong regularization leads to slight performance degradations, but can be useful on large tasks where training the full model is not tractable.
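
A small illustration of the two routes discussed above, with assumed toy data: a univariate relevance score per feature, and ℓ1-regularized training that drives most weights to exactly zero. The optimizer here is plain proximal gradient (ISTA) rather than the Rprop generalization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = rng.normal(size=5)                             # only 5 informative features
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)

def univariate_scores(X, y):
    """Simple univariate criterion: absolute correlation of each feature with the label."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

def l1_logistic(X, y, lam=0.05, lr=0.1, iters=500):
    """Proximal-gradient (ISTA) training of an L1-regularized logistic model."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-thresholding step
    return w

top10 = np.argsort(-univariate_scores(X, y))[:10]
w_l1 = l1_logistic(X, y)
print("univariate top-10:", sorted(top10))
print("nonzero after L1 :", np.flatnonzero(np.abs(w_l1) > 1e-3))
```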


IEEE Journal of Selected Topics in Signal Processing | 2010

A Sequential Pattern Classifier Based on Hidden Markov Kernel Machine and Its Application to Phoneme Classification

Yotaro Kubo; Shinji Watanabe; Atsushi Nakamura; Erik McDermott; Tetsunori Kobayashi

This paper describes a novel classifier for sequential data based on nonlinear classification derived from kernel methods. In the proposed method, kernel methods are used to enhance the emission probability density functions (pdfs) of hidden Markov models (HMMs). Because the emission pdfs enhanced by kernel methods provide sufficient nonlinear classification performance, mixture models such as Gaussian mixture models (GMMs), which can cause problems of overfitting and local optima, are not necessary in the proposed method. Unlike the methods used in earlier studies on sequential pattern classification with kernel methods, our method can be regarded as an extension of conventional HMMs, and therefore it can completely model the transitions of hidden states along with the observed vectors. Consequently, our method can be applied to many applications developed with conventional HMMs, especially speech recognition. In this paper, we carried out isolated phoneme classification as a preliminary experiment in order to evaluate the efficiency of the proposed sequential pattern classifier. We confirmed that the proposed method achieved steady improvements over conventional HMMs with Gaussian-mixture emission pdfs trained by the maximum likelihood and maximum mutual information procedures.
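
A minimal sketch, assuming toy support vectors and uniform kernel weights, of what replacing GMM emissions with kernel-machine scores looks like inside an otherwise standard Viterbi decoder; it is not the authors' trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, DIM = 3, 2

# Toy per-state "support" frames and weights, standing in for a trained kernel machine.
support = [rng.normal(loc=mu, size=(5, DIM)) for mu in (-2.0, 0.0, 2.0)]
alphas = [np.ones(5) / 5 for _ in range(N_STATES)]

def kernel_emission(x, state, gamma=0.5):
    """Emission score for one state: a weighted sum of RBF kernels against the
    state's support vectors, used in place of a GMM emission pdf."""
    d2 = np.sum((support[state] - x) ** 2, axis=1)
    return np.log(alphas[state] @ np.exp(-gamma * d2) + 1e-12)

def viterbi(frames, log_trans, log_init):
    """Standard Viterbi decoding over the kernel-scored emissions."""
    T = len(frames)
    delta = np.full((T, N_STATES), -np.inf)
    back = np.zeros((T, N_STATES), dtype=int)
    for s in range(N_STATES):
        delta[0, s] = log_init[s] + kernel_emission(frames[0], s)
    for t in range(1, T):
        for s in range(N_STATES):
            cand = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            delta[t, s] = cand[back[t, s]] + kernel_emission(frames[t], s)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

frames = np.concatenate([rng.normal(loc=m, size=(4, DIM)) for m in (-2.0, 0.0, 2.0)])
log_trans = np.log(np.full((N_STATES, N_STATES), 0.1) + 0.7 * np.eye(N_STATES))
print(viterbi(frames, log_trans, np.log(np.ones(N_STATES) / N_STATES)))
```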


International Conference on Acoustics, Speech, and Signal Processing | 2011

Subspace pursuit method for kernel-log-linear models

Yotaro Kubo; Simon Wiesler; Ralf Schlueter; Hermann Ney; Shinji Watanabe; Atsushi Nakamura; Tetsunori Kobayashi

This paper presents a novel method for reducing the dimensionality of kernel spaces. Recently, to maintain the convexity of training, log-linear models without mixtures have been used as emission probability density functions in hidden Markov models for automatic speech recognition. In that framework, nonlinearly transformed high-dimensional features are used to achieve nonlinear classification of the original observation vectors without using mixtures. In this paper, with the goal of using high-dimensional features in kernel spaces, the cutting-plane subspace pursuit method proposed for support vector machines is generalized and applied to log-linear models. The experimental results show that the proposed method achieves an efficient approximation of the feature space by using a limited number of basis vectors.
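
The following sketch illustrates the general idea of approximating a kernel feature space with a small set of basis vectors, using a greedy pivoted-Cholesky (Nyström-style) selection rather than the paper's cutting-plane subspace pursuit; the data and kernel parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

def rbf_kernel(A, B, gamma=0.1):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def greedy_basis(X, n_basis, gamma=0.1):
    """Greedy (pivoted-Cholesky style) selection of basis points: at every step
    pick the sample whose kernel behaviour is worst approximated so far."""
    n = len(X)
    K_diag = np.ones(n)                 # RBF kernel has k(x, x) = 1
    G = np.zeros((n, n_basis))          # partial Cholesky factors
    residual = K_diag.copy()
    chosen = []
    for j in range(n_basis):
        i = int(np.argmax(residual))    # worst-approximated sample becomes a basis vector
        chosen.append(i)
        k_i = rbf_kernel(X, X[i:i + 1], gamma)[:, 0]
        G[:, j] = (k_i - G[:, :j] @ G[i, :j]) / np.sqrt(residual[i])
        residual = np.maximum(K_diag - np.sum(G[:, :j + 1] ** 2, axis=1), 0.0)
    return chosen, G                    # G @ G.T approximates the full kernel matrix

chosen, G = greedy_basis(X, n_basis=15)
K = rbf_kernel(X, X)
print("relative approximation error:", np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))
```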


4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) | 2014

Spectrogram patch based acoustic event detection and classification in speech overlapping conditions

Miquel Espi; Masakiyo Fujimoto; Yotaro Kubo; Tomohiro Nakatani

Speech does not always contain all the information needed to understand a conversation scene. Non-speech events can reveal aspects of the scene that speakers miss or neglect to mention, and could further support speech enhancement and recognition systems with information about the surrounding noise. This paper focuses on the task of detecting and classifying acoustic events in a conversation scene where they often overlap with speech. State-of-the-art techniques are based on derived features (e.g., MFCCs or Mel-filter banks) that have successfully parameterized speech spectrograms but reduce both resolution and detail when targeting other kinds of events. In this paper, we propose a method that learns hidden features directly from spectrogram patches and integrates them within the deep neural network framework to detect and classify acoustic events. The result is a model that performs feature extraction and classification simultaneously. Experiments confirm that the proposed method outperforms deep neural networks with derived features as well as related work on the CHIL2007-AED task, showing that there is room for further improvement.
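
A rough sketch of the front end implied above, with assumed patch sizes and random (untrained) network weights: slice a log spectrogram into fixed-size time-frequency patches and feed each flattened patch to a small feed-forward network.

```python
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(0)
fs = 16000
x = rng.normal(size=fs)                                   # 1 s of noise standing in for audio

f, t, S = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
log_spec = np.log(S + 1e-10)

def extract_patches(spec, height=20, width=8, hop=4):
    """Slide a fixed-size window over the spectrogram and flatten each patch,
    so the classifier sees local time-frequency detail rather than pooled bands."""
    patches = []
    for fi in range(0, spec.shape[0] - height + 1, height):
        for ti in range(0, spec.shape[1] - width + 1, hop):
            patches.append(spec[fi:fi + height, ti:ti + width].ravel())
    return np.array(patches)

patches = extract_patches(log_spec)

# Toy one-hidden-layer network over patches (random weights; a real system trains these).
W1 = rng.normal(scale=0.01, size=(patches.shape[1], 64))
W2 = rng.normal(scale=0.01, size=(64, 5))                 # 5 hypothetical event classes
scores = np.tanh(patches @ W1) @ W2
print(patches.shape, scores.shape)
```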


International Conference on Acoustics, Speech, and Signal Processing | 2012

Decoding network optimization using minimum transition error training

Yotaro Kubo; Shinji Watanabe; Atsushi Nakamura

Discriminative optimization of decoding networks is important for minimizing speech recognition errors. Recently, several methods have been reported that optimize decoding networks by extending weighted finite-state transducer (WFST)-based decoding to a linear classification process. In this paper, we model decoding processes by using conditional random fields (CRFs). Since the maximum mutual information (MMI) training technique is straightforwardly applicable to CRF training, several sophisticated training methods proposed as variants of MMI can be incorporated in our decoding network optimization. This paper adapts the boosted MMI and differenced MMI methods to decoding network optimization so that state transition errors are minimized in WFST decoding. We evaluated the proposed methods by conducting large-vocabulary continuous speech recognition experiments and confirmed that the CRF-based framework and transition error minimization are effective for improving the accuracy of automatic speech recognizers.
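
A generic MMI-style weight update over an N-best stand-in for the lattice, with an optional error-dependent boosting term in the denominator; the feature vectors and error counts are invented, and this is not the paper's exact boosted or differenced MMI formulation.

```python
import numpy as np

def mmi_update(w, hyps, ref_idx, arc_errors=None, boost=0.0, lr=0.1):
    """One MMI-style gradient step on arc weights over competing hypotheses.
    hyps: list of feature vectors (one per hypothesis);
    arc_errors: per-hypothesis error counts, used only by the boosted variant."""
    scores = np.array([w @ phi for phi in hyps], dtype=float)
    if arc_errors is not None:
        scores += boost * np.asarray(arc_errors)      # boost competitors with more errors
    post = np.exp(scores - np.logaddexp.reduce(scores))
    expected = sum(p * phi for p, phi in zip(post, hyps))
    grad = hyps[ref_idx] - expected                   # numerator minus denominator statistics
    return w + lr * grad

n_feats = 6
hyps = [np.eye(n_feats)[[0, 1]].sum(0),               # reference arc sequence
        np.eye(n_feats)[[2, 3]].sum(0),               # competitor with 2 errors
        np.eye(n_feats)[[0, 4]].sum(0)]               # competitor with 1 error
w = np.zeros(n_feats)
for _ in range(20):
    w = mmi_update(w, hyps, ref_idx=0, arc_errors=[0, 2, 1], boost=0.5)
print(w)
```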


Speech Communication | 2011

Temporal AM-FM combination for robust speech recognition

Yotaro Kubo; Shigeki Okawa; Akira Kurematsu; Katsuhiko Shirai

A novel method for extracting features from the frequency modulation (FM) in speech signals is proposed for robust speech recognition. To exploit the advantage of multistream speech recognizers, each stream should compensate for the shortcomings of the other streams. In this light, FM features are promising as complements to amplitude modulation (AM) features. In order to extract effective features from FM patterns, we apply data-driven modulation analysis to the instantaneous frequency. By evaluating the frequency responses of the temporal filters obtained with the proposed method, we confirmed that the modulation observed around 4 Hz is important for discriminating FM patterns, as in the case of AM features. We evaluated the robustness of our method by performing noisy speech recognition experiments and confirmed that our FM features can improve the noise robustness of speech recognizers even when they are not combined with conventional AM and/or spectral envelope features. We also performed multistream speech recognition experiments. The results show that the combination of the conventional AM system and the proposed FM system reduced word errors by 43.6% at 10 dB SNR compared with the baseline MFCC system, and by 20.2% compared with the conventional AM system. We investigated the complementarity of the AM and FM features by performing speech recognition experiments in artificial noisy environments and found the FM features to be robust to wide-band noise, which severely degrades the performance of AM features. Further, we evaluated the effect of multicondition training: although the performance of the proposed combination method was degraded by multicondition training, the performance of the proposed FM method itself improved. Through this series of experiments, we confirmed that our FM features can be used as independent features as well as complementary features.
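
As a hedged illustration of the AM/FM decomposition itself (not the paper's data-driven filter design), the sketch below extracts instantaneous amplitude and frequency from an analytic signal and then band-passes the frame-averaged FM track around the 4 Hz modulation range; the test signal and filter settings are invented.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

fs = 16000
t = np.arange(fs) / fs
# Toy signal: a 1 kHz carrier whose frequency wobbles at a 4 Hz modulation rate.
x = np.sin(2 * np.pi * (1000 * t + 10 * np.sin(2 * np.pi * 4 * t)))

analytic = hilbert(x)
am = np.abs(analytic)                                   # instantaneous amplitude (AM)
phase = np.unwrap(np.angle(analytic))
fm = np.diff(phase) * fs / (2 * np.pi)                  # instantaneous frequency (FM), in Hz

# Average the FM track into 10 ms frames (100 Hz frame rate), as features usually are.
frame = 160
fm_frames = fm[: len(fm) // frame * frame].reshape(-1, frame).mean(axis=1)

# Emphasise the roughly 4 Hz modulation found to be most discriminative.
b, a = butter(2, [2.0, 8.0], btype="bandpass", fs=100)
fm_mod = filtfilt(b, a, fm_frames - fm_frames.mean())

print(am.shape, fm_frames.shape, np.round(fm_frames.mean(), 1))   # mean FM near 1000 Hz
```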

Collaboration


Dive into Yotaro Kubo's collaborations.

Top Co-Authors

Atsushi Nakamura (Nippon Telegraph and Telephone)
Takaaki Hori (Mitsubishi Electric Research Laboratories)
Shinji Watanabe (Mitsubishi Electric Research Laboratories)
Tomohiro Nakatani (Nippon Telegraph and Telephone)
Akira Kurematsu (University of Electro-Communications)
Marc Delcroix (Nippon Telegraph and Telephone)
Shigeki Okawa (Chiba Institute of Technology)