Publication


Featured research published by Takaaki Hori.


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition

Takaaki Hori; Chiori Hori; Yasuhiro Minami; Atsushi Nakamura

This paper proposes a novel one-pass search algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous-speech recognition. In the standard search method with on-the-fly composition, two or more WFSTs are composed during decoding, and a Viterbi search is performed over the composed search space. With the new method, the Viterbi search is performed over only the first of the two WFSTs; the second WFST is used solely to rescore the hypotheses generated during the search. Since this rescoring is very efficient, the total amount of computation required by the new method is almost the same as when using only the first WFST. In a 65k-word vocabulary spontaneous lecture speech transcription task, our proposed method significantly outperformed the standard search method. Furthermore, our method was faster than decoding with a single fully composed and optimized WFST, while using only 38% of the memory required for decoding with the single WFST. Finally, we achieved high-accuracy one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.
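
The core trick can be sketched in a few lines of Python under illustrative assumptions: the Viterbi search runs only over the first, small WFST, while the second WFST is consulted word by word to add a score correction to each hypothesis. The Hypothesis and SecondWFST names and the toy back-off handling below are not from the paper.

```python
# Minimal sketch of on-the-fly hypothesis rescoring with a second WFST.
# Names and the back-off handling are illustrative, not the authors' code.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: tuple          # word sequence decoded so far
    first_score: float    # Viterbi score from the first (small) WFST
    g_state: int          # current state in the second (rescoring) WFST
    rescore: float        # accumulated correction from the second WFST

class SecondWFST:
    """Toy deterministic WFST: (state, word) -> (next_state, weight)."""
    def __init__(self, arcs, start=0):
        self.arcs = arcs
        self.start = start

    def step(self, state, word):
        # Unknown arcs get a fixed penalty and stay in place (toy back-off).
        return self.arcs.get((state, word), (state, 10.0))

def extend(hyp, word, first_arc_weight, g):
    """Extend a hypothesis by one word: add the first-WFST arc weight and the
    on-the-fly correction from the second WFST."""
    next_state, w = g.step(hyp.g_state, word)
    return Hypothesis(hyp.words + (word,),
                      hyp.first_score + first_arc_weight,
                      next_state,
                      hyp.rescore + w)

def total_score(hyp):
    # Combined score used for pruning/ranking during the one-pass search.
    return hyp.first_score + hyp.rescore

# Usage: rescore a partial hypothesis "the cat" with a tiny second WFST.
g = SecondWFST({(0, "the"): (1, 0.5), (1, "cat"): (2, 0.7)})
h = Hypothesis((), 0.0, g.start, 0.0)
h = extend(h, "the", 2.0, g)
h = extend(h, "cat", 3.0, g)
print(total_score(h))  # 2.0 + 3.0 + 0.5 + 0.7 = 6.2
```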


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera

Takaaki Hori; Shoko Araki; Takuya Yoshioka; Masakiyo Fujimoto; Shinji Watanabe; Takanobu Oba; Atsunori Ogawa; Kazuhiro Otsuka; Dan Mikami; Keisuke Kinoshita; Tomohiro Nakatani; Atsushi Nakamura; Junji Yamato

This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to recognize automatically “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, the overlapping speech signals are enhanced and separated into individual speaker channels. The utterances are then sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques for achieving low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition

Takaaki Hori; Zhuo Chen; Hakan Erdogan; John R. Hershey; Jonathan Le Roux; Vikramjit Mitra; Shinji Watanabe

This paper introduces the MERL/SRI system designed for the 3rd CHiME speech separation and recognition challenge (CHiME-3). Our proposed system takes advantage of recurrent neural networks (RNNs) throughout the model, from front-end speech enhancement to language modeling. Two different types of beamforming are used to combine multi-microphone signals into a single higher-quality signal. The beamformed signal is further processed by a single-channel bi-directional long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficient (MFCC) features. In addition, two proposed noise-robust feature extraction methods are applied to the beamformed signal. The features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes data augmentation and speaker adaptive training, whereas at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduced the word error rate (WER). Combining hypotheses from different robust-feature systems ultimately achieved 9.10% WER on the real test data, a 72.4% relative reduction from the 32.99% WER baseline.
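
As a quick sanity check of the figures quoted above, the relative WER reduction follows directly from the two error rates:

```python
# Check the reported numbers: 32.99% baseline WER vs. 9.10% final WER.
baseline, final = 32.99, 9.10
relative_reduction = (baseline - final) / baseline
print(f"{relative_reduction:.1%}")  # ~72.4%, matching the figure quoted above
```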


International Conference on Acoustics, Speech, and Signal Processing | 2017

Joint CTC-attention based end-to-end speech recognition using multi-task learning

Suyoun Kim; Takaaki Hori; Shinji Watanabe

Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve performance over the other main end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial stage with long input sequences, because it is too flexible to predict proper alignments in such cases without the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. Experiments on the WSJ and CHiME-4 tasks demonstrate its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4–14.6% relative improvements in Character Error Rate (CER).
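
A minimal sketch of the multi-task objective is given below; the interpolation weight `lam` and the function name are illustrative assumptions, since the abstract only states that the CTC and attention objectives are trained jointly.

```python
# Sketch of the multi-task objective: an interpolation of the CTC and
# attention (cross-entropy) losses. The weight `lam` is illustrative.
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.2):
    """Weighted combination of the two end-to-end objectives.

    lam close to 0 relies mostly on the attention decoder; lam close to 1
    relies mostly on CTC's monotonic left-to-right alignment constraint.
    """
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# Usage with per-batch loss values produced elsewhere in a training loop:
print(joint_ctc_attention_loss(ctc_loss=45.2, attention_loss=30.8, lam=0.2))
```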


International Conference on Acoustics, Speech, and Signal Processing | 2003

Deriving disambiguous queries in a spoken interactive ODQA system

Chiori Hori; Takaaki Hori; Hideki Isozaki; Eisaku Maeda; Shigeru Katagiri; Sadaoki Furui

Recently, open-domain question answering (ODQA) systems that extract an exact answer from large text corpora based on text input have been intensively investigated. However, the information in the first question input by a user is usually not enough to yield the desired answer, so interaction is needed to collect the additional information required to complete the QA task. This paper proposes an interactive approach for spoken ODQA systems. When the reliabilities of the answer hypotheses obtained by an ODQA system are low, the system automatically derives disambiguous queries (DQ) that draw out additional information. The information elicited by the DQ should help to distinguish the exact answer and to compensate for information lost through recognition errors. In our spoken interactive ODQA system, SPIQA, spoken questions are recognized by an ASR system, and DQ are automatically generated to disambiguate the transcribed questions. We confirmed the appropriateness of the derived DQ by comparing them with manually prepared ones.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Context adaptive deep neural networks for fast acoustic model adaptation

Marc Delcroix; Keisuke Kinoshita; Takaaki Hori; Tomohiro Nakatani

Deep neural networks (DNNs) are widely used for acoustic modeling in automatic speech recognition (ASR), since they greatly outperform legacy Gaussian mixture model-based systems. However, the levels of performance achieved by current DNN-based systems remain far too low in many tasks, e.g. when the training and testing acoustic contexts differ due to ambient noise, reverberation or speaker variability. Consequently, research on DNN adaptation has recently attracted much interest. In this paper, we present a novel approach for the fast adaptation of a DNN-based acoustic model to the acoustic context. We introduce a context adaptive DNN with one or several layers depending on external factors that represent the acoustic conditions. This is realized by introducing a factorized layer that uses a different set of parameters to process each class of factors. The output of the factorized layer is then obtained by weighted averaging over the contribution of the different factor classes, given posteriors over the factor classes. This paper introduces the concept of context adaptive DNN and describes preliminary experiments with the TIMIT phoneme recognition task showing consistent improvement with the proposed approach.
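
The factorized layer described above can be sketched in a few lines of numpy: each factor class has its own affine transform, and the layer output is their average weighted by the factor-class posteriors. Shapes and variable names below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a factorized, context-adaptive layer: per-class parameters are
# combined by weighted averaging with posteriors over the factor classes.
import numpy as np

def context_adaptive_layer(h, weights, biases, factor_posteriors):
    """h: (d_in,) hidden activation from the previous layer.
    weights: (K, d_out, d_in) one weight matrix per factor class.
    biases: (K, d_out) one bias vector per factor class.
    factor_posteriors: (K,) posteriors over the K acoustic-context classes.
    Returns the (d_out,) averaged layer output (before the nonlinearity)."""
    per_class = np.einsum('kod,d->ko', weights, h) + biases   # (K, d_out)
    return factor_posteriors @ per_class                      # (d_out,)

# Usage with random toy parameters and K = 3 context classes:
rng = np.random.default_rng(0)
K, d_in, d_out = 3, 5, 4
out = context_adaptive_layer(rng.normal(size=d_in),
                             rng.normal(size=(K, d_out, d_in)),
                             rng.normal(size=(K, d_out)),
                             np.array([0.7, 0.2, 0.1]))
print(out.shape)  # (4,)
```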


Computer Speech & Language | 2011

Topic tracking language model for speech recognition

Shinji Watanabe; Tomoharu Iwata; Takaaki Hori; Atsushi Sako; Yasuo Ariki

In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that incrementally track the changing environment have attracted attention. This paper proposes a topic tracking language model that adaptively tracks changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and the Corpus of Spontaneous Japanese in speech recognition experiments and show the effectiveness of the proposed method.
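
One hedged way to read the on-line adaptation idea is as a mixture of topic-dependent unigram models whose weights are re-estimated from the most recent text and smoothed with the previous estimate. The sketch below uses made-up names and a simple decay rule for illustration; it is not the paper's actual probabilistic model.

```python
# Toy on-line topic tracking for a unigram mixture language model.
def update_topic_weights(prev_weights, recent_counts, topic_unigrams, decay=0.8):
    """prev_weights: dict topic -> weight from the previous step.
    recent_counts: dict word -> count in the newest chunk of transcribed text.
    topic_unigrams: dict topic -> dict word -> probability.
    Returns smoothed topic weights for the next chunk."""
    scores = {}
    for topic, unigram in topic_unigrams.items():
        like = 1.0
        for w, c in recent_counts.items():
            like *= unigram.get(w, 1e-6) ** c        # likelihood of new text
        scores[topic] = prev_weights[topic] * like
    z = sum(scores.values())
    posteriors = {t: s / z for t, s in scores.items()}
    # Smooth with the previous estimate so topics change gradually.
    return {t: decay * prev_weights[t] + (1 - decay) * posteriors[t]
            for t in prev_weights}

def word_probability(word, weights, topic_unigrams):
    # Adapted unigram probability: mixture over topics with the tracked weights.
    return sum(weights[t] * topic_unigrams[t].get(word, 1e-6)
               for t in topic_unigrams)

# Usage: after observing tech-heavy text, the "tech" topic gains weight.
topics = {"sports": {"game": 0.1, "team": 0.1}, "tech": {"code": 0.1, "model": 0.1}}
w = update_topic_weights({"sports": 0.5, "tech": 0.5}, {"model": 2, "code": 1}, topics)
print(w, word_probability("model", w, topics))
```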


International Conference on Acoustics, Speech, and Signal Processing | 2010

A comparative study on methods of weighted language model training for reranking LVCSR N-best hypotheses

Takanobu Oba; Takaaki Hori; Atsushi Nakamura

This paper focuses on discriminative n-gram language models for a large-vocabulary speech recognition task. Specifically, we compare three training methods: Reranking Boosting (ReBst), Minimum Error Rate Training (MERT), and the weighted global conditional log-linear model (W-GCLM), which is proposed in this paper. All three have a mechanism for handling sample weights, which help provide an accurate model and act as impact factors of hypotheses during training. We discuss the relationship between the three methods by comparing their loss functions, and we compare them experimentally by reranking N-best hypotheses under several conditions. We show that MERT and W-GCLM are different extensions of ReBst with different respective advantages. Our experimental results reveal that W-GCLM outperforms ReBst, and that whether MERT or W-GCLM is superior depends on the training and test conditions.
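
All three methods are used to rerank N-best hypotheses; a minimal, hypothetical sketch of the reranking step with a log-linear score is shown below. The feature names and parameter values are invented for illustration, and training of the parameters (where the sample weights enter) is outside the sketch.

```python
# Toy N-best reranking with a log-linear model: each hypothesis is scored by a
# dot product between a parameter vector and its feature vector.
def rerank_nbest(nbest, params):
    """nbest: list of (hypothesis_text, feature_dict); params: feature -> weight.
    Returns the hypothesis with the highest log-linear score."""
    def score(features):
        return sum(params.get(f, 0.0) * v for f, v in features.items())
    return max(nbest, key=lambda item: score(item[1]))

# Usage: features might include the ASR score and n-gram indicator features.
nbest = [
    ("i want to recognize speech", {"asr_score": -12.3, "ng:recognize speech": 1}),
    ("i want to wreck a nice beach", {"asr_score": -12.1, "ng:nice beach": 1}),
]
params = {"asr_score": 1.0, "ng:recognize speech": 0.5, "ng:nice beach": -0.5}
print(rerank_nbest(nbest, params)[0])  # "i want to recognize speech"
```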


International Conference on Acoustics, Speech, and Signal Processing | 2014

Real-time one-pass decoding with recurrent neural network language model for speech recognition

Takaaki Hori; Yotaro Kubo; Atsushi Nakamura

This paper proposes an efficient one-pass decoding method for real-time speech recognition employing a recurrent neural network language model (RNNLM). An RNNLM is an effective language model that yields a large gain in recognition accuracy when it is combined with a standard n-gram model. However, since every word probability distribution based on an RNNLM depends on the entire history from the beginning of the speech, the search space in Viterbi decoding grows exponentially with the length of the recognition hypotheses, making computation prohibitively expensive. Therefore, an RNNLM is usually used for N-best rescoring or approximated by a back-off n-gram model. In this paper, we present another approach that enables one-pass Viterbi decoding with an RNNLM without approximation, where the RNNLM is represented as a prefix tree of possible word sequences, and only the part needed for decoding is generated on the fly and used to rescore each hypothesis with an on-the-fly composition technique we previously proposed. Experimental results on the MIT lecture transcription task show that our proposed method enables one-pass decoding with small overhead for the RNNLM and achieves slightly higher accuracy than 1000-best rescoring. Furthermore, compared with two-pass decoding, it reduces the latency from the end of each utterance by a factor of 10.
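
The prefix-tree idea can be illustrated with a small cache keyed by word histories: hypotheses that share a prefix reuse one cached RNN state, and new states are created only for prefixes the decoder actually reaches. The `rnn_step` callable below is a stand-in for a real recurrent LM, so names and details are assumptions.

```python
# Toy on-demand prefix tree of RNNLM states for one-pass rescoring.
class RNNLMPrefixTree:
    def __init__(self, rnn_step, initial_state):
        self.rnn_step = rnn_step          # (state, word) -> (new_state, logprob)
        self.cache = {(): (initial_state, 0.0)}

    def score(self, prefix):
        """Return (state, cumulative logprob) for a word-sequence prefix,
        expanding and caching any missing ancestors on the fly."""
        prefix = tuple(prefix)
        if prefix not in self.cache:
            parent_state, parent_lp = self.score(prefix[:-1])
            state, lp = self.rnn_step(parent_state, prefix[-1])
            self.cache[prefix] = (state, parent_lp + lp)
        return self.cache[prefix]

# Usage with a toy stand-in for the RNN: the "state" is just the history length.
tree = RNNLMPrefixTree(lambda s, w: (s + 1, -1.0), initial_state=0)
print(tree.score(("the", "cat", "sat")))   # (3, -3.0)
print(tree.score(("the", "cat", "ran")))   # reuses the cached ("the", "cat") state
```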


Computer Speech & Language | 2013

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Marc Delcroix; Keisuke Kinoshita; Tomohiro Nakatani; Shoko Araki; Atsunori Ogawa; Takaaki Hori; Shinji Watanabe; Masakiyo Fujimoto; Takuya Yoshioka; Takanobu Oba; Yotaro Kubo; Mehrez Souden; Seong-Jun Hahm; Atsushi Nakamura

Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise, that is, spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variances of the Gaussians in the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of the speech and substantially improving the keyword recognition accuracy.

Collaboration


Dive into Takaaki Hori's collaborations.

Top Co-Authors

Atsushi Nakamura
Nippon Telegraph and Telephone

Shinji Watanabe
Mitsubishi Electric Research Laboratories

John R. Hershey
Mitsubishi Electric Research Laboratories

Tomohiro Nakatani
Nippon Telegraph and Telephone

Chiori Hori
Mitsubishi Electric Research Laboratories

Shoko Araki
Nippon Telegraph and Telephone