Publication


Featured research published by Ryo Masumura.


international conference on acoustics, speech, and signal processing | 2013

Use of latent words language models in ASR: A sampling-based implementation

Ryo Masumura; Hirokazu Masataki; Takanobu Oba; Osamu Yoshioka; Satoshi Takahashi

This paper applies the latent words language model (LWLM) to automatic speech recognition (ASR). LWLMs are trained while taking related words into account, i.e., grouping words that are similar in meaning and syntactic role. This means that, for example, if a technical word and a general word play a similar syntactic role, they are given similar probabilities. The LWLM is therefore expected to perform robustly over multiple domains. Furthermore, we can expect that interpolating the LWLM with a standard n-gram LM will be effective, since the two LMs have different learning criteria. In addition, this paper describes an approximation method of the LWLM for ASR, in which words are randomly sampled from the LWLM and a standard word n-gram language model is then trained on them. This enables one-pass decoding. Our experimental results show that the LWLM performs comparably to the hierarchical Pitman-Yor language model (HPYLM) in a target-domain task, and performs more robustly in out-of-domain tasks. Moreover, an interpolation with the HPYLM provides a lower word error rate in all tasks.
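
The sampling-based approximation and the interpolation with a standard n-gram LM can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: the sampler callable stands in for drawing a word sequence from a trained LWLM, and the interpolation weight is illustrative.

```python
# Minimal sketch of the sampling-based LWLM approximation described above.
# `sampler` is a hypothetical stand-in for drawing one word sequence from a
# trained LWLM; the n-gram counting and linear interpolation are standard.
from collections import Counter

def build_ngram_counts(sampler, num_samples, order=3):
    """Train a word n-gram model from sentences sampled off the LWLM."""
    counts, context_counts = Counter(), Counter()
    for _ in range(num_samples):
        words = ["<s>"] * (order - 1) + sampler() + ["</s>"]
        for i in range(order - 1, len(words)):
            ctx = tuple(words[i - order + 1:i])
            counts[ctx + (words[i],)] += 1
            context_counts[ctx] += 1
    return counts, context_counts

def interpolated_prob(word, ctx, lwlm_ngram, baseline, lam=0.5):
    """Linearly interpolate the sampled-LWLM n-gram with a standard n-gram."""
    counts, context_counts = lwlm_ngram
    p_lwlm = counts[ctx + (word,)] / max(context_counts[ctx], 1)
    return lam * p_lwlm + (1.0 - lam) * baseline(word, ctx)
```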


international conference on communications | 2009

Relevant document retrieval using a spoken document

Akinori Ito; Yu Uno; Ryo Masumura; Masashi Ito; Shozo Makino

In this paper, we propose a method of retrieving documents from the World Wide Web using a spoken document as a “key.” This method can be viewed as a speech version of ordinary relevant document retrieval, where a text document is used as the retrieval query. Basically, the retrieval is based on an automatic transcription of the spoken document produced by a speech recognizer. The difficulty of this task is that the automatic transcription contains many recognition errors, so we cannot trust keywords extracted from it with conventional methods such as tf·idf. To solve this problem, we developed three methods. The first measures the relevance of a keyword to the spoken document by using Web documents retrieved by a Web search engine with the keyword as the query. The second composes a query from the selected keywords so that words derived from misrecognitions are excluded and similar words are gathered. The third measures the relevance of a downloaded Web document to the spoken document. The experimental results suggest that the proposed methods are promising for retrieving documents relevant to a spoken document.
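
For reference, the conventional tf·idf keyword extraction that the abstract argues is unreliable on noisy transcripts looks roughly like this. A minimal sketch, not the authors' web-based relevance measure; `background_docs` is assumed to be a list of tokenized documents used for the idf statistics.

```python
# Baseline tf-idf keyword scoring over an ASR transcript; the document
# frequencies come from a hypothetical background collection.
import math
from collections import Counter

def tfidf_keywords(transcript_tokens, background_docs, top_k=10):
    """Rank candidate query keywords from a noisy transcript by tf-idf."""
    tf = Counter(transcript_tokens)
    n_docs = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))  # count each word once per document
    scores = {w: tf[w] * math.log((n_docs + 1) / (df[w] + 1)) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```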


international conference on acoustics, speech, and signal processing | 2017

Domain adaptation of DNN acoustic models using knowledge distillation

Taichi Asami; Ryo Masumura; Yoshikazu Yamaguchi; Hirokazu Masataki; Yushi Aono

Constructing deep neural network (DNN) acoustic models from limited training data is an important issue for the development of automatic speech recognition (ASR) applications that will be used in various application-specific acoustic environments. To this end, domain adaptation techniques that train a domain-matched model without overfitting by leveraging pre-constructed source models are widely used. In this paper, we propose a novel domain adaptation method for DNN acoustic models based on the knowledge distillation framework. Knowledge distillation, originally proposed for model compression, transfers the knowledge of a teacher model to a student model and improves the student model's generalizability by controlling the shape of the teacher model's posterior probability distribution. We apply this framework to model adaptation. Our domain adaptation method avoids overfitting of the adapted model trained on limited data by transferring the knowledge of the source model to the adapted model through distillation. Experiments show that the proposed method effectively avoids overfitting of convolutional neural network-based acoustic models and yields lower error rates than conventional adaptation methods.
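
The distillation objective described here is commonly implemented as an interpolation of a temperature-smoothed teacher-matching term and a hard-label term. The PyTorch sketch below shows that general form; the temperature and interpolation weight are illustrative choices, not values from the paper.

```python
# Generic knowledge-distillation loss: the student (adapted model) matches
# the temperature-softened posteriors of the teacher (source model) while
# also fitting the limited in-domain hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term transfers the teacher's knowledge; T^2 keeps gradients scaled.
    kd = F.kl_div(log_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd + (1.0 - alpha) * ce
```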


international conference on natural language processing | 2010

Document expansion using relevant web documents for spoken document retrieval

Ryo Masumura; Akinori Ito; Yu Uno; Masashi Ito; Shozo Makino

Recently, automatic indexing of spoken documents using a speech recognizer has attracted attention. However, index generation from an automatic transcription is problematic because the transcription contains many recognition errors and out-of-vocabulary (OOV) words. To solve this problem, we propose a document expansion method using Web documents. To obtain important keywords that are included in the spoken document but lost through recognition errors, we acquire Web documents relevant to the spoken document. Then, an index of the spoken document is generated by combining an index generated from the automatic transcription with one generated from the Web documents. We propose a method for retrieving the relevant Web documents, and the experimental results show that the retrieved documents contained many OOV words. Next, we propose a method for combining the recognized index and the Web index. The experimental results show that the index generated by document expansion was closer to an index built from the manual transcription than the index generated by the conventional method. Finally, we conducted a spoken document retrieval experiment, and the document-expansion-based index gave better retrieval precision than the conventional indexing method.
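
The combination step can be pictured as a weighted merge of two normalized term-weight vectors, so that OOV terms missing from the transcript can still enter the index via the Web documents. A minimal sketch under that assumption; the mixing weight `beta` is purely illustrative.

```python
# Sketch of merging a recognized index with a Web-document index.
from collections import Counter

def expand_index(transcript_terms, web_doc_terms, beta=0.7):
    """Combine transcript and Web term weights into one expanded index."""
    def normalize(counts):
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}
    rec = normalize(Counter(transcript_terms))
    web = normalize(Counter(web_doc_terms))
    vocab = set(rec) | set(web)  # Web terms can add OOV words lost by ASR
    return {w: beta * rec.get(w, 0.0) + (1 - beta) * web.get(w, 0.0)
            for w in vocab}
```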


conference of the international speech communication association | 2016

Language Identification Based on Generative Modeling of Posteriorgram Sequences Extracted from Frame-by-Frame DNNs and LSTM-RNNs.

Ryo Masumura; Taichi Asami; Hirokazu Masataki; Yushi Aono; Sumitaka Sakauchi

This paper aims to enhance spoken language identification methods based on direct discriminative modeling of language labels using deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). In conventional methods, frame-by-frame DNNs or LSTM-RNNs are used for utterance-level classification. Although they offer strong frame-level classification performance and real-time efficiency, they are not optimized for variable-length utterance-level classification, since classification is conducted by simply averaging frame-level prediction results. In addition, this simple classification methodology cannot fully utilize the combination of DNNs and LSTM-RNNs. To address these issues, our idea is to combine frame-by-frame DNNs and LSTM-RNNs with a classifier based on a sequential generative model. In the proposed method, we regard the posteriorgram sequences generated by a frame-by-frame classifier as feature sequences, and model them for each language using language modeling technologies. The generative-model-based classifier does not model an identification boundary, so we can flexibly deal with variable-length utterances without losing the conventional advantages. Furthermore, the proposed method supports the combination of DNNs and LSTM-RNNs through joint posteriorgram sequences, whose generative modeling can capture the differences between the two posteriorgram sequences. Experiments conducted on the GlobalPhone database demonstrate the proposed method's effectiveness.
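
One way to picture the generative-model-based classifier is to discretize each posteriorgram frame to its top unit and score the resulting token sequence with a per-language sequence model. The sketch below uses bigrams as a simplified stand-in for the paper's language modeling technologies; the discretization is an illustrative assumption, not the paper's exact formulation.

```python
# Assign an utterance to the language whose generative model best explains
# the posteriorgram sequence from the frame-by-frame network.
import numpy as np

def classify_language(posteriorgram, language_bigrams):
    """posteriorgram: (frames, units); language_bigrams: {lang: (U, U) probs}."""
    tokens = posteriorgram.argmax(axis=1)  # top unit per frame
    scores = {}
    for lang, bigram in language_bigrams.items():
        ll = sum(np.log(bigram[a, b] + 1e-12)
                 for a, b in zip(tokens[:-1], tokens[1:]))
        scores[lang] = ll / max(len(tokens) - 1, 1)  # length-normalized
    return max(scores, key=scores.get)
```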


Endoscopy International Open | 2018

New report preparation system for endoscopic procedures using speech recognition technology

Toshitatsu Takao; Ryo Masumura; Sumitaka Sakauchi; Yoshiko Ohara; Elif Bilgic; Eiji Umegaki; Hiromu Kutsumi; Takeshi Azuma

Background and study aims: We developed a new reporting system based on structured data entry, which selectively extracts only endoscopic findings from endoscopists' oral statements and automatically inputs them into the appropriate columns in real time during endoscopic procedures. Methods: We compared the time for endoscopic procedures and report preparation (ER time) using an esophagogastroduodenoscopy simulator in three groups: one preparing reports with a mouse after endoscopic procedures (CE group); a second preparing reports by voice alone during endoscopic procedures (SR group); and a third preparing reports by operating the system with a foot switch and inputting findings by voice during endoscopic procedures (SR + FS group). For the SR and SR + FS groups, we measured the recognition rates of the speech recognition system. Results: Mean ER times for cases with three findings each were 162, 130, and 119 seconds in the CE, SR, and SR + FS groups, respectively. Mean ER times for cases with six findings each were 220, 144, and 128 seconds, respectively. The times in the SR and SR + FS groups were significantly shorter than in the CE group (P < 0.017). The recognition rate in the SR group was 98.4% for cases with three findings each and 97.6% for cases with six findings each; the corresponding rates in the SR + FS group were 95.2% and 98.4%. Conclusion: Our reporting system allows an endoscopist to efficiently complete the report in real time during endoscopic procedures.


international conference on acoustics, speech, and signal processing | 2017

Parallel phonetically aware DNNs and LSTM-RNNs for frame-by-frame discriminative modeling of spoken language identification

Ryo Masumura; Taichi Asami; Hirokazu Masataki; Yushi Aono

We propose parallel phonetically aware deep neural networks (PPA-DNNs) and long short-term memory recurrent neural networks (PPA-LSTM-RNNs) to enhance frame-by-frame discriminative modeling for spoken language identification. The idea is inspired by traditional systems based on parallel phoneme recognition followed by language modeling (PPRLM). The proposed methods utilize multiple senone bottleneck features individually extracted from language-dependent senone-based DNNs in a frame-by-frame manner. The multiple senone bottleneck features give phonetic awareness to frame-by-frame DNNs and LSTM-RNNs without losing compatibility with real-time applications. In experiments, three senone-based DNNs are introduced to extract senone bottleneck features, and both their single and parallel use are examined. Furthermore, we also examine a combination of PPA-DNNs and PPA-LSTM-RNNs. The proposed methods' effectiveness is investigated by comparison with simple speech-aware modeling and traditional systems based on PPRLM.
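
The parallel phonetically aware front end amounts to concatenating frame-synchronous senone bottleneck streams with the acoustic features before the LID classifier. A minimal sketch, where the bottleneck extractors are hypothetical callables returning per-frame feature arrays.

```python
# Build phonetically aware inputs by stacking senone bottleneck streams
# from several language-dependent DNNs onto the acoustic features.
import numpy as np

def ppa_features(acoustic_feats, bottleneck_extractors):
    """acoustic_feats: (frames, dim); each extractor returns (frames, bn_dim)."""
    streams = [acoustic_feats]
    for extract in bottleneck_extractors:
        streams.append(extract(acoustic_feats))
    return np.concatenate(streams, axis=1)  # (frames, sum of dims)
```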


conference of the international speech communication association | 2016

Recurrent Out-of-Vocabulary Word Detection Using Distribution of Features.

Taichi Asami; Ryo Masumura; Yushi Aono; Koichi Shinoda

The repeated use of out-of-vocabulary (OOV) words in a spoken document seriously degrades a speech recognizer's performance. This paper provides a novel method for accurately detecting such recurrent OOV words. Standard OOV word detection methods classify each word segment as in-vocabulary (IV) or OOV. This word-by-word classification tends to be affected by sudden vocal irregularities in spontaneous speech, triggering false alarms. To avoid this sensitivity to irregularities, our proposal focuses on the consistency of the repeated occurrences of OOV words. The proposed method first detects recurrent segments, i.e., segments that contain the same word, in a spoken document by open-vocabulary spoken term discovery using a phoneme recognizer. If the recurrent segments are OOV words, the features for OOV detection in those segments should exhibit consistency. We capture this consistency by using the mean and variance (distribution) of features (DOF) derived from the recurrent segments, and use the DOF for IV/OOV classification. Experiments illustrate that the use of the DOF significantly improves recurrent OOV word detection performance.
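
The DOF summary itself is simple: for each discovered recurrent term, stack the per-occurrence feature vectors and summarize them by their mean and variance before classification. A sketch under that reading, with a hypothetical classifier object standing in for the IV/OOV model.

```python
# Distribution-of-features (DOF) summary for one recurrent term.
import numpy as np

def dof_vector(segment_features):
    """segment_features: (num_occurrences, feat_dim) for one recurrent term."""
    mean = segment_features.mean(axis=0)
    var = segment_features.var(axis=0)  # consistent occurrences -> low variance
    return np.concatenate([mean, var])

def is_oov(segment_features, classifier):
    """classifier is a hypothetical binary model (1 = OOV)."""
    return classifier.predict(dof_vector(segment_features)[None, :])[0] == 1
```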


empirical methods in natural language processing | 2015

Hierarchical Latent Words Language Models for Robust Modeling to Out-Of Domain Tasks

Ryo Masumura; Taichi Asami; Takanobu Oba; Hirokazu Masataki; Sumitaka Sakauchi; Akinori Ito

This paper focuses on language modeling with adequate robustness to support tasks in different domains. To this end, we propose a hierarchical latent words language model (h-LWLM). The proposed model can be regarded as a generalized form of the standard LWLM. The key advance is the introduction of a multiple latent variable space with a hierarchical structure, which can flexibly take into account linguistic phenomena not present in the training data. This paper details the model definition, a training method based on layer-wise inference, and a practical usage in natural language processing tasks via an approximation technique. Experiments on speech recognition show the effectiveness of the h-LWLM in out-of-domain tasks.
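
For orientation, the standard LWLM that the h-LWLM generalizes emits each surface word from a latent word whose sequence follows an n-gram process. The hierarchical form below is a hedged sketch of the abstract's stacked latent layers, not the paper's exact notation.

```latex
% Standard LWLM, in the n-gram-style formulation used for ASR: each surface
% word w_t is emitted from a latent word h_t, and the latent words follow
% an n-gram process.
P(w_1^T) = \sum_{h_1^T} \prod_{t=1}^{T} P(w_t \mid h_t)\, P(h_t \mid h_{t-n+1}^{t-1})

% h-LWLM (our reading of the abstract): L stacked latent layers, each
% generated from the layer above, with the n-gram process on the top layer.
P(w_1^T) = \sum_{h^{(1)}_{1:T}} \cdots \sum_{h^{(L)}_{1:T}} \prod_{t=1}^{T}
  P(w_t \mid h^{(1)}_t)
  \Bigg[ \prod_{l=1}^{L-1} P(h^{(l)}_t \mid h^{(l+1)}_t) \Bigg]
  P(h^{(L)}_t \mid h^{(L)}_{t-n+1:t-1})
```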


international conference on acoustics, speech, and signal processing | 2014

Role play dialogue topic model for language model adaptation in multi-party conversation speech recognition

Ryo Masumura; Takanobu Oba; Hirokazu Masataki; Osamu Yoshioka; Satoshi Takahashi

This paper introduces an unsupervised language model adaptation technique for multi-party conversation speech recognition. Topic models provide one of the most accurate frameworks for unsupervised language model adaptation since they can inject long-range topic information into language models. However, conventional topic models are not suitable for multi-party conversation because they assume that each speech set has its own topic. In a multi-party conversation, the speakers share the same conversation topic, and each speaker's utterances depend on both the topic and the speaker's role. Accordingly, this paper proposes a new concept, the “role play dialogue topic model,” to utilize multi-party conversation attributes. The proposed topic model shares the topic distribution among the speakers while considering both topic and speaker role. Adaptation based on the proposed topic model realizes a new framework that takes multiple recognition hypotheses for each speaker and simultaneously adapts a language model to each speaker role. We use a call center dialogue data set in speech recognition experiments to show the effectiveness of the proposed method.
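
The adaptation can be sketched as follows: a topic mixture shared by all speakers in the conversation combines with topic-and-role-dependent word distributions to give a role-specific adapted LM, which is then interpolated with the baseline. Variable names and the interpolation weight are illustrative, not taken from the paper.

```python
# Role-specific adapted unigram from a shared topic mixture; a minimal
# sketch of the model structure described above, not the authors' code.
import numpy as np

def adapted_unigram(theta, phi_topic_role, role):
    """theta: (K,) shared topic mixture; phi_topic_role: (K, R, V); role: int."""
    return theta @ phi_topic_role[:, role, :]  # (V,) adapted word probabilities

def interpolate_with_baseline(adapted, baseline, lam=0.3):
    """Mix the adapted unigram with the baseline LM's word probabilities."""
    return lam * adapted + (1.0 - lam) * baseline
```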

Collaboration


Dive into Ryo Masumura's collaborations.

Top Co-Authors

Taichi Asami

Tokyo Institute of Technology


Masashi Ito

Tohoku Institute of Technology


Shozo Makino

Tohoku Bunka Gakuen University


Yusuke Ijima

Tokyo Institute of Technology
