
Publication


Featured research published by Koichiro Yoshino.


Spoken Language Technology Workshop | 2016

The fifth dialog state tracking challenge

Seokhwan Kim; Luis Fernando D'Haro; Rafael E. Banchs; Jason D. Williams; Matthew Henderson; Koichiro Yoshino

Dialog state tracking - the process of updating the dialog state after each interaction with the user - is a key component of most dialog systems. Following a similar scheme to the fourth dialog state tracking challenge, this edition again focused on human-human dialogs, but introduced the task of cross-lingual adaptation of trackers. The challenge received a total of 32 entries from 9 research groups. In addition, several pilot track evaluations were also proposed receiving a total of 16 entries from 4 groups. In both cases, the results show that most of the groups were able to outperform the provided baselines for each task.


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2014

Information Navigation System Based on POMDP that Tracks User Focus

Koichiro Yoshino; Tatsuya Kawahara

We present a spoken dialogue system for navigating information (such as news articles), which can also engage in small talk. At its core is a partially observable Markov decision process (POMDP) that tracks the user's state and focus of attention. The input to the POMDP is provided by a spoken language understanding (SLU) component implemented with logistic regression (LR) and conditional random fields (CRFs). The POMDP selects one of six action classes; each action class is implemented with its own module.
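
The architecture described above can be sketched in a few lines: an SLU posterior updates a belief over user intentions, and a policy maps the belief (plus the detected user focus) to one of the action classes. The sketch below is a toy illustration only; the action names, the belief-update weight, and the hand-written policy table are placeholder assumptions, not the trained LR/CRF models or the reinforcement-learned policy from the paper.

```python
# Toy sketch of POMDP-style module selection driven by SLU output.
# The six action classes, the belief update, and the policy table below are
# illustrative placeholders, not the trained policy from the paper.
from dataclasses import dataclass

ACTIONS = ["topic_presentation", "story_telling", "question_answering",
           "proactive_presentation", "greeting", "keep_silence"]

@dataclass
class DialogueState:
    belief: dict          # P(user intention)
    user_focus: bool      # whether a user focus was detected (CRF in the paper)

def update_belief(prior: dict, slu_scores: dict, weight: float = 0.7) -> dict:
    """Blend the previous belief with new SLU posteriors (LR in the paper)."""
    posterior = {k: (1 - weight) * prior.get(k, 0.0) + weight * slu_scores.get(k, 0.0)
                 for k in set(prior) | set(slu_scores)}
    z = sum(posterior.values()) or 1.0
    return {k: v / z for k, v in posterior.items()}

def select_action(state: DialogueState) -> str:
    """Pick the action class for the most likely intention under a toy policy."""
    # Hypothetical per-intention preferences; a real policy would be learned by RL.
    policy = {
        "request_detail": "story_telling",
        "question": "question_answering",
        "silence": "proactive_presentation" if state.user_focus else "topic_presentation",
        "greeting": "greeting",
    }
    top_intention = max(state.belief, key=state.belief.get)
    return policy.get(top_intention, "keep_silence")

if __name__ == "__main__":
    state = DialogueState(belief={"question": 0.2, "silence": 0.8}, user_focus=True)
    state.belief = update_belief(state.belief, {"question": 0.9})  # SLU says "question"
    print(select_action(state))  # -> question_answering
```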


International Conference on Acoustics, Speech, and Signal Processing | 2013

Incorporating semantic information to selection of web texts for language model of spoken dialogue system

Koichiro Yoshino; Shinsuke Mori; Tatsuya Kawahara

A novel text selection approach for training a language model (LM) with Web texts is proposed for automatic speech recognition (ASR) in spoken dialogue systems. Compared to the conventional approach based on the perplexity criterion, the proposed approach introduces a semantic-level relevance measure with the back-end knowledge base used in the dialogue system. We focus on predicate-argument (P-A) structures characteristic of the domain in order to filter semantically relevant sentences. Several choices of statistical models and combination methods with the perplexity measure are investigated in this paper. Experimental evaluations in two different domains demonstrate the effectiveness and generality of the proposed approach. The combination method realizes significant improvement not only in ASR accuracy but also in semantic and dialogue-level accuracy.
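
As a rough illustration of the selection criterion, the sketch below ranks candidate Web sentences by interpolating a perplexity-based score with a predicate-argument overlap score. The P-A patterns, toy scoring functions, and interpolation weight are assumptions made for illustration; the paper's actual statistical models and combination methods are not reproduced here.

```python
# Sketch: rank Web sentences by combining an LM perplexity score with a
# semantic relevance score based on predicate-argument (P-A) patterns from a
# back-end knowledge base. The P-A set and in-domain LM scores are stand-ins.
import math

DOMAIN_PA_PATTERNS = {("serve", "dish"), ("order", "dish"), ("recommend", "restaurant")}

def perplexity_score(sentence_logprob: float, n_tokens: int) -> float:
    """Lower perplexity under an in-domain seed LM means a better match."""
    return math.exp(-sentence_logprob / max(n_tokens, 1))

def pa_relevance(pa_pairs: set) -> float:
    """Fraction of the sentence's predicate-argument pairs seen in the domain KB."""
    if not pa_pairs:
        return 0.0
    return len(pa_pairs & DOMAIN_PA_PATTERNS) / len(pa_pairs)

def selection_score(sentence_logprob: float, n_tokens: int,
                    pa_pairs: set, alpha: float = 0.5) -> float:
    """Interpolate the two criteria; alpha would be tuned on held-out data."""
    ppl = perplexity_score(sentence_logprob, n_tokens)
    ppl_term = 1.0 / (1.0 + ppl)          # map perplexity to (0, 1], higher is better
    return alpha * ppl_term + (1 - alpha) * pa_relevance(pa_pairs)

if __name__ == "__main__":
    candidates = [
        ("The chef serves a seasonal dish.", -18.0, 6, {("serve", "dish")}),
        ("Stock prices fell sharply today.", -30.0, 6, {("fall", "price")}),
    ]
    ranked = sorted(candidates, key=lambda c: selection_score(c[1], c[2], c[3]), reverse=True)
    for text, *_ in ranked:
        print(text)
```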


IWSDS | 2017

Active Learning for Example-Based Dialog Systems

Takuya Hiraoka; Graham Neubig; Koichiro Yoshino; Tomoki Toda; Satoshi Nakamura

While example-based dialog is a popular option for the construction of dialog systems, creating example bases for a specific task or domain requires significant human effort. To reduce this effort, we propose an active learning framework for constructing example-based dialog systems efficiently. Specifically, we propose two uncertainty sampling strategies for selecting inputs to present to human annotators, who create system responses for the selected inputs. We compare the performance of these strategies with a random selection strategy in a simulation-based evaluation on 6 different domains. Evaluation results show that the proposed strategies are good alternatives to random selection in domains where the complexity of system utterances is low.
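
A minimal sketch of uncertainty sampling in this setting, assuming a simple Jaccard similarity between a candidate input and the existing example base: inputs that the example base covers poorly are the ones sent to the annotator. The similarity measure and selection budget are illustrative stand-ins, not the specific strategies evaluated in the paper.

```python
# Sketch of uncertainty sampling for building an example base: pick the user
# inputs whose nearest existing example is least similar, and hand those to a
# human annotator. Similarity and budget here are illustrative choices.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def uncertainty(utterance: str, example_base: list) -> float:
    """Higher value = the example base covers this input poorly."""
    if not example_base:
        return 1.0
    best = max(jaccard(utterance, ex_input) for ex_input, _ in example_base)
    return 1.0 - best

def select_for_annotation(pool: list, example_base: list, budget: int) -> list:
    """Return the `budget` most uncertain inputs to show to the annotator."""
    return sorted(pool, key=lambda u: uncertainty(u, example_base), reverse=True)[:budget]

if __name__ == "__main__":
    examples = [("what is on the menu", "We serve pasta and pizza.")]
    pool = ["what is on the menu today", "can I book a table for two", "do you take credit cards"]
    print(select_for_annotation(pool, examples, budget=2))
```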


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Automatic speech recognition for mixed dialect utterances by mixing dialect language models

Naoki Hirayama; Koichiro Yoshino; Katsutoshi Itoyama; Shinsuke Mori; Hiroshi G. Okuno

This paper presents an automatic speech recognition (ASR) system that accepts a mixture of various kinds of dialects. The system recognizes dialect utterances on the basis of statistical simulation of vocabulary transformation and combinations of several dialect models. Previous dialect ASR systems were based on handcrafted dictionaries for several dialects, which involved costly processes. The proposed system statistically trains transformation rules between a common language and dialects, and simulates a dialect corpus for ASR on the basis of a machine translation technique. The rules are trained with small sets of parallel corpora to make up for the lack of linguistic resources on dialects. The proposed system also accepts mixed dialect utterances that contain a variety of vocabularies. In fact, spoken language is not a single dialect but a mixed dialect that is affected by the circumstances of speakers' backgrounds (e.g., the native dialects of their parents or where they live). We investigated two methods for combining several dialects appropriately for each speaker. The first was recognition with language models of mixed dialects, with automatically estimated weights that maximized the recognition likelihood. This method performed best, but it was computationally expensive because it conducted a grid search over combinations of dialect mixing proportions to maximize the recognition likelihood. The second was integration of the recognition results from each single-dialect language model. The improvements with this method were slightly smaller than those with the first, but its computational cost was low and it worked in real time on general workstations. Both methods achieved higher recognition accuracies for all speakers than the single-dialect models and the common-language model, and a suitable model could be chosen for ASR in consideration of computational cost and recognition accuracy.
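
The first combination method can be illustrated with a small sketch: interpolate the dialect language models and grid-search the mixing proportions that maximize the likelihood of the speaker's data. The toy unigram LMs, dialect names, and grid step below are assumptions for illustration, not the paper's full n-gram models.

```python
# Sketch of mixed-dialect LM combination: interpolate several dialect language
# models and grid-search the mixture weights that maximize the likelihood of a
# speaker's recognized text. Unigram "LMs" below are toy placeholders.
import math

DIALECT_LMS = {
    "kansai": {"ookini": 0.02,   "arigatou": 0.005, "the_rest": 0.975},
    "kantou": {"ookini": 0.0005, "arigatou": 0.03,  "the_rest": 0.9695},
}

def mixture_logprob(tokens: list, weights: dict) -> float:
    """Log-likelihood of tokens under the weighted mixture of dialect LMs."""
    logp = 0.0
    for tok in tokens:
        p = sum(w * DIALECT_LMS[d].get(tok, 1e-6) for d, w in weights.items())
        logp += math.log(p)
    return logp

def grid_search_weights(tokens: list, step: float = 0.1) -> dict:
    """Exhaustively try mixing proportions; expensive, as noted in the abstract."""
    best, best_lp = None, float("-inf")
    for i in range(int(1 / step) + 1):
        weights = {"kansai": i * step, "kantou": 1.0 - i * step}
        lp = mixture_logprob(tokens, weights)
        if lp > best_lp:
            best, best_lp = weights, lp
    return best

if __name__ == "__main__":
    utterance = ["ookini", "the_rest", "the_rest"]
    print(grid_search_weights(utterance))  # leans toward the Kansai model
```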


Computer Speech & Language | 2015

Conversational system for information navigation based on POMDP with user focus tracking

Koichiro Yoshino; Tatsuya Kawahara

Highlights: We address a spoken dialogue system which conducts information navigation. We formulate the problem of dialogue management as module selection with a POMDP. The reward function of the POMDP is defined by the quality of interaction. The POMDP tracks the user's focus of attention to take appropriate actions. The proposed model outperformed conventional systems without focus information.

We address a spoken dialogue system which conducts information navigation in the style of small talk. The system uses Web news articles as an information source, and the user can receive information about the news of the day through interaction. The goal and procedure of this kind of dialogue are not well defined. An empirical approach based on a partially observable Markov decision process (POMDP) has recently been widely used for dialogue management, but it assumes a definite task goal and information slots, which does not hold in our application. In this work, we formulate the problem of dialogue management as a selection of modules and optimize it with a POMDP by tracking the dialogue state and focus of attention. The POMDP-based dialogue manager receives a user intention that is classified by a spoken language understanding (SLU) component based on logistic regression (LR). The manager also receives a user focus that is detected by the SLU component based on conditional random fields (CRFs). These dialogue states are used for selecting appropriate modules by a policy function, which is optimized by reinforcement learning. The reward function is defined by the quality of interaction to encourage long interactions of information navigation with users. The module which responds to user queries is based on a similarity of predicate-argument (P-A) structures that are automatically defined from a domain corpus. This allows for flexible response generation even if the system cannot find information exactly matching the user query. The system also proactively presents information by following the user focus and retrieving a news article based on the similarity measure, even if the user does not make any utterance. Experimental evaluations with real dialogue sessions demonstrate that the proposed system outperformed the conventional rule-based system in terms of dialogue state tracking and action selection. The effect of focus detection in the POMDP framework is also confirmed.
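
The response module based on predicate-argument similarity can be sketched as a simple retrieval step: score candidate news sentences by the overlap between their P-A structures and the query's, so a near-match can still be returned when no exact match exists. The hand-written P-A tuples and the Dice-style overlap below are illustrative assumptions; the paper derives P-A structures automatically from a domain corpus.

```python
# Sketch of P-A-structure retrieval for the query-answering module: return the
# news sentence whose (predicate, argument) tuples best overlap the query's.
def pa_similarity(query_pa: set, sentence_pa: set) -> float:
    """Dice-style overlap between two sets of (predicate, argument) tuples."""
    if not query_pa or not sentence_pa:
        return 0.0
    return 2 * len(query_pa & sentence_pa) / (len(query_pa) + len(sentence_pa))

def retrieve(query_pa: set, articles: list) -> str:
    """Return the sentence whose P-A structures best match the query."""
    return max(articles, key=lambda a: pa_similarity(query_pa, a[1]))[0]

if __name__ == "__main__":
    query = {("win", "team"), ("win", "yesterday")}
    articles = [
        ("The home team won the final yesterday.", {("win", "team"), ("win", "final")}),
        ("The museum opens a new exhibition.",      {("open", "exhibition")}),
    ]
    print(retrieve(query, articles))  # -> the sports sentence, a near but inexact match
```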


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2016

Cultural communication idiosyncrasies in human-computer interaction

Juliana Miehle; Koichiro Yoshino; Louisa Pragst; Stefan Ultes; Satoshi Nakamura; Wolfgang Minker

Paper presented at the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, held 13-15 September 2016 in Los Angeles, USA.


Natural Language Dialog Systems and Intelligent Assistants | 2015

News Navigation System Based on Proactive Dialogue Strategy

Koichiro Yoshino; Tatsuya Kawahara

This paper addresses the concept of information navigation and a system that navigates news articles updated day by day. In information navigation, the system has a back-end knowledge base, and users can access its information through natural interaction. The system is composed of several modules that interact with users in different manners. Both the system and the user can take the initiative in dialogue, depending on how specifically the user interest is expressed. The system allows ambiguous user queries and proactively presents information related to the user interest by tracking the user focus. An experimental result shows that the proposed system, based on a partially observable Markov decision process and user focus tracking, can interact with users effectively by selecting the most appropriate dialogue modules.


Spoken Language Technology Workshop | 2016

Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds

Sakriani Sakti; Seiji Kawanishi; Graham Neubig; Koichiro Yoshino; Satoshi Nakamura

In speech interfaces, it is often necessary to understand the overall auditory environment: not only recognizing what is being said, but also being aware of the location or actions surrounding the utterance. However, automatic speech recognition (ASR) becomes difficult when recognizing speech mixed with environmental sounds. Standard solutions treat environmental sounds as noise and remove them to improve ASR performance. On the other hand, most studies on environmental sounds construct classifiers for environmental sounds only, without interference from spoken utterances. In reality, such separate situations almost never exist. This study addresses the problem of simultaneous recognition of speech and environmental sounds. In particular, we examine the possibility of using deep neural network (DNN) techniques to recognize speech and environmental sounds simultaneously and improve the accuracy of both tasks under respective noisy conditions. First, we investigate DNN architectures including two parallel single-task DNNs and a single multi-task DNN. However, we found direct multi-task learning of simultaneous speech and environmental sound recognition to be difficult. Therefore, we further propose a method that combines bottleneck features and sound-dependent i-vectors within this framework. Experimental evaluation results reveal that utilizing bottleneck features and i-vectors as the input to the DNNs helps to improve the accuracy of each recognition task.
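
The proposed input construction can be sketched as concatenating per-frame bottleneck features with an utterance-level, sound-dependent i-vector before a task-specific network. The dimensions, random weights, and single toy layer below are placeholders assumed for illustration, not the trained extractors or DNN topology from the paper.

```python
# Sketch of combining deep bottleneck features with a sound-dependent i-vector
# as DNN input: tile the utterance-level i-vector across frames and append it
# to each frame's bottleneck feature vector. Shapes and weights are toy values.
import numpy as np

rng = np.random.default_rng(0)

def combine_inputs(bottleneck: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Tile the utterance-level i-vector across frames and append it per frame."""
    n_frames = bottleneck.shape[0]
    tiled = np.repeat(ivector[None, :], n_frames, axis=0)
    return np.concatenate([bottleneck, tiled], axis=1)

def feed_forward(x: np.ndarray, n_out: int) -> np.ndarray:
    """One toy DNN layer (random weights) standing in for the trained network."""
    w = rng.standard_normal((x.shape[1], n_out)) * 0.1
    return np.maximum(x @ w, 0.0)                   # ReLU activation

if __name__ == "__main__":
    bottleneck = rng.standard_normal((200, 40))     # 200 frames x 40-dim BN features
    ivector = rng.standard_normal(100)              # 100-dim sound-dependent i-vector
    dnn_input = combine_inputs(bottleneck, ivector) # -> shape (200, 140)
    print(feed_forward(dnn_input, n_out=64).shape)  # (200, 64)
```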


Conference of the International Speech Communication Association | 2016

Unsupervised Joint Estimation of Grapheme-to-Phoneme Conversion Systems and Acoustic Model Adaptation for Non-Native Speech Recognition

Satoshi Tsujioka; Sakriani Sakti; Koichiro Yoshino; Graham Neubig; Satoshi Nakamura

Non-native speech differs significantly from native speech, often resulting in degraded performance of automatic speech recognition (ASR). Hand-crafted pronunciation lexicons used in standard ASR systems generally fail to cover non-native pronunciations, and designing new ones with linguistic experts is time-consuming and costly. In this work, we propose acoustic data-driven iterative pronunciation learning for non-native speech recognition, which automatically learns non-native pronunciations directly from speech using an iterative estimation procedure. Grapheme-to-phoneme (G2P) conversion is used to predict multiple candidate pronunciations for each word, the occurrence frequency of pronunciation variations is estimated from the acoustic data of non-native speakers, and these automatically estimated pronunciation variations are used to perform acoustic model adaptation. We investigate various cases, such as learning (1) without knowledge of non-native pronunciation and (2) with adaptation to the speaker's proficiency level. In experiments on speech from non-native speakers of various levels, the proposed method achieved an 8.9% average improvement in accuracy.
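
The iterative estimation procedure can be sketched as a loop over three steps: propose candidate pronunciations with G2P, re-estimate variant frequencies from non-native audio, and adapt the acoustic model with the updated lexicon. All helper functions below are stubs assumed for illustration; a real implementation would call an ASR toolkit's G2P, forced-alignment, and adaptation routines.

```python
# Sketch of iterative pronunciation learning: G2P proposes variants, their
# usage is re-estimated from non-native audio, and the acoustic model is
# adapted with the updated lexicon. All helpers are illustrative stubs.
def g2p_candidates(word: str, n_best: int = 3) -> list:
    """Stub: return n-best candidate pronunciations for a word."""
    return [f"{word}_pron{i}" for i in range(n_best)]

def align_and_count(lexicon: dict, audio: list) -> dict:
    """Stub: force-align audio against the lexicon and count variant usage."""
    return {word: {prons[0]: 0.7, prons[1]: 0.3} for word, prons in lexicon.items()}

def adapt_acoustic_model(model: str, lexicon: dict) -> str:
    """Stub: adapt the acoustic model with the current pronunciation weights."""
    return model + "+adapted"

def learn_pronunciations(words: list, audio: list, iterations: int = 3):
    model = "baseline_am"
    lexicon = {w: g2p_candidates(w) for w in words}
    for _ in range(iterations):
        counts = align_and_count(lexicon, audio)
        # Keep only variants actually observed in the non-native data.
        lexicon = {w: sorted(c, key=c.get, reverse=True) for w, c in counts.items()}
        model = adapt_acoustic_model(model, lexicon)
    return model, lexicon

if __name__ == "__main__":
    print(learn_pronunciations(["hello", "world"], audio=["utt1.wav", "utt2.wav"]))
```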

Collaboration


Dive into Koichiro Yoshino's collaborations.

Top Co-Authors

Satoshi Nakamura
Nara Institute of Science and Technology

Sakriani Sakti
Nara Institute of Science and Technology

Graham Neubig
Carnegie Mellon University

Yu Suzuki
Nara Institute of Science and Technology

Masahiro Mizukami
Nara Institute of Science and Technology

Nurul Lubis
Nara Institute of Science and Technology