Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Chiori Hori is active.

Publication


Featured research published by Chiori Hori.


international conference on acoustics, speech, and signal processing | 2009

Statistical dialog management applied to WFST-based dialog systems

Chiori Hori; Kiyonori Ohtake; Teruhisa Misu; Hideki Kashioka; Satoshi Nakamura

We have proposed an expandable dialog scenario description and a platform for managing dialog systems using a weighted finite-state transducer (WFST), in which user concept tags and system action tags are the input and output of the transducer, respectively. In this paper, we apply this framework to statistical dialog management, in which a dialog strategy is acquired from a corpus of human-to-human conversations for hotel reservation. A scenario WFST for dialog management was automatically created from an N-gram model of the tag sequences annotated in the corpus with the Interchange Format (IF). Additionally, a word-to-concept WFST for spoken language understanding (SLU) was obtained from the same corpus. The acquired scenario WFST and SLU WFST were composed together and then optimized. We evaluated the proposed WFST-based statistical dialog management in terms of the correctness of the detected next system actions, and confirmed that a dialog scenario automatically acquired from a corpus can manage dialogs reasonably well on the WFST-based dialog management platform.
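A minimal sketch of the pipeline described above, using plain Python dictionaries as stand-ins for the SLU and scenario WFSTs; the concept and action tags, counts, and word patterns are hypothetical, and the actual system composes and optimizes the transducers rather than chaining lookups.

```python
# Toy sketch of the SLU -> scenario pipeline (hypothetical tags and data;
# plain dictionaries stand in for the SLU and scenario WFSTs).

# SLU "transducer": user word sequences -> concept tags.
slu = {
    ("i", "need", "a", "room"): "request:room",
    ("for", "two", "nights"): "give:stay-length",
}

# Scenario "transducer": counts of (previous system action, user concept) -> next action,
# as would be estimated from an annotated human-to-human corpus.
scenario_bigrams = {
    ("greet", "request:room"): {"ask:date": 8, "ask:room-type": 2},
    ("ask:date", "give:stay-length"): {"confirm:stay-length": 9, "ask:date": 1},
}

def next_action(prev_action, user_words):
    """Map user words to a concept tag, then pick the most likely next system action."""
    concept = slu.get(tuple(user_words))
    if concept is None:
        return "ask:rephrase"
    counts = scenario_bigrams.get((prev_action, concept), {})
    if not counts:
        return "ask:rephrase"
    return max(counts, key=counts.get)

print(next_action("greet", ["i", "need", "a", "room"]))    # ask:date
print(next_action("ask:date", ["for", "two", "nights"]))   # confirm:stay-length
```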


ACM Transactions on Speech and Language Processing | 2011

Modeling spoken decision support dialogue and optimization of its dialogue strategy

Teruhisa Misu; Komei Sugiura; Tatsuya Kawahara; Kiyonori Ohtake; Chiori Hori; Hideki Kashioka; Hisashi Kawai; Satoshi Nakamura

This article presents a user model for user simulation and a system state representation for spoken decision support dialogue systems. When selecting from a group of alternatives, users apply different decision-making criteria with different priorities. At the beginning of the dialogue, however, users often do not have a definite goal or criteria they value; they may learn about new features while interacting with the system and accordingly create new criteria. In this article, we present a user model and dialogue state representation that accommodate these patterns by considering the user's knowledge and preferences. To estimate the parameters used in the user model, we implemented a trial sightseeing guidance system, collected dialogue data, and trained a user simulator. Since the user parameters are not observable by the system, the dialogue is modeled as a partially observable Markov decision process (POMDP), and a dialogue state representation is introduced based on this model. We then optimized the dialogue strategy so that users can make better choices. The dialogue strategy is evaluated using a user simulator trained on a large number of dialogues collected with the trial dialogue system.
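Since the abstract models the dialogue as a POMDP, a generic belief-update step may help make the idea concrete. The sketch below is the standard POMDP filtering equation, not the paper's specific user or dialogue-state model; the states, actions, and probabilities are invented for illustration.

```python
# Generic POMDP belief update (standard filtering step, not the paper's dialogue-state model):
# b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).

def update_belief(belief, action, observation, T, O):
    """belief: {state: prob}; T[(s, a)] -> {s': prob}; O[(s', a)] -> {o: prob}."""
    states = {s_next for probs in T.values() for s_next in probs}
    new_belief = {}
    for s_next in states:
        predicted = sum(p * T.get((s, action), {}).get(s_next, 0.0) for s, p in belief.items())
        new_belief[s_next] = O.get((s_next, action), {}).get(observation, 0.0) * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else new_belief

# Toy example: tracking whether the user already knows about a sightseeing feature.
T = {("unaware", "recommend"): {"aware": 0.8, "unaware": 0.2},
     ("aware", "recommend"): {"aware": 1.0}}
O = {("aware", "recommend"): {"asks_detail": 0.7, "silent": 0.3},
     ("unaware", "recommend"): {"asks_detail": 0.1, "silent": 0.9}}
print(update_belief({"unaware": 0.9, "aware": 0.1}, "recommend", "asks_detail", T, O))
```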


international conference on acoustics, speech, and signal processing | 2012

A comparison of dynamic WFST decoding approaches

Paul R. Dixon; Chiori Hori; Hideki Kashioka

In this paper we compare lookahead composition and on-the-fly hypothesis rescoring using a common decoder. Results on a large-vocabulary speech recognition task illustrate the differences in the behaviour of these algorithms in terms of error rate, real-time factor, memory usage, and internal statistics of the decoder. The evaluations were performed with the decoder operating at either the state or the arc level. The results show that the dynamic approaches also work well at the state level, even though the dynamic construction cost is greater.
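As a rough illustration of one of the two approaches compared, the sketch below shows the idea behind on-the-fly hypothesis rescoring: the decoder searches with a cheap language model compiled into the graph and corrects each hypothesis score with a stronger model as words are hypothesized. The scores and vocabulary are made up, and lookahead composition is not shown.

```python
# Toy illustration of on-the-fly hypothesis rescoring (scores are invented for the example).
unigram = {"hotel": -2.0, "total": -3.0}                   # cheap LM compiled into the search graph
bigram = {("the", "hotel"): -0.5, ("the", "total"): -4.0}  # stronger LM applied on the fly

def rescored(graph_logp, prev_word, word):
    """graph_logp already includes the cheap unigram score; swap it for the bigram score."""
    cheap = unigram.get(word, -10.0)
    strong = bigram.get((prev_word, word), cheap)  # back off to the cheap score if unseen
    return graph_logp - cheap + strong

# Two acoustically similar continuations of "the ...": rescoring prefers "hotel" in context.
print(rescored(graph_logp=-3.2, prev_word="the", word="hotel"))  # -1.7
print(rescored(graph_logp=-4.1, prev_word="the", word="total"))  # -5.1
```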


international conference on robotics and automation | 2014

Non-monologue HMM-based speech synthesis for service robots: A cloud robotics approach

Komei Sugiura; Yoshinori Shiga; Hisashi Kawai; Teruhisa Misu; Chiori Hori

Robot utterances generally sound monotonous, unnatural, and unfriendly because their Text-to-Speech (TTS) systems are optimized for text reading rather than for communication. Here we present non-monologue speech synthesis for robots. We collected a speech corpus in a non-monologue style in which two professional voice talents read scripted dialogues. Hidden Markov models (HMMs) were then trained on the corpus and used for speech synthesis. We conducted experiments in which the proposed method was evaluated by 24 subjects in three scenarios: text reading, dialogue, and a domestic service robot (DSR) scenario. In the DSR scenario, we used a physical robot and compared our proposed method with a baseline method using the standard Mean Opinion Score (MOS) criterion. Our experimental results showed that the proposed method's performance (1) was at the same level as the baseline method in the text-reading scenario and (2) exceeded it in the DSR scenario. We deployed the proposed system as a cloud-based speech synthesis service so that it can be used free of charge.


ieee automatic speech recognition and understanding workshop | 2009

The Asian network-based speech-to-speech translation system

Sakriani Sakti; Noriyuki Kimura; Michael Paul; Chiori Hori; Eiichiro Sumita; Satoshi Nakamura; Jun Park; Chai Wutiwiwatchai; Bo Xu; Hammam Riza; Karunesh Arora; Chi Mai Luong; Haizhou Li

This paper outlines the first Asian network-based speech-to-speech translation system, developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. Each A-STAR member contributes one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. Currently, the system covers nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) plus English. The system's domain covers about 20,000 travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we discuss the difficulties involved in connecting various spoken language translation systems through Web servers. We also present speech-translation results from the first A-STAR demo experiments, carried out in July 2009.
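A simplified sketch of the server-chaining architecture outlined above: each member's ASR, MT, and TTS service is called in turn, with one recognition pass feeding translations and synthesis for several target languages. The stub functions and strings below are purely illustrative stand-ins for the actual Web services.

```python
# Illustrative stand-ins for the distributed ASR/MT/TTS Web services (not the real A-STAR APIs).

def asr_service(audio, lang):            # stand-in for a member's speech recognition server
    return {"ja": "ホテルはどこですか"}.get(lang, "")

def mt_service(text, src, tgt):          # stand-in for a machine translation server
    return {("ja", "en"): "Where is the hotel?"}.get((src, tgt), text)

def tts_service(text, lang):             # stand-in for a text-to-speech server
    return f"<{lang} audio for: {text}>"

def speech_to_speech(audio, src_lang, target_langs):
    """Recognize once, then translate and synthesize for every target language."""
    source_text = asr_service(audio, src_lang)
    return {tgt: tts_service(mt_service(source_text, src_lang, tgt), tgt)
            for tgt in target_langs}

print(speech_to_speech(b"...", "ja", ["en"]))
```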


Proceedings of the 7th Workshop on Asian Language Resources | 2009

Annotating Dialogue Acts to Construct Dialogue Systems for Consulting

Kiyonori Ohtake; Teruhisa Misu; Chiori Hori; Hideki Kashioka; Satoshi Nakamura

This paper introduces a new corpus of consulting dialogues, designed for training a dialogue manager that can handle consulting dialogues through spontaneous interactions, using the tagged dialogue corpus. We have collected 130 hours of consulting dialogues in the tourist guidance domain. This paper outlines our taxonomy of dialogue act annotation, which describes two aspects of an utterance: its communicative function (speech act) and its semantic content. We provide an overview of the Kyoto tour guide dialogue corpus and a preliminary analysis using the dialogue act tags.
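To make the two annotation layers concrete, here is a minimal data structure holding the communicative function and the semantic content of an utterance. The tag names and slots are hypothetical examples, not the actual tag set of the Kyoto tour guide corpus.

```python
# Illustrative only: a minimal representation of the two annotation layers described above.
from dataclasses import dataclass, field

@dataclass
class DialogueActAnnotation:
    speaker: str
    utterance: str
    speech_act: str                                        # communicative function, e.g. "inform"
    semantic_content: dict = field(default_factory=dict)   # slot-value pairs

example = DialogueActAnnotation(
    speaker="guide",
    utterance="Kinkaku-ji opens at nine in the morning.",
    speech_act="inform",
    semantic_content={"spot": "Kinkaku-ji", "opening_time": "09:00"},
)
print(example)
```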


ieee automatic speech recognition and understanding workshop | 2009

Weighted Finite State Transducer Based Statistical Dialog Management

Chiori Hori; Kiyonori Ohtake; Teruhisa Misu; Hideki Kashioka; Satoshi Nakamura

We proposed a dialog system using a weighted finite-state transducer (WFST) in which user concept tags and system action tags are the input and output of the transducer, respectively. The WFST-based platform for dialog management enables us to combine various statistical models for dialog management (DM), user input understanding, and system action generation, and then to search for the best system action in response to user inputs among multiple hypotheses. To test the potential of the WFST-based DM platform using statistical models, we constructed a dialog system from a human-to-human spoken dialog corpus for hotel reservation annotated with the Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus, then composed together and optimized. We evaluated the detection accuracy of the system's next action tags using Mean Reciprocal Rank (MRR). Finally, we constructed a full WFST-based dialog system by composing the SLU, scenario, and sentence generation (SG) WFSTs. Human judges read the system responses in natural language and rated their quality. We confirmed that the WFST-based DM platform is capable of handling a variety of spoken language inputs and scenarios when the user concept and system action tags are consistent and distinguishable.
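For reference, the Mean Reciprocal Rank used to score next-action detection can be computed as below; the action tags in the toy example are hypothetical.

```python
# Mean Reciprocal Rank: the reciprocal of the rank at which the correct action appears in the
# system's ranked hypotheses, averaged over turns.
def mean_reciprocal_rank(ranked_hypotheses, references):
    total = 0.0
    for hyps, ref in zip(ranked_hypotheses, references):
        total += 1.0 / (hyps.index(ref) + 1) if ref in hyps else 0.0
    return total / len(references)

# Toy example with hypothetical action tags.
hyps = [["ask:date", "confirm:room", "greet"], ["confirm:room", "ask:date", "greet"]]
refs = ["ask:date", "ask:date"]
print(mean_reciprocal_rank(hyps, refs))  # (1/1 + 1/2) / 2 = 0.75
```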


Computer Speech & Language | 2014

Incorporating local information of the acoustic environments to MAP-based feature compensation and acoustic model adaptation

Yu Tsao; Xugang Lu; Paul R. Dixon; Ting-yao Hu; Shigeki Matsuda; Chiori Hori

Highlights: Designing suitable prior distributions is important for MAP-based methods. We propose a framework to characterize local information about acoustic environments. With this local information, suitable prior distributions can be designed. Four algorithms to specify hyper-parameters for the prior distributions are derived. Results confirm the advantage of using local information in MAP-based methods.

The maximum a posteriori (MAP) criterion is widely used for feature compensation (FC) and acoustic model adaptation (MA) to reduce the mismatch between training and testing data sets. MAP-based FC and MA require prior densities of mapping function parameters, and designing suitable prior densities plays an important role in obtaining satisfactory performance. In this paper, we propose an environment structuring framework that provides suitable prior densities to facilitate MAP-based FC and MA for robust speech recognition. The framework is constructed as a two-stage hierarchical tree structure using environment clustering and partitioning processes. The constructed framework is highly capable of characterizing local information about complex speaker and speaking acoustic conditions. This local information is used to specify hyper-parameters of the prior densities, which are then used in MAP-based FC and MA to handle the mismatch issue. We evaluated the proposed framework on Aurora-2, a connected digit recognition task, and Aurora-4, a large-vocabulary continuous speech recognition (LVCSR) task. On both tasks, experimental results showed that, with the prepared environment structuring framework, we could obtain suitable prior densities for enhancing the performance of MAP-based FC and MA.
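As background, the sketch below shows the textbook MAP update of a Gaussian mean, with a relevance factor playing the role of the prior's hyper-parameter; the paper's contribution lies in deriving such hyper-parameters from the environment structuring framework, which this sketch does not reproduce.

```python
# Standard MAP update of a Gaussian mean (textbook form, not the paper's hyper-parameter
# selection scheme): the prior mean is interpolated with the adaptation-data statistics,
# weighted by a relevance factor tau.
def map_adapted_mean(prior_mean, frames, tau):
    """prior_mean: float; frames: list of observed feature values; tau: relevance factor."""
    n = len(frames)
    sample_mean = sum(frames) / n if n else prior_mean
    return (tau * prior_mean + n * sample_mean) / (tau + n)

print(map_adapted_mean(prior_mean=0.0, frames=[0.8, 1.2, 1.0], tau=10.0))  # pulled toward 1.0
```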


international conference on acoustics, speech, and signal processing | 2015

Speaker adaptive training for deep neural networks embedding linear transformation networks

Tsubasa Ochiai; Shigeki Matsuda; Hideyuki Watanabe; Xugang Lu; Chiori Hori; Shigeru Katagiri

Recently, a novel speaker adaptation method was proposed that applies the Speaker Adaptive Training (SAT) concept to a speech recognizer consisting of a Deep Neural Network (DNN) and a Hidden Markov Model (HMM), and its utility was demonstrated. This method implements the SAT scheme by allocating one Speaker Dependent (SD) module for each training speaker in one of the intermediate layers of the front-end DNN. It then jointly optimizes the SD modules and the remainder of the network, which is shared by all speakers. In this paper, we propose an improved version of this SAT-based adaptation scheme for a DNN-HMM recognizer. Our new training adopts a Linear Transformation Network (LTN) as the SD module, which leads to more appropriate regularization in both the SAT and adaptation stages by replacing the empirically selected anchorage of the network used for regularization in the preceding SAT-DNN-HMM with a SAT-optimized anchorage. We demonstrate the effectiveness of our proposed method on the TED Talks corpus. Our experimental results show that a speaker-adapted recognizer using our method achieves a significant word error rate reduction of 9.2 points over a baseline SI-DNN recognizer and also steadily outperforms speaker-adapted recognizers derived from the preceding SAT-based DNN-HMM.
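A rough PyTorch sketch of the idea, with assumed layer sizes and a simplified anchorage: a speaker-dependent linear transformation network is inserted into one hidden layer and regularized toward a fixed anchorage (here simply the identity transform; the paper instead replaces the empirically selected anchorage with a SAT-optimized one). This is not the authors' implementation.

```python
# Sketch only: SD linear transformation layer inside a DNN-HMM front end (sizes are assumed).
import torch
import torch.nn as nn

hidden = 512
sd_layer = nn.Linear(hidden, hidden)      # one SD module per training speaker in SAT
nn.init.eye_(sd_layer.weight)             # start as an identity transform
nn.init.zeros_(sd_layer.bias)

shared_front = nn.Sequential(nn.Linear(440, hidden), nn.Sigmoid())  # speaker-independent layers
shared_back = nn.Sequential(nn.Linear(hidden, 3000))                # senone outputs (size assumed)

def forward(x):
    return shared_back(sd_layer(shared_front(x)))

def ltn_regularizer(layer, weight=1e-2):
    """Penalize deviation from the anchorage (identity here, SAT-optimized in the paper)."""
    eye = torch.eye(layer.out_features, layer.in_features)
    return weight * (((layer.weight - eye) ** 2).sum() + (layer.bias ** 2).sum())

logits = forward(torch.randn(8, 440))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 3000, (8,))) + ltn_regularizer(sd_layer)
```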


international symposium on chinese spoken language processing | 2014

Mandarin speech recognition using convolution neural network with augmented tone features

Xinhui Hu; Xugang Lu; Chiori Hori

Due to its ability to reduce spectral variations and model spectral correlations in speech signals, the convolutional neural network (CNN) has been shown to be effective for modeling speech compared to the deep neural network (DNN). In this study, we explore applying CNNs to Mandarin speech recognition. Besides exploring an appropriate CNN architecture, we focus on investigating effective acoustic features and the effectiveness of adding tonal information, which has been shown to be helpful in other types of acoustic models, to the acoustic features used by the CNN. We conduct experiments on Mandarin broadcast speech recognition to test the effectiveness of the proposed approaches. The CNN shows a clear superiority over the DNN, with relative character error rate (CER) reductions of 7.7-13.1% for broadcast news speech (BN) and 5.4-9.9% for broadcast conversation speech (BC). As in Gaussian Mixture Model (GMM) and DNN systems, the tonal information characterized by the fundamental frequency (F0) and fundamental frequency variation (FFV) features is still found to be helpful in CNN models, achieving relative CER reductions of over 6.7% for BN and 4.3% for BC, respectively, when compared with the baseline Mel-filter-bank features.
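As a small illustration of how the tonal features enter the model, the numpy sketch below appends per-frame F0 and FFV dimensions to the filter-bank features before they reach the convolutional front end; the feature dimensions are assumed, not taken from the paper.

```python
# Illustrative only (dimensions are assumed): augment Mel-filter-bank features with tone features.
import numpy as np

frames = 200
fbank = np.random.randn(frames, 40)   # 40-dim log Mel-filter-bank features per frame
f0 = np.random.randn(frames, 1)       # fundamental frequency per frame
ffv = np.random.randn(frames, 7)      # fundamental frequency variation features per frame

augmented = np.concatenate([fbank, f0, ffv], axis=1)
print(augmented.shape)  # (200, 48): the tone-augmented input to the convolutional layers
```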

Collaboration


Dive into Chiori Hori's collaborations.

Top Co-Authors

Hideki Kashioka
National Institute of Information and Communications Technology

Satoshi Nakamura
Nara Institute of Science and Technology

Shigeki Matsuda
National Institute of Information and Communications Technology

Xugang Lu
National Institute of Information and Communications Technology

Noriyuki Kimura
National Institute of Information and Communications Technology

Yutaka Ashikari
National Institute of Information and Communications Technology

Teruhisa Misu
National Institute of Information and Communications Technology

Hisashi Kawai
National Institute of Information and Communications Technology

Kiyonori Ohtake
National Institute of Information and Communications Technology