
Publications


Featured research published by Kouichi Katsurada.


international conference on multimodal interfaces | 2003

XISL: a language for describing multimodal interaction scenarios

Kouichi Katsurada; Yusaku Nakamura; Hirobumi Yamada; Tsuneo Nitta

This paper outlines the latest version of XISL (eXtensible Interaction Scenario Language). XISL is an XML-based markup language for web-based multimodal interaction systems. It can describe the synchronization of multimodal inputs/outputs, dialog flow and transitions, and other elements required for multimodal interaction. XISL inherits these features from VoiceXML and SMIL. A distinctive feature of XISL is its modality extensibility. We present the basic XISL tags, outline the XISL execution systems, and compare XISL with other languages.
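
The kind of input synchronization that an XISL scenario declares (e.g. "accept a speech input and a touch input within a short window and treat them as one action") can be illustrated with a small Python sketch; the event names, the window length, and the fusion policy below are illustrative assumptions, not XISL syntax or the behavior of the actual execution systems.

```python
import time
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str   # e.g. "speech" or "touch" (hypothetical modality names)
    value: str      # recognized word or touched item id
    timestamp: float

def fuse(events, required_modalities, window_sec=2.0):
    """Return one fused action once all required modalities arrive within the window."""
    latest = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        latest[ev.modality] = ev
        stamps = [latest[m].timestamp for m in required_modalities if m in latest]
        if len(stamps) == len(required_modalities) and max(stamps) - min(stamps) <= window_sec:
            return {m: latest[m].value for m in required_modalities}
    return None  # incomplete input: a real scenario could reprompt or time out

now = time.time()
events = [InputEvent("speech", "put_this_there", now),
          InputEvent("touch", "item_42", now + 0.8)]
print(fuse(events, ["speech", "touch"]))  # {'speech': 'put_this_there', 'touch': 'item_42'}
```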


international conference on acoustics, speech, and signal processing | 2012

Improvement of animated articulatory gesture extracted from speech for pronunciation training

Yurie Iribe; Silasak Manosavan; Kouichi Katsurada; Ryoko Hayashi; Chunyue Zhu; Tsuneo Nitta

Computer-assisted pronunciation training (CAPT) has been introduced into language education in recent years. CAPT scores the learner's pronunciation quality and points out wrong phonemes by using speech recognition technology. However, although the learner can thus realize that his/her speech differs from the teacher's, the learner still cannot control the articulation organs to pronounce correctly and cannot understand precisely how to correct the wrong articulatory gestures. We indicate these differences by visualizing a learner's wrong pronunciation movements and the correct pronunciation movements with CG animation. We propose a system for generating animated pronunciation by automatically estimating a learner's pronunciation movements from his/her speech. The proposed system maps speech to the coordinate values needed to generate the animations by using multilayer perceptron neural networks (MLPs). We use MRI data to generate smooth animated pronunciations. Additionally, we verify through experimental evaluation whether the vocal tract area and articulatory features are suitable as characteristics of pronunciation movement.
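
As a rough sketch of this mapping stage, the snippet below trains a small multilayer perceptron to regress articulator coordinate values from acoustic feature frames; the feature dimensions, the random training data, and the use of scikit-learn are placeholder assumptions, not the paper's MRI-derived data or network configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_FRAMES, N_ACOUSTIC, N_COORDS = 500, 39, 12   # assumed: 39-dim features, 6 (x, y) points

X = rng.normal(size=(N_FRAMES, N_ACOUSTIC))    # acoustic features, one row per frame
Y = rng.normal(size=(N_FRAMES, N_COORDS))      # articulator coordinates per frame (placeholder)

# Train the speech-to-coordinate regressor.
mlp = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
mlp.fit(X, Y)

# Estimate coordinates for new speech frames; interpolating between frames
# would then drive a smooth CG animation of the articulators.
coords = mlp.predict(rng.normal(size=(10, N_ACOUSTIC)))
print(coords.shape)  # (10, 12)
```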


international workshop on machine learning for signal processing | 2013

Voice conversion for arbitrary speakers using articulatory-movement to vocal-tract parameter mapping

Narpendyah Wisjnu Ariwardhani; Yurie Iribe; Kouichi Katsurada; Tsuneo Nitta

In this paper, we propose voice conversion based on articulatory-movement (AM) to vocal-tract parameter (VTP) mapping. An artificial neural network (ANN) is applied to map AM to VTP and to convert the source speaker's voice to the target speaker's voice. The proposed system not only performs text-independent voice conversion but can also be used with an arbitrary source speaker; our approach requires no source-speaker data to build the voice conversion model, and source-speaker data is needed only during the testing phase. Preliminary cross-lingual voice conversion experiments are also conducted. The converted voices were evaluated using subjective and objective measures to compare the performance of our proposed ANN-based voice conversion (VC) with that of state-of-the-art Gaussian mixture model (GMM)-based VC. The experimental results show that the converted voice is intelligible and carries the speaker individuality of the target speaker.
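
A minimal sketch of the AM-to-VTP mapping network is shown below, written as a one-hidden-layer network in plain NumPy; the dimensions, random training pairs, and training loop are assumptions and do not reproduce the paper's architecture or the GMM baseline.

```python
import numpy as np

rng = np.random.default_rng(1)
AM_DIM, VTP_DIM, HIDDEN = 18, 25, 64      # assumed feature dimensions

# Target-speaker training pairs (random placeholders). Only target-speaker data
# is needed to build the mapping, which is why an arbitrary source speaker can
# be converted at test time.
X = rng.normal(size=(1000, AM_DIM))
Y = rng.normal(size=(1000, VTP_DIM))

W1 = rng.normal(scale=0.1, size=(AM_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, VTP_DIM)); b2 = np.zeros(VTP_DIM)

lr = 0.01
for _ in range(200):                       # plain batch gradient descent on MSE
    H = np.tanh(X @ W1 + b1)               # hidden activations
    P = H @ W2 + b2                        # predicted VTPs
    G = 2.0 * (P - Y) / len(X)             # gradient of mean squared error w.r.t. P
    GH = (G @ W2.T) * (1.0 - H ** 2)       # backpropagate through tanh
    W2 -= lr * (H.T @ G);  b2 -= lr * G.sum(axis=0)
    W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(axis=0)

# Map AM features estimated from a source speaker's speech to target-speaker VTPs.
vtp = np.tanh(rng.normal(size=(5, AM_DIM)) @ W1 + b1) @ W2 + b2
print(vtp.shape)  # (5, 25)
```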


IEICE Transactions on Information and Systems | 2006

PS-ZCPA Based Feature Extraction with Auditory Masking, Modulation Enhancement and Noise Reduction for Robust ASR

Muhammad Ghulam; Takashi Fukuda; Kouichi Katsurada; Junsei Horikawa; Tsuneo Nitta

A pitch-synchronous (PS) auditory feature extraction method based on ZCPA (Zero-Crossings Peak-Amplitudes) was proposed previously and showed greater robustness than conventional ZCPA- and MFCC-based features. In this paper, firstly, a non-linear adaptive threshold adjustment procedure is introduced into the PS-ZCPA method to obtain optimal results in noisy conditions with different signal-to-noise ratios (SNRs). Next, auditory masking, a well-known property of auditory perception, and modulation enhancement, which exploits the strong relationship between modulation spectra and speech intelligibility, are embedded into the PS-ZCPA method. Finally, a Wiener-filter-based noise reduction procedure is integrated into the method to make it more noise-robust, and the performance is evaluated against ETSI ES202 (WI008), a standard front-end for distributed speech recognition. All experiments were carried out on the Aurora-2J database. The experimental results demonstrated improved performance of the PS-ZCPA method when auditory masking was embedded into it, and slightly improved performance with modulation enhancement. The PS-ZCPA method with Wiener-filter-based noise reduction also outperformed ETSI ES202 (WI008).
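
The basic ZCPA idea that PS-ZCPA extends can be sketched as follows: in each auditory band, upward zero-crossing intervals vote into a frequency histogram, weighted by the log-compressed peak amplitude between crossings. The band layout, bin edges, and test signal below are assumptions, and the pitch synchronization, adaptive threshold, masking, modulation enhancement, and Wiener-filter stages are not reproduced.

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 8000
t = np.arange(0, 0.05, 1 / FS)
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)  # toy signal

bands = [(100, 700), (700, 2500)]             # two coarse band-pass channels (assumed)
bin_edges = np.linspace(0, 3000, 31)          # 30 frequency bins of 100 Hz
hist = np.zeros(len(bin_edges) - 1)

for lo, hi in bands:
    b, a = butter(4, [lo / (FS / 2), hi / (FS / 2)], btype="band")
    y = lfilter(b, a, x)
    up = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # upward zero-crossings
    for i0, i1 in zip(up[:-1], up[1:]):
        freq = FS / (i1 - i0)                       # inverse interval -> frequency estimate
        peak = np.max(np.abs(y[i0:i1]))             # peak amplitude between crossings
        k = np.searchsorted(bin_edges, freq) - 1
        if 0 <= k < len(hist):
            hist[k] += np.log1p(peak)               # log-compressed peak weighting

print(hist.round(2))   # energy concentrates near the 300 Hz and 1200 Hz bins
```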


robot and human interactive communication | 2004

Activities of Interactive Speech Technology Consortium (ISTC) targeting open software development for MMI systems

Tsuneo Nitta; Shigeki Sagayama; Yoichi Yamashita; Tatsuya Kawahara; Shigeo Morishima; Shizuka Nakamura; Atsushi Yamada; Koji Ito; M. Kai; A. Li; Masato Mimura; Keikichi Hirose; Takao Kobayashi; Keiichi Tokuda; Nobuaki Minematsu; Yasuharu Den; Takehito Utsuro; Tatsuo Yotsukura; Hiroshi Shimodaira; M. Araki; Takuya Nishimoto; N. Kawaguchi; H. Banno; Kouichi Katsurada

The Interactive Speech Technology Consortium (ISTC), established in November 2003 after three years of activity by the Galatea project supported by the Information-technology Promotion Agency (IPA) of Japan, aims to support open-source, free software development of multi-modal interaction (MMI) for human-like agents. The software, named Galatea-toolkit and developed by 24 researchers at 16 research institutes in Japan, includes a Japanese speech recognition engine, a Japanese speech synthesis engine, and a facial image synthesis engine used for developing an anthropomorphic agent, as well as a dialogue manager that integrates multiple modalities, interprets them, and decides on an action, distributing it across the output media of voice and facial expression. Every year, ISTC provides members with a one-day technical seminar and a one-week training course on Galatea-toolkit, as well as a software set (CD-ROM).


asian conference on computer vision | 2016

Lip Reading from Multi View Facial Images Using 3D-AAM

Takuya Watanabe; Kouichi Katsurada; Yasushi Kanazawa

Lip reading is a technique for recognizing spoken words based on lip movement. In this process, it is important to detect the correct features of the facial images. However, detection is not easy in real situations because the facial images may be taken from various angles. To cope with this problem, lip reading from multi-view facial images has been studied at several research institutes. In this paper, we propose a lip reading approach that uses 3D Active Appearance Model (AAM) features and a Hidden Markov Model (HMM)-based recognition model. The AAM is a parametric model constructed from both shape and appearance parameters. These parameters are compressed into combination parameters in the AAM and are used in lip reading and other facial image processing applications. The 3D-AAM extends the traditional 2D shape model to a 3D shape model built from three different view angles (frontal, left, and right profile). It provides an effective algorithm to align the model with the RGB and 3D range images obtained by an RGB-D camera. The benefit of using the 3D-AAM in lip reading is that it enables the recognition of spoken words from facial images taken at any angle. In the experiment, we compared the accuracy of lip reading using the 3D-AAM with that of the traditional 2D-AAM on facial images at various angles. Based on the results, we confirmed that the 3D-AAM is effective in cross-view lip reading even though only frontal images are used in the HMM training phase.
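
The HMM recognition stage can be sketched as below: one HMM per word is trained on sequences of AAM combination parameters, and a test sequence is assigned to the word whose model gives the highest log-likelihood. The AAM features here are random placeholders rather than real 3D-AAM parameters, and hmmlearn is an assumed dependency used only for illustration.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed dependency, not used in the paper

rng = np.random.default_rng(2)
AAM_DIM = 20                            # assumed dimensionality of AAM combination parameters

def make_sequences(n_seq, mean):
    """Generate placeholder AAM-parameter sequences of varying length."""
    seqs = [rng.normal(loc=mean, size=(rng.integers(20, 30), AAM_DIM)) for _ in range(n_seq)]
    return np.concatenate(seqs), [len(s) for s in seqs]

# Train one word model per vocabulary entry.
models = {}
for word, mean in [("hello", 0.0), ("goodbye", 1.0)]:
    X, lengths = make_sequences(10, mean)
    m = GaussianHMM(n_components=5, covariance_type="diag", n_iter=20, random_state=2)
    m.fit(X, lengths)
    models[word] = m

# Recognition: pick the word model with the highest log-likelihood.
test = rng.normal(loc=1.0, size=(25, AAM_DIM))           # unseen "goodbye"-like sequence
print(max(models, key=lambda w: models[w].score(test)))  # -> "goodbye"
```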


2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA) | 2014

Novel two-stage model for grapheme-to-phoneme conversion using new grapheme generation rules

Seng Kheang; Kouichi Katsurada; Yurie Iribe; Tsuneo Nitta

The quality of grapheme-to-phoneme (G2P) conversion plays an important role in developing high-quality speech synthesis systems. Because many problems regarding G2P conversion have been reported, we propose a novel two-stage model-based approach, implemented using an existing Weighted Finite-State Transducer-based G2P conversion framework, to improve the performance of the G2P conversion model. The first-stage model is built for automatic conversion of words to phonemes, while the second-stage model uses the input graphemes and the output phonemes obtained from the first stage to determine the best final output phoneme sequence. Additionally, we design new grapheme generation rules that add extra detail to the vowel graphemes appearing within a word. Compared with previous approaches, the evaluation results show that our approach slightly improves the accuracy on the out-of-vocabulary dataset and consistently increases the accuracy on the in-vocabulary dataset.
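
The two-stage idea can be illustrated with a toy example in which the second stage sees both the input graphemes and the first stage's phoneme hypotheses and corrects them; the hand-written rules below are invented stand-ins for the paper's WFST-based models and its actual grapheme generation rules.

```python
# Stage 1: a naive per-letter grapheme-to-phoneme map (placeholder).
STAGE1 = {"p": "P", "h": "H", "o": "OW", "n": "N", "e": "IY"}

def stage1(word):
    return [STAGE1.get(g, g.upper()) for g in word]

# Stage 2: operates on aligned (grapheme, stage-1 phoneme) pairs, so it can use
# context from both streams to repair first-stage errors.
def stage2(word, phonemes):
    out = []
    for i, (g, p) in enumerate(zip(word, phonemes)):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        if g == "p" and nxt == "h":
            out.append("F")          # "ph" -> F
        elif g == "h" and prev == "p":
            continue                 # the "h" was already folded into F
        elif g == "e" and nxt is None:
            continue                 # word-final silent "e"
        else:
            out.append(p)
    return out

word = "phone"
print(stage1(word))                 # ['P', 'H', 'OW', 'N', 'IY']
print(stage2(word, stage1(word)))   # ['F', 'OW', 'N']
```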


international conference on multimodal interfaces | 2008

A browser-based multimodal interaction system

Kouichi Katsurada; Teruki Kirihata; Masashi Kudo; Junki Takada; Tsuneo Nitta

In this paper, we propose a system that enables users to have multimodal interactions (MMI) with an anthropomorphic agent via a web browser. Using the system, a user can interact simply by accessing a web site from his/her web browser. A notable characteristic of the system is that the anthropomorphic agent is synthesized from a photograph of a real human face. This makes it possible to construct a web site whose owner's facial agent speaks with visitors to the site. This paper describes the structure of the system and provides a screenshot.
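
A minimal sketch of the browser/server split is given below: the browser posts the user's input and receives the agent's reply together with parameters for rendering the facial agent on the client side. The Flask stack, endpoint name, and response fields are assumptions for illustration, not the system's actual implementation.

```python
from flask import Flask, jsonify, request   # assumed web stack, for illustration only

app = Flask(__name__)

@app.route("/mmi/turn", methods=["POST"])
def turn():
    data = request.get_json(silent=True) or {}
    user_input = data.get("utterance", "")
    reply = f"You said: {user_input}"          # placeholder for real dialog management
    # The browser would feed these values to speech synthesis and the
    # photograph-based facial agent it renders locally.
    return jsonify({"agent_speech": reply,
                    "face_params": {"mouth_open": 0.6, "expression": "smile"}})

if __name__ == "__main__":
    app.run(port=8080)
```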


IEICE Transactions on Information and Systems | 2008

Canonicalization of Feature Parameters for Robust Speech Recognition Based on Distinctive Phonetic Feature (DPF) Vectors

Mohammad Nurul Huda; Muhammad Ghulam; Takashi Fukuda; Kouichi Katsurada; Tsuneo Nitta

This paper describes a robust automatic speech recognition (ASR) system with reduced computation. The acoustic models of a hidden Markov model (HMM)-based classifier include various hidden factors such as speaker-specific characteristics, coarticulation, and the acoustic environment. If there exists a canonicalization process that can recover the margin of acoustic likelihood between correct phonemes and other phonemes that is degraded by these hidden factors, the robustness of ASR systems can be improved. In this paper, we introduce a canonicalization method composed of multiple distinctive phonetic feature (DPF) extractors, each corresponding to the canonicalization of one hidden factor, and a DPF selector that selects an optimum DPF vector as the input to the HMM-based classifier. The proposed method resolves gender factors and speaker variability, and eliminates noise factors, by applying canonicalization based on the DPF extractors and two-stage Wiener filtering. In experiments on AURORA-2J, the proposed method provides higher word accuracy under clean training and a significant improvement in word accuracy at low signal-to-noise ratios (SNRs) under multi-condition training, compared to a standard ASR system with mel-frequency cepstral coefficient (MFCC) parameters. Moreover, the proposed method requires only two-fifths as many Gaussian mixture components and less memory to achieve accurate ASR.
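
The extractor/selector structure can be sketched as follows: several DPF extractors, each tuned to a different hidden factor, produce candidate DPF vectors, and a selector passes the most canonical-looking one to the HMM classifier. The extractor outputs, reference patterns, and nearest-reference selection criterion below are placeholders, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(3)
DPF_DIM = 15                       # assumed number of distinctive phonetic features

# Canonical DPF patterns for a few phonemes (random placeholders).
reference_dpf = {"a": rng.uniform(size=DPF_DIM), "i": rng.uniform(size=DPF_DIM)}

def selector(candidates, references):
    """Pick the candidate DPF vector closest to any canonical reference pattern."""
    def best_distance(v):
        return min(np.linalg.norm(v - r) for r in references.values())
    return min(candidates, key=best_distance)

# Outputs of three hypothetical DPF extractors (e.g. male, female, noise-adapted)
# for one speech frame.
candidates = [rng.uniform(size=DPF_DIM) for _ in range(3)]
chosen = selector(candidates, reference_dpf)
print(chosen.shape)                # the selected vector is fed to the HMM-based classifier
```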


Archive | 2005

XISL: A Modality-Independent MMI Description Language

Kouichi Katsurada; Hirobumi Yamada; Yusaku Nakamura; Satoshi Kobayashi; Tsuneo Nitta

In this chapter we outline XISL (eXtensible Interaction Scenario Language), a multimodal interaction description language developed to describe MMI scenarios. The main feature of XISL is that it allows modalities to be described flexibly, which makes it easy to add new modalities or to modify existing modalities in MMI systems. Moreover, XISL is described separately from XML or HTML content, making both the XISL and the XML (HTML) documents more reusable. We constructed three types of XISL execution systems, namely a PC terminal, a PDA terminal, and a mobile phone terminal, and demonstrate the descriptive power of XISL by implementing an online shopping application on these terminals.
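
The reusability that comes from separating the interaction scenario from the content can be illustrated with a small sketch in which the same scenario is bound to different content documents; the dictionary-based scenario below is a Python stand-in, not XISL syntax.

```python
# One interaction scenario (dialog flow plus accepted modalities per step) ...
scenario = [
    {"prompt": "greeting",  "accept": ["speech", "touch"]},
    {"prompt": "item_list", "accept": ["speech", "touch", "key"]},
]

# ... reused with content documents tailored to different terminals.
content_pc    = {"greeting": "Welcome to the shop.", "item_list": "Monitors, keyboards, mice."}
content_phone = {"greeting": "Welcome (mobile).",    "item_list": "Top three items only."}

def run(scenario, content):
    for step in scenario:
        print(content[step["prompt"]], "| accepted modalities:", ", ".join(step["accept"]))

run(scenario, content_pc)     # the same scenario bound to the PC content
run(scenario, content_phone)  # ... and to the mobile-phone content
```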

Collaboration


Dive into Kouichi Katsurada's collaborations.

Top Co-Authors

Tsuneo Nitta (Toyohashi University of Technology)
Yurie Iribe (Aichi Prefectural University)
Hirobumi Yamada (Toyohashi University of Technology)
Ryo Taguchi (Nagoya Institute of Technology)
Shuji Shinohara (Toyohashi University of Technology)
Masashi Kimura (Toyohashi University of Technology)
Satoshi Kobayashi (Toyohashi University of Technology)
Yusaku Nakamura (Toyohashi University of Technology)
Kheang Seng (Toyohashi University of Technology)
Muhammad Ghulam (Toyohashi University of Technology)