Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Hiromichi Kawanami is active.

Publication


Featured research published by Hiromichi Kawanami.


Speech Communication | 2002

Temporal rate change of dialogue speech in prosodic units as compared to read speech

Keikichi Hirose; Hiromichi Kawanami

A comparative study on speech rate was conducted between dialogue speech and read speech of Japanese. Based on a model of fundamental frequency contour generation, four prosodic units, prosodic sentence, clause, phrase and word, are defined. Speech rate was analyzed with respect to these units, especially to prosodic phrases. In order to suppress various factors affecting speech rate, and to clarify features of dialogue speech, mora reduction rate of dialogue speech as compared to its read speech counterpart was defined and was used for the analysis. Through the analysis of speech samples recorded during simulated dialogues, it was found that, in a prosodic phrase, dialogue speech rate starts with a value slightly larger than that of read speech. Then, it gradually increases and after passing through the middle of the phrase, decreases. This result was also supported through a linear regression analysis and a hearing test of synthetic speech.
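The mora reduction rate used in this analysis can be sketched as below. The paper's exact formula is not reproduced here, so the per-mora definition (positive when the dialogue mora is shorter than its read counterpart) and the toy durations are assumptions.

```python
def mora_reduction_rate(read_durations, dialogue_durations):
    """Per-mora reduction rate of dialogue speech relative to read speech.

    One plausible definition (the paper's exact formula may differ):
    r_i = (d_read_i - d_dial_i) / d_read_i, so r_i > 0 means the
    dialogue rendition of mora i is shorter (spoken faster).
    """
    if len(read_durations) != len(dialogue_durations):
        raise ValueError("utterances must be mora-aligned")
    return [(r - d) / r for r, d in zip(read_durations, dialogue_durations)]

# Toy durations (seconds) of the same prosodic phrase, mora by mora.
read = [0.12, 0.11, 0.10, 0.11, 0.13]
dial = [0.11, 0.09, 0.08, 0.09, 0.13]
rates = mora_reduction_rate(read, dial)
```

Averaging such rates over position within a prosodic phrase is what reveals the rise-then-fall pattern reported above.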


international conference on spoken language processing | 1996

Synthesizing dialogue speech of Japanese based on the quantitative analysis of prosodic features

Keikichi Hirose; Mayumi Sakata; Hiromichi Kawanami

Through analyses of the fundamental frequency contours and speech rates of dialogue speech and of read speech, prosodic rules were derived for the synthesis of spoken dialogue. The fundamental frequency contours were first decomposed into phrase and accent components based on the superpositional model, and their command magnitudes/amplitudes were then analyzed by multiple regression analysis. For the speech rate, the reduction rate of mora duration from reading style to dialogue style was calculated. After normalizing the sentence length, the mean reduction rate was calculated as an average over utterances without a complicated syntactic structure. The results of these analyses were incorporated into prosodic rules for dialogue speech synthesis. Using a previously developed formant speech synthesizer, synthesis was conducted with both the former rules for read speech and the newly developed rules. A hearing test showed that the new rules produce better prosody for dialogue speech.
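The superpositional model referenced here is commonly known as the Fujisaki model, in which the log-F0 contour is the sum of a baseline, phrase components and accent components. A minimal sketch of its standard formulation follows; the time constants and command values are illustrative, not taken from the paper.

```python
import math

def phrase_component(t, alpha=3.0):
    # Impulse response of the phrase control mechanism: Gp(t) = a^2 t e^{-a t}.
    return alpha**2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_component(t, beta=20.0, gamma=0.9):
    # Step response of the accent control mechanism, with ceiling gamma.
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase components + sum of accent components.

    phrase_cmds: list of (magnitude Ap, onset T0)
    accent_cmds: list of (amplitude Aa, onset T1, offset T2)
    """
    y = math.log(fb)
    for ap, t0 in phrase_cmds:
        y += ap * phrase_component(t - t0)
    for aa, t1, t2 in accent_cmds:
        y += aa * (accent_component(t - t1) - accent_component(t - t2))
    return y

# A one-phrase, one-accent utterance; all command values are illustrative.
f0_hz = [math.exp(log_f0(t / 100.0, fb=80.0,
                         phrase_cmds=[(0.5, 0.0)],
                         accent_cmds=[(0.4, 0.3, 0.8)]))
         for t in range(150)]
```

The regression analysis in the paper predicts the command magnitudes/amplitudes (Ap, Aa) from linguistic factors; the contour generation itself follows this superposition.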


international conference on robot communication and coordination | 2007

Voice activity detection applied to hands-free spoken dialogue robot based on decoding using acoustic and language model

Hiroyuki Sakai; Tobias Cincarek; Hiromichi Kawanami; Hiroshi Saruwatari; Kiyohiro Shikano; Akinobu Lee

Speech recognition and speech-based dialogue are means for realizing communication between humans and robots. In a conventional system setup, a headset or a directional microphone is used to collect speech with a high signal-to-noise ratio (SNR). However, the user must wear a microphone or approach the system closely for interaction. It is therefore preferable to develop a hands-free speech recognition system which enables the user to speak to the system from a distance. To collect speech from distant speakers, a microphone array is usually employed. However, the SNR degrades in a real environment because of the various kinds of background noise present besides the user's utterance. This most often decreases speech recognition performance, and no reliable speech dialogue would be possible. Voice activity detection (VAD) is a method to detect the user-utterance part of the input signal. If VAD fails, all subsequent processing steps, including speech recognition and dialogue, will not work. Conventional VAD based on amplitude level and zero-crossing count is difficult to apply to hands-free speech recognition, because speech detection will most often fail due to the low SNR. This paper proposes a VAD method, applied to hands-free speech recognition, based on an acoustic model (AM) for background noise and on the speech recognition algorithm itself. There are always non-speech segments at the beginning and end of each user utterance. The proposed VAD approach compares the likelihood of phoneme and silence segments in the top recognition hypotheses during decoding. We implemented the proposed method for the open-source speech recognition engine Julius. Experimental results for various SNR conditions show that the proposed method attains higher VAD accuracy and a higher recognition rate than conventional VAD.
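The likelihood comparison at the core of the proposed VAD can be illustrated in a much-simplified frame-wise form. The sketch below scores each frame under a speech model and a silence/noise model and keeps frames where the speech model wins; the actual method performs this comparison on hypotheses inside the decoder, and the single-Gaussian models and toy features here are assumptions.

```python
import math

def gauss_loglik(x, mean, var):
    # Log-likelihood under a diagonal-covariance Gaussian (one mixture, for brevity).
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def vad_decision(frames, speech_model, silence_model, margin=0.0):
    """Mark a frame as speech when the speech model's log-likelihood
    exceeds the silence model's by `margin`. This frame-wise rule is a
    simplification of the decoder-integrated comparison in the paper."""
    decisions = []
    for x in frames:
        ls = gauss_loglik(x, *speech_model)
        ln = gauss_loglik(x, *silence_model)
        decisions.append(ls - ln > margin)
    return decisions

# Toy 2-dimensional "features": silence near 0, speech near 5.
speech = ([5.0, 5.0], [1.0, 1.0])    # (mean, variance)
silence = ([0.0, 0.0], [1.0, 1.0])
frames = [[0.1, -0.2], [4.8, 5.3], [5.1, 4.9], [0.0, 0.2]]
flags = vad_decision(frames, speech, silence)
```

Unlike amplitude or zero-crossing heuristics, this decision degrades gracefully as the noise model absorbs the background, which is the motivation for the model-based approach.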


IEICE Transactions on Information and Systems | 2008

Development, Long-Term Operation and Portability of a Real-Environment Speech-Oriented Guidance System

Tobias Cincarek; Hiromichi Kawanami; Ryuichi Nisimura; Akinobu Lee; Hiroshi Saruwatari; Kiyohiro Shikano

In this paper, the development, long-term operation and portability of a practical ASR application in a real environment are investigated. The target application is a speech-oriented guidance system installed at a local community center. The system has been exposed to ordinary people since November 2002. More than 300 hours, or more than 700,000 inputs, have been collected over four years. The outcome is a rare example of a large-scale real-environment speech database. A simulation experiment is carried out with this database to investigate how the system's performance improves during the first two years of operation. The purpose is to determine empirically the amount of real-environment data which has to be prepared to build a system with reasonable speech recognition performance and response accuracy. Furthermore, the relative importance of developing the main system components, i.e. the speech recognizer and the response generation module, is assessed. Although this depends on the system's modeling capacity and domain complexity, experimental results show that overall performance stagnates after employing about 10–15 k utterances for training the acoustic model, 40–50 k utterances for training the language model and 40–50 k utterances for compiling the question-and-answer database. The Q&A database was most important for improving the system's response accuracy. Finally, the portability of the well-trained first system prototype to a different environment, a local subway station, is investigated. Since collection and preparation of large amounts of real data is impractical in general, only one month of data from the new environment is employed for system adaptation. While the speech recognition component of the first prototype has a high degree of portability, the response accuracy is lower than in the first environment. The main reason is the domain difference between the two systems, since they are installed in different environments. This implies that it is imperative to take the behavior of users under real conditions into account to build a system with high user satisfaction.


international conference on acoustics, speech, and signal processing | 2009

Hands-free speech recognition challenge for real-world speech dialogue systems

Hiroshi Saruwatari; Hiromichi Kawanami; Shota Takeuchi; Yu Takahashi; Tobias Cincarek; Kiyohiro Shikano

In this paper, we describe and review our recent development of a hands-free speech dialogue system used for railway station guidance. In the application at a real railway station, robustness against reverberation and noise is the most essential issue for the dialogue system. To address the problem, we introduce two key techniques in our proposed hands-free system: (a) speech dialogue system construction based on real speech database collection and language/acoustic model improvement, and (b) microphone array preprocessing using a blind spatial subtraction array, which can solve the reverberation-naiveness problem inherent in conventional microphone arrays. The experimental assessment of the proposed dialogue system reveals that our system provides a recognition accuracy of more than 80% under realistic railway-station conditions.
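The subtraction step behind a blind spatial subtraction array can be sketched as plain magnitude-domain spectral subtraction. In BSSA the noise estimate is obtained blindly (e.g. via ICA) from the array signals; here it is simply given, and the oversubtraction factor, flooring and toy spectra are assumptions.

```python
import math

def spectral_subtract(primary_mag, noise_mag, beta=1.0, floor=0.1):
    """Magnitude-domain spectral subtraction, the final stage of a
    subtraction-type array: a noise magnitude estimate is subtracted in
    the power domain from the primary (beamformer) output, with flooring
    to avoid negative magnitudes."""
    out = []
    for p, n in zip(primary_mag, noise_mag):
        s = math.sqrt(max(p**2 - beta * n**2, (floor * p) ** 2))
        out.append(s)
    return out

# Toy magnitude spectra for one analysis frame (arbitrary units).
primary = [1.0, 2.0, 0.5, 3.0]
noise = [0.8, 0.5, 0.6, 1.0]
enhanced = spectral_subtract(primary, noise)
```

The key design point is that the noise estimate comes from the signals themselves rather than a fixed noise assumption, which is what makes the scheme usable in an unpredictable station environment.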


international conference on multimedia and expo | 2016

Chat robot coupling machine responses and social media comments for continuous conversation

Hidekazu Minami; Hiromichi Kawanami; Masayuki Kanbara; Norihiro Hagita

In this paper we propose a communicative robot that facilitates conversation with a user who, for whatever reason, lacks opportunities for verbal communication with others (e.g. the elderly living alone). As a first trial, the system is designed to support a user in talking while watching a TV program. It features two conversation techniques: realizing natural response timing, and providing interesting utterances, related to the TV program the user is watching, from social media networks. To achieve natural response timing, the proposed system includes three response functions: backchannel, repetition and machine answering. While the system keeps talking using these three kinds of responses, it searches social networking services (SNS) for human text comments about the TV program. When a comment is found, the robot outputs it by synthetic speech. A preliminary experiment was conducted to evaluate how the proposed response technique encourages a user to speak to the chat robot. The results show that the average number of user utterances with the proposed robot, using social media networks and the above three functions, is significantly higher than when the chat robot only outputs social media comments.
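The response selection described above can be sketched roughly as follows. The random fallback policy, the canned strings and the function name are illustrative stand-ins; the paper's timing model for backchannel, repetition and machine answering is more elaborate.

```python
import random

def choose_response(user_utterance, sns_comment=None):
    """Pick the robot's next utterance: an SNS comment when one has been
    found, otherwise one of the three fallback response types
    (backchannel, repetition, machine answering). The uniform random
    policy and canned phrases here are stand-ins, not the paper's model."""
    if sns_comment is not None:
        return sns_comment
    kind = random.choice(["backchannel", "repetition", "answer"])
    if kind == "backchannel":
        return "uh-huh"
    if kind == "repetition":
        return user_utterance  # echo the user's words back
    return "I'm not sure, tell me more."  # placeholder machine answer
```

The fallback responses keep the exchange alive between SNS hits, which is what the preliminary experiment credits for the higher utterance counts.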


international conference on social robotics | 2017

A TV Chat Robot with Time-Shifting Function for Daily-Use Communication

Shogo Nishimura; Hiromichi Kawanami; Masayuki Kanbara; Norihiro Hagita

This paper attempts to improve a chat robot that enables a user to chat while watching television programs. The previously developed robot, which draws its chat sources from social media such as Twitter, has two problems. First, social media comments lag the TV program by a few seconds, so the robot's utterances do not relate to the scene being broadcast. Second, the robot's utterances overlap with the TV audio. These problems often make users uncomfortable and discourage daily-use communication. This paper therefore introduces a time-shifting method that addresses both problems simultaneously. Subjective evaluation experiments with 12 subjects and three kinds of TV programs compare the time-shifted robot with the previous one. The results show that the robot with time shifting improves user satisfaction in terms of the synchronization of TV scenes and chat sources.
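The time-shifting idea can be sketched as a simple scheduling rule: if playback is delayed by enough to absorb the typical comment lag, each comment can be uttered exactly when its scene appears. The shift and lag values below are illustrative, not from the paper.

```python
def schedule_comments(comments, shift, avg_lag):
    """Map each comment's post time to a playback time at which the robot
    should utter it. With the program played back `shift` seconds late, a
    comment posted `avg_lag` seconds after its scene aligns with that
    scene when uttered at post_time - avg_lag + shift. (Both parameters
    are illustrative; the paper's method may estimate them differently.)"""
    return [(t - avg_lag + shift, text) for t, text in comments]

# Comments posted ~8 s after their scenes; playback delayed by 10 s.
comments = [(20.0, "great goal!"), (45.0, "lol")]
scheduled = schedule_comments(comments, shift=10.0, avg_lag=8.0)
```

Because the shift exceeds the lag, every scheduled time falls after the comment actually exists, so the robot never waits on an unposted comment.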


2009 Oriental COCOSDA International Conference on Speech Database and Assessments | 2009

Unknown example detection for example-based spoken dialog system

Shota Takeuchi; Hiromichi Kawanami; Hiroshi Saruwatari; Kiyohiro Shikano

In a spoken dialog system, the example-based response generation method generates a response by searching a dialog example database for the example question most similar to the input user utterance. This method has the advantage of easy system expansion. It requires, however, a large number of utterance examples labeled with their correct responses. In this paper, we propose an approach to reducing the system expansion cost. The approach employs a detection method that screens for unknown examples, i.e. the utterances to be added to the database together with their correct responses. Experimental results show that the method can reduce the number of utterances that need to be labeled while achieving a response-accuracy improvement comparable to that of full labeling.
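Unknown-example screening of this kind can be sketched with a similarity threshold against the example database. The cosine-over-bag-of-words similarity and the threshold value are assumptions; the paper's detector may differ.

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words dicts.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect_unknown(utterance_bow, example_db, threshold=0.5):
    """Flag an input as an 'unknown example' (worth sending to a human
    labeler) when its best match in the example database falls below a
    similarity threshold. Both the measure and the threshold are
    stand-ins for the paper's detector."""
    best = max((cosine(utterance_bow, ex) for ex in example_db), default=0.0)
    return best < threshold

db = [{"where": 1, "is": 1, "restroom": 1}, {"train": 1, "time": 1}]
known = {"where": 1, "restroom": 1}
novel = {"weather": 1, "tomorrow": 1}
```

Only flagged utterances go to the labeler, which is how the approach cuts labeling cost relative to labeling everything.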


Natural Interaction with Robots, Knowbots and Smartphones, Putting Spoken Dialog Systems into Practice | 2014

Evaluation of Invalid Input Discrimination Using Bag-of-Words for Speech-Oriented Guidance System

Haruka Majima; Rafael Torres; Hiromichi Kawanami; Sunao Hara; Tomoko Matsui; Hiroshi Saruwatari; Kiyohiro Shikano

We investigate a method for discriminating between invalid and valid inputs received by a speech-oriented guidance system operating in a real environment. Invalid inputs include background voices, which are not directly uttered to the system, and nonsense utterances. Such inputs should be rejected beforehand. We have reported methods using not only the likelihood values of Gaussian mixture models (GMM) but also other information in the inputs, such as bag-of-words features, utterance duration, and signal-to-noise ratio, to discriminate invalid inputs from valid ones. To combine these multiple sources of information, we used a support vector machine (SVM) with a radial basis function kernel and a maximum entropy (ME) method, and compared their performance. In this paper, we compare the performance while varying the amount of training data. In the experiments, we achieve an F-measure of 87.01% for the SVM and 83.73% for ME using 3,000 training samples, while the F-measure of the GMM-based baseline method is 81.73%.
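The combination of cues can be sketched as a feature vector fed to an RBF-kernel SVM. The sketch below builds such a vector and evaluates the RBF kernel itself (the SVM training machinery is omitted); the feature order, scaling and gamma value are illustrative.

```python
import math

def make_feature(gmm_loglik, bow_counts, duration_s, snr_db, vocab):
    """Concatenate the cues the paper combines: GMM log-likelihood,
    bag-of-words counts, utterance duration and SNR. The ordering and
    lack of scaling here are illustrative simplifications."""
    return [gmm_loglik] + [bow_counts.get(w, 0) for w in vocab] + [duration_s, snr_db]

def rbf_kernel(x, y, gamma=0.1):
    # Radial basis function kernel used by the SVM: k(x, y) = exp(-gamma * ||x - y||^2).
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

# A plausible valid inquiry vs. a short, noisy background-voice input.
vocab = ["where", "station", "thanks"]
valid = make_feature(-42.0, {"where": 1, "station": 1}, 1.8, 15.0, vocab)
invalid = make_feature(-65.0, {}, 0.4, 3.0, vocab)
k = rbf_kernel(valid, invalid)
```

In practice the raw features would be normalized before kernel evaluation; otherwise the log-likelihood term dominates the distance, as it does in this toy example.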


Natural Interaction with Robots, Knowbots and Smartphones, Putting Spoken Dialog Systems into Practice | 2014

Topic Classification of Spoken Inquiries Using Transductive Support Vector Machine

Rafael Torres; Hiromichi Kawanami; Tomoko Matsui; Hiroshi Saruwatari; Kiyohiro Shikano

In this work, we address the topic classification of spoken inquiries in Japanese received by a guidance system operating in a real environment, using a semi-supervised learning approach based on a transductive support vector machine (TSVM). Manual data labeling, which is required for supervised learning, is a costly process, whereas unlabeled data are usually abundant and cheap to obtain. A TSVM can exploit partially labeled data for semi-supervised learning, including both labeled and unlabeled samples in the training set. We are interested in evaluating the influence of including unlabeled samples in the training of the topic classification models, as well as how many of them may be necessary to improve performance. Experimental results show that this approach can be useful for taking advantage of unlabeled samples, especially when using larger unlabeled datasets. In particular, we found gains in classification performance for specific topics, such as city information, with a 6.30% F-measure improvement in the case of children's inquiries and 7.63% for access information in the case of adults' inquiries.
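The standard library has no TSVM, so the sketch below substitutes a self-training loop around a nearest-centroid classifier to illustrate the general idea of letting unlabeled samples influence the decision boundary. This is a related semi-supervised scheme, not the paper's TSVM, and the toy features and topic names are illustrative.

```python
def centroid(points):
    # Componentwise mean of a list of equal-length vectors.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def nearest_class(x, centroids):
    # Class whose centroid has the smallest squared distance to x.
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def self_train(labeled, unlabeled, rounds=3):
    """Exploit unlabeled samples by repeatedly classifying them and
    rebuilding class centroids (a self-training stand-in for TSVM:
    both let unlabeled data move the decision boundary)."""
    data = dict(labeled)  # {class: [vectors]}
    for _ in range(rounds):
        cents = {c: centroid(v) for c, v in data.items()}
        data = {c: list(v) for c, v in labeled.items()}
        for x in unlabeled:
            data[nearest_class(x, cents)].append(x)
    return {c: centroid(v) for c, v in data.items()}

# Two topics in a toy 2-d feature space; unlabeled points pull the centroids.
labeled = {"city_info": [[0.0, 0.0]], "access": [[4.0, 4.0]]}
unlabeled = [[0.5, 0.2], [0.2, 0.6], [3.8, 4.1]]
cents = self_train(labeled, unlabeled)
```

As in the paper's setting, the benefit grows with the unlabeled pool: here the centroids shift toward the unlabeled clusters without any extra labeling effort.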

Collaboration


Dive into Hiromichi Kawanami's collaboration.

Top Co-Authors

Kiyohiro Shikano (Nara Institute of Science and Technology)
Shota Takeuchi (Nara Institute of Science and Technology)
Tobias Cincarek (Nara Institute of Science and Technology)
Rafael Torres (Nara Institute of Science and Technology)
Akinobu Lee (Nagoya Institute of Technology)
Kazuki Adachi (Nara Institute of Science and Technology)