Yoshinori Kitahara
Hitachi
Publication
Featured research published by Yoshinori Kitahara.
international conference on acoustics, speech, and signal processing | 2006
Nobuo Nukaga; Ryota Kamoshida; Kenji Nagamatsu; Yoshinori Kitahara
In this paper we propose two methods for implementing a unit selection-based text-to-speech engine on resource-limited embedded systems. Although unit selection-based text-to-speech technology has improved the quality of synthesized speech, there is a practical trade-off between the size of the database and the quality of the synthesized speech: generating highly natural-sounding voices requires a large database and expensive computation, while the text-to-speech system must still meet the specifications of the target platform. To address this problem, we introduced frequency-based approaches to reduce the size of the speech database. The experimental results showed that the step-by-step downsizing method was better than the direct one in terms of the cumulative join cost and the target cost. Furthermore, several techniques were introduced and evaluated for implementing our text-to-speech engine on an embedded system. The experiments showed that the run-time workload for the test sentences was approximately 80 MIPS and that the implemented engine was useful and scalable for mid-class embedded systems.
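The search the abstract refers to, minimizing a cumulative join cost plus a target cost over candidate units, is the standard dynamic-programming formulation of unit selection. The following is a minimal sketch of that formulation, not the paper's actual engine; the cost functions are placeholders supplied by the caller:

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position minimizing total target + join cost
    via dynamic programming (Viterbi search).

    candidates: list of lists; candidates[i] is the set of database
    units that could realize target position i.
    target_cost(i, u): mismatch between unit u and the target at i.
    join_cost(u, v): cost of concatenating unit u before unit v.
    """
    n = len(candidates)
    # best[i][j] = (cumulative cost, backpointer) for unit j at position i
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(i, u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path from the final position
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Downsizing the database shrinks each `candidates[i]`, which is exactly why it trades synthesis quality (higher cumulative cost) for memory and computation.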
Journal of the Acoustical Society of America | 1988
Yoshinori Kitahara; Yoh'ichi Tohkura
For the purpose of natural and high‐quality speech synthesis, the role of prosody in speech perception has been studied. Prosodic components, which contribute to the expression of emotions and their intensity, were clarified by analyzing emotional speech and by performing listening tests on synthetic speech. It has been confirmed that prosodic components, which are composed of pitch structure, temporal structure, and amplitude structure, contribute more to the expression of emotions than the spectral structure of speech does. Listening test results also showed that the temporal structure was the most important for the expression of anger, while both amplitude structure and pitch structure were much more important for the intensity of anger. Pitch structure also played a significant role in the expression of joy and its intensity. These results suggest the possibility of converting a neutral utterance (i.e., one with no particular emotion) into utterances expressing various kinds of emotions. These results ca...
multimedia signal processing | 1999
Nobuo Hataoka; Hiroaki Kokubo; Nobuo Nukaga; Yasunari Obuchi; Akio Amano; Yoshinori Kitahara
This paper describes speech processing middleware developed on RISC microprocessors for embedded speech applications. The middleware consists of a speech recognition module and a speech synthesis module; in particular, the speech recognition module is robust to environmental noise and speaker differences. The speech middleware provides sophisticated user interfaces for microprocessor-based multimedia systems such as car navigation systems, mobile information equipment, and game machines.
Journal of the Acoustical Society of America | 1988
Yoh'ichi Tohkura; Yoshinori Kitahara
The segmental duration of each phoneme changes depending upon the speaking rate. Generally, vowel segments are more readily compressed in fast speech, or expanded in slow speech, than consonant segments are. Questions raised in this paper include how the speaking rate can be extracted from the speech signal without knowing its content (i.e., phonetic information) and what kind of time‐scale modification can be chosen in order to control speaking rate. First, the segmental duration compressibility of the speech signal was defined by path slopes in DTW spectral matching when utterances at various speaking rates were matched to a reference utterance at a normal speaking rate. On the assumption that the compressibility is inversely proportional to segmental spectrum changes, the relationship between the compressibility and the average cepstral time difference Δcep [S. Furui, IEEE Trans. Acoust. Speech Signal Process. ASSP‐34, 52–59 (1986)] was studied. The results showed that the Δcep is an efficient pa...
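The Δcep measure cited above (Furui's delta cepstrum) is, in its common form, the least-squares regression slope of each cepstral coefficient's trajectory over a short frame window. The following is an illustrative sketch under that assumption, not the paper's exact computation; the window half-width K and the edge padding are choices the abstract does not specify:

```python
import numpy as np

def delta_cepstrum(cep, K=2):
    """Per-frame cepstral time derivative, estimated as a least-squares
    regression slope over a +/-K frame window.
    cep: array of shape (frames, coefficients)."""
    ks = np.arange(-K, K + 1)
    denom = np.sum(ks ** 2)
    # Repeat edge frames so every frame has a full window
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")
    delta = np.zeros_like(cep, dtype=float)
    for t in range(cep.shape[0]):
        window = padded[t:t + 2 * K + 1]   # frames t-K .. t+K
        delta[t] = ks @ window / denom     # regression slope per coefficient
    return delta

def mean_spectral_change(cep, K=2):
    """Scalar rate of spectral change for an utterance: mean Euclidean
    norm of the delta-cepstrum vectors. A slowly changing (highly
    compressible) segment yields a small value, in line with the
    inverse relationship assumed in the abstract."""
    return float(np.linalg.norm(delta_cepstrum(cep, K), axis=1).mean())
```

A stationary cepstral trajectory gives a value of zero, while rapid spectral movement (as in consonant transitions) drives the measure up, which is the behavior that makes it a candidate predictor of duration compressibility.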
Archive | 1993
Haru Ando; Yoshinori Kitahara
Archive | 1993
Yoshinori Kitahara; Takehiro Fujita; Shigeru Yabuuchi; Keiichi Yoshioka
Journal of the Acoustical Society of America | 2005
Atsuko Koizumi; Hiroyuki Kaji; Yasunari Obuchi; Yoshinori Kitahara
Archive | 1992
Seiji Futatsugi; Keiji Kojima; Yoshiki Matsuda; Yoshinori Kitahara; Masato Mogaki
Archive | 2001
Yoshinori Kitahara; Yasunari Obuchi; Atsuko Koizumi; Seiki Mizutani
Archive | 1996
Takashi Hasegawa; Yoshinori Kitahara