Publication


Featured research published by Yasuhiro Hamada.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2015

Emotional speech synthesis system based on a three-layered model using a dimensional approach

Yawen Xue; Yasuhiro Hamada; Masato Akagi

This paper proposes an emotional speech synthesis system based on a three-layered model using a dimensional approach. Most previous studies of emotional speech synthesis with the dimensional approach focused only on the relationship between acoustic features and emotion dimensions (valence and activation). However, people do not perceive emotion directly from acoustic features. Hence, the acoustic features have been particularly difficult to predict, and the affective quality of the synthesized sound falls far from what was intended. The ultimate goal of this research is to improve the accuracy of acoustic-feature estimation and modification rules in order to synthesize affective speech closer to the intended point in the dimensional emotion space. The proposed system is composed of three layers: acoustic features, semantic primitives, and emotion dimensions. A fuzzy inference system (FIS) connects the three layers. The acoustic features related to each semantic primitive are selected for synthesizing the emotional speech, and, on the basis of morphing rules, the estimated acoustic features are applied to synthesize emotional speech. Listening tests were carried out to verify whether the synthesized speech gives the intended impression in the dimensional emotion space. Results show that the accuracy of the estimated acoustic features is raised and that the modification rules work well for the synthesized speech, so the proposed method improves the quality of the synthesized speech.
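No implementation is given in the abstract; the following is a minimal sketch of the three-layered idea, assuming Gaussian membership functions and a zero-order Takagi-Sugeno style inference. The rule parameters, the two example semantic primitives, and the three output feature ratios are hypothetical placeholders, not the authors' trained FIS.

```python
import numpy as np

def gauss(x, c, s):
    """Gaussian membership value of x for fuzzy sets centered at c with width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def fis(inputs, centers, widths, consequents):
    """Zero-order Takagi-Sugeno inference: weighted average of rule consequents.

    inputs      : (d,)   crisp input vector
    centers     : (r, d) membership-function centers, one row per rule
    widths      : (r, d) membership-function widths
    consequents : (r, k) constant output of each rule
    """
    # firing strength of each rule = product of per-dimension memberships
    w = np.prod(gauss(inputs, centers, widths), axis=1)
    w = w / (w.sum() + 1e-12)
    return w @ consequents

# Layer 1 -> 2: emotion dimensions (valence, activation) to semantic primitives
# (say, "bright" and "strong"); all rule parameters below are made up.
va = np.array([0.3, 0.8])                       # target point in V-A space
prim_centers = np.array([[-1.0, 1.0], [1.0, 1.0], [0.0, -1.0]])
prim_widths  = np.full((3, 2), 0.8)
prim_conseq  = np.array([[0.2, 0.9], [0.8, 0.7], [0.4, 0.1]])
primitives = fis(va, prim_centers, prim_widths, prim_conseq)

# Layer 2 -> 3: semantic primitives to acoustic-feature modification ratios
# (say, mean-F0 ratio, power ratio, speech-rate ratio).
feat_centers = np.array([[0.2, 0.2], [0.8, 0.8]])
feat_widths  = np.full((2, 2), 0.5)
feat_conseq  = np.array([[1.0, 1.0, 1.0], [1.3, 1.2, 0.9]])
feature_ratios = fis(primitives, feat_centers, feat_widths, feat_conseq)
print(primitives, feature_ratios)
```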


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2014

A method for emotional speech synthesis based on the position of emotional state in Valence-Activation space

Yasuhiro Hamada; Reda Elbarougy; Masato Akagi

Speech-to-speech translation (S2ST) systems are important for producing a spoken output in one language from a spoken utterance in another. So far, S2ST techniques have mainly used linguistic information without para- and non-linguistic information (emotion, individuality, gender, etc.). These systems therefore have difficulty synthesizing affective speech, for example emotional speech, instead of neutral speech. To deal with affective speech, a system that can recognize and synthesize emotional speech is required. Although most studies have treated emotions categorically, emotional styles are not categorical but spread continuously in an emotion space spanned by two dimensions (valence and activation). This paper proposes a method for synthesizing emotional speech based on positions in the Valence-Activation (V-A) space. To model the relationships between acoustic features and the V-A space, fuzzy inference systems (FISs) were constructed, and twenty-one acoustic features were morphed using the FISs. Listening tests were carried out to verify whether the synthesized speech is perceived at the intended position in the V-A space. The results indicate that the synthesized speech gives the same impression in the V-A space as the intended speech does.
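As a rough illustration of the morphing step (the actual system morphs twenty-one acoustic features with trained FISs), the sketch below applies hypothetical FIS-predicted ratios to an extracted F0 contour and power envelope. The ratio values and the simple resampling used for the rate change are my own simplifications, not the paper's rules.

```python
import numpy as np

# Hypothetical modification ratios for one target point in the V-A space,
# of the kind a trained FIS might output (values made up for illustration).
rules = {"mean_f0": 1.25, "f0_range": 1.4, "power": 1.15, "speech_rate": 0.9}

def morph_prosody(f0, power, hop_s, rules):
    """Apply multiplicative morphing rules to an F0 contour (Hz, 0 = unvoiced)
    and a power envelope, both sampled every hop_s seconds."""
    voiced = f0 > 0
    log_f0 = np.where(voiced, np.log(np.where(voiced, f0, 1.0)), 0.0)
    mean_lf0 = log_f0[voiced].mean()
    # shift the mean F0 and expand the F0 range around the shifted mean
    new_lf0 = mean_lf0 + np.log(rules["mean_f0"]) \
              + rules["f0_range"] * (log_f0 - mean_lf0)
    new_f0 = np.where(voiced, np.exp(new_lf0), 0.0)
    new_power = power * rules["power"]
    # change the speech rate by resampling both contours in time
    # (plain interpolation smears voiced/unvoiced boundaries; fine for a sketch)
    n_out = int(round(len(f0) / rules["speech_rate"]))
    t_in = np.arange(len(f0)) * hop_s
    t_out = np.linspace(0.0, t_in[-1], n_out)
    return np.interp(t_out, t_in, new_f0), np.interp(t_out, t_in, new_power)
```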


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2016

Voice conversion to emotional speech based on three-layered model in dimensional approach and parameterization of dynamic features in prosody

Yawen Xue; Yasuhiro Hamada; Masato Akagi

This paper proposes a system to convert neutral speech to emotional speech with controlled emotional intensity. Most previous research on synthesizing emotional voices used statistical or concatenative methods that synthesize emotions in categorical states such as joy, anger, and sadness. Humans, however, enhance or relieve emotional states and their intensity in daily life, so emotional speech synthesized in categories cannot describe these phenomena precisely. A dimensional approach, which represents an emotion as a point in a dimensional space, can express emotions with continuous intensity. Employing the dimensional approach to describe emotion, we construct a three-layered model to estimate the displacement of the acoustic features of the target emotional speech from those of the source (neutral) speech, and we propose a rule-based conversion method that modifies the acoustic features of the source speech to synthesize the target emotional speech. To convert the source speech freely and easily, we introduce two methods to parameterize dynamic features in prosody: the Fujisaki model for the F0 contour and a target prediction model for the power envelope. Evaluation results show that subjects perceive the intended emotion with a satisfactory ordering of emotional intensity and naturalness. This means that the system can not only synthesize emotional speech by category but also control the degree of emotional intensity in the dimensional space, even within the same emotion category.
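The abstract names the Fujisaki model for the F0 contour; as a reference, here is a small numeric sketch of the standard Fujisaki command-response formulation, ln F0(t) = ln Fb + phrase components + accent components, with typical constants (alpha around 3/s, beta around 20/s, gamma around 0.9). The command timings and magnitudes are made-up examples, not values from the paper.

```python
import numpy as np

def phrase_comp(t, alpha=3.0):
    """Fujisaki phrase component: impulse response of a 2nd-order system."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_comp(t, beta=20.0, gamma=0.9):
    """Fujisaki accent component: clipped step response of a 2nd-order system."""
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase components + sum of accent components.

    phrase_cmds: list of (onset time T0, magnitude Ap)
    accent_cmds: list of (onset T1, offset T2, amplitude Aa)
    """
    lnf0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:
        lnf0 += ap * phrase_comp(t - t0)
    for t1, t2, aa in accent_cmds:
        lnf0 += aa * (accent_comp(t - t1) - accent_comp(t - t2))
    return np.exp(lnf0)

t = np.linspace(0.0, 2.0, 400)          # 2-second contour, illustrative commands
f0 = fujisaki_f0(t, fb=110.0,
                 phrase_cmds=[(0.0, 0.5)],
                 accent_cmds=[(0.3, 0.7, 0.4), (1.1, 1.5, 0.3)])
```

Presumably it is these command magnitudes and timings that the conversion rules adjust; that would be what makes the parameterization convenient for controlling emotional intensity continuously.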


Speech Communication | 2018

Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space

Yawen Xue; Yasuhiro Hamada; Masato Akagi

This paper proposes a rule-based voice conversion system for emotion that converts neutral speech to emotional speech, using a dimensional space (arousal and valence) to control the degree of emotion on a continuous scale. We propose an inverse three-layered model with acoustic features as output at the top layer, semantic primitives at the middle layer, and emotion dimensions as input at the bottom layer; an adaptive fuzzy inference system acts as the connector that extracts the non-linear rules among the three layers. The rules are applied by modifying the acoustic features of neutral speech to create different types of emotional speech. The prosody-related acoustic features, F0 and the power envelope, are parameterized using the Fujisaki model and a target prediction model, respectively. Perceptual evaluation results show that the degree of emotion is perceived well in the dimensional space of valence and arousal.


Journal of the Acoustical Society of America | 2018

Acoustic features in speech for emergency perception

Maori Kobayashi; Yasuhiro Hamada; Masato Akagi

Previous studies have reported that acoustic features such as speech rate, fundamental frequency (F0), amplitude, and voice gender are related to the perception of emergency in speech. However, the most critical factor influencing emergency perception in speech remains unknown. In this study, we compared the influences of three acoustic features (speech rate, F0, and spectral sequence (amplitude)) to determine which has the greatest influence on emergency perception in speech. Prior to the experiments, we selected five speech phrases with different levels of perceived emergency from among various phrases spoken by TV casters during real emergencies. We then created synthesized voices by replacing the three acoustic features separately among the five selected voices. In experiment 1, we presented these synthesized voices to 10 participants and asked them to evaluate the level of perceived emergency of each voice using the magnitude estimation method. The results showed that F0 was the most influential factor in emergency perception. In experiment 2, we examined the influences of the three acoustic features on auditory impressions related to perceived emergency using the semantic differential (SD) method. The results suggested that the emotional effects of some words such as "tense" and/or "rush" were influenced by the fundamental frequency.
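The paper does not state which analysis-synthesis framework was used to replace individual acoustic features; purely as an illustration of that kind of manipulation, the sketch below swaps the F0 contour between two utterances with the WORLD vocoder (via the pyworld package), keeping the source spectral envelope and aperiodicity. The choice of WORLD, the helper name swap_f0, and the simple linear time alignment are my assumptions, not the authors' procedure.

```python
import numpy as np
import soundfile as sf   # assumed available for audio I/O
import pyworld as pw     # WORLD vocoder; one possible tool, not necessarily the authors'

def swap_f0(src_wav, f0_wav, out_wav):
    """Resynthesize src_wav with the F0 contour taken from f0_wav,
    keeping the source spectral envelope and aperiodicity (mono input assumed)."""
    x, fs = sf.read(src_wav)
    y, fs2 = sf.read(f0_wav)
    assert fs == fs2
    x, y = x.astype(np.float64), y.astype(np.float64)

    # analyze the source: F0, spectral envelope, aperiodicity
    f0_x, t_x = pw.harvest(x, fs)
    sp_x = pw.cheaptrick(x, f0_x, t_x, fs)
    ap_x = pw.d4c(x, f0_x, t_x, fs)

    # take the donor F0 and stretch it linearly to the source frame count
    f0_y, _ = pw.harvest(y, fs)
    f0_y = np.interp(np.linspace(0, 1, len(f0_x)),
                     np.linspace(0, 1, len(f0_y)), f0_y)

    out = pw.synthesize(f0_y, sp_x, ap_x, fs)
    sf.write(out_wav, out, fs)
```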


Journal of the Acoustical Society of America | 2016

Emotional voice conversion system for multiple languages based on three-layered model in dimensional space

Yawen Xue; Yasuhiro Hamada; Masato Akagi

This paper proposes a system to convert neutral speech to emotional speech with controlled emotional intensity for multiple languages. Most previous research on synthesizing emotional voices used statistical or concatenative methods that synthesize emotions by category. Humans, however, enhance or relieve emotional states and their intensity in daily life, so emotional speech synthesized in categories is not enough. A dimensional approach, which represents an emotion as a point in a dimensional space, can express emotions with continuous intensity. Employing the dimensional approach, we propose a three-layered model to estimate the displacement of the acoustic features of the target emotional speech from those of the source (neutral) speech and a rule-based conversion method to modify the acoustic features of the source speech to synthesize the target emotional speech. To convert the source speech freely and easily, the Fujisaki model for the F0 contour and a target prediction model for the power envelope ar...


Journal of the Acoustical Society of America | 2016

Singing voice synthesizer with non-filter waveform generation using spectral phase reconstruction

Yasuhiro Hamada; Nobutaka Ono; Shigeki Sagayama

This paper discusses a singing voice synthesis system based on non-filter waveform generation using spectral phase reconstruction as an alternative to the conventional source-filter model. The source-filter model has been considered an essential technique throughout the long history of speech synthesis, as it simulates the human process of speech production consisting of excitation and resonances, even after hidden Markov models were introduced in the 1990s toward statistically trainable speech synthesis. A filter (particularly a recursive filter), however, may cause serious problems of undesired amplitude and long decay times when sharp resonances of the recursive filter coincide with harmonics of the F0 of the excitation. As the ultimate purpose of the filter is to transfer the spectrogram designed by the TTS system to the listener's hearing organs, an alternative solution can be drawn from "phase reconstruction" without using filters. We propose a spectral phase reconstruction, instead of using a filter, to generate...


9th ISCA Speech Synthesis Workshop | 2016

Non-filter waveform generation from cepstrum using spectral phase reconstruction.

Yasuhiro Hamada; Nobutaka Ono; Shigeki Sagayama

This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction as an alternative to the conventional source-filter model in text-to-speech (TTS) systems. Since the primary purpose of the filter is to produce a waveform from the desired spectrum shape, one possible alternative to the source-filter framework is to convert the designed spectrum directly into a waveform using a recently developed "phase reconstruction" from the power spectrogram. Given cepstral features and fundamental frequency (F0) as the desired spectrum from a TTS system, the spectrum to be heard by the listener is calculated by converting the cepstral features into a linear-scale power spectrum and multiplying it with the pitch structure of F0. The signal waveform is then generated from the power spectrogram by spectral phase reconstruction. An advantageous property of the proposed method is that it is free from the undesired amplitude and long decay times often caused by sharp resonances in recursive filters. In preliminary experiments, we compared the temporal and gain characteristics of speech synthesized with the proposed method and with the mel-log spectrum approximation (MLSA) filter. The results show that the proposed method performed better than the MLSA filter in both characteristics and imply desirable properties of the proposed method for speech synthesis.
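The abstract describes the processing chain but not the implementation; the sketch below follows the same chain (cepstrum to linear-scale spectrum, multiplication by a pitch structure, then phase reconstruction), using librosa's Griffin-Lim routine rather than whichever phase-reconstruction algorithm the authors used. The Gaussian-comb pitch_structure, the frame parameters, and the cepstrum order are illustrative choices, not values from the paper.

```python
import numpy as np
import librosa

def cepstrum_to_spectrum(cep, n_fft):
    """Convert one low-order real-cepstrum frame to a linear-amplitude spectrum."""
    c = np.zeros(n_fft)
    c[:len(cep)] = cep
    c[-(len(cep) - 1):] = cep[1:][::-1]          # make the cepstrum symmetric
    log_amp = np.real(np.fft.fft(c))[: n_fft // 2 + 1]
    return np.exp(log_amp)

def pitch_structure(f0, sr, n_fft, bw=80.0):
    """Crude harmonic comb for one frame: Gaussian peaks at multiples of F0;
    unvoiced frames (f0 <= 0) get a flat structure. Purely illustrative."""
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    if f0 <= 0:
        return np.ones_like(freqs)
    comb = np.zeros_like(freqs)
    for h in np.arange(f0, sr / 2, f0):
        comb += np.exp(-0.5 * ((freqs - h) / bw) ** 2)
    return comb + 1e-3

def synthesize(cep_frames, f0_frames, sr=16000, n_fft=1024, hop=80):
    """Build the desired amplitude spectrogram and recover a waveform by
    Griffin-Lim phase reconstruction (no source-filter filtering involved)."""
    spec = np.stack([cepstrum_to_spectrum(c, n_fft) * pitch_structure(f0, sr, n_fft)
                     for c, f0 in zip(cep_frames, f0_frames)], axis=1)
    return librosa.griffinlim(spec, n_iter=60, hop_length=hop, win_length=n_fft)
```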


2016 Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA) | 2016

Voice conversion system to emotional speech in multiple languages based on three-layered model for dimensional space

Yawen Xue; Yasuhiro Hamada; Reda Elbarougy; Masato Akagi

Commonalities and differences in how humans perceive emotions in speech across different languages have been investigated in a dimensional space in previous work. The results show that human perception across languages is identical in the dimensional space: the directions from the neutral voice to other emotional states are common among languages. Based on this result, we assume that, given the same direction in the dimensional space, we can convert neutral voices in multiple languages into emotional ones with the same impression of emotion. This means that the emotion conversion system could work for other languages even if it is trained on a database in a single language. We convert neutral speech in two different languages, English and Chinese, using an emotion conversion system trained on a Japanese database. Chinese is a tone language, English is a stress language, and Japanese is a pitch-accent language. We find that all converted voices convey the same impression as the Japanese voices. In this case, we conclude that, given the same direction in the dimensional space, the synthesized speech in multiple languages conveys the same impression of emotion. In short, the Japanese emotion conversion system is compatible with other languages.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2014

Toward affective speech-to-speech translation: Strategy for emotional speech recognition and synthesis in multiple languages

Masato Akagi; Xiao Han; Reda Elbarougy; Yasuhiro Hamada; Junfeng Li

Collaboration


Dive into Yasuhiro Hamada's collaborations.

Top Co-Authors

Masato Akagi
Japan Advanced Institute of Science and Technology

Yawen Xue
Japan Advanced Institute of Science and Technology

Reda Elbarougy
Japan Advanced Institute of Science and Technology

Nobutaka Ono
National Institute of Informatics

Xiao Han
Japan Advanced Institute of Science and Technology

Junfeng Li
Japan Advanced Institute of Science and Technology

Maori Kobayashi
Japan Advanced Institute of Science and Technology