Yoshinori Shiga | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yoshinori Shiga is active.

Explore More

Publication

Featured researches published by Yoshinori Shiga.

IEEE Journal of Selected Topics in Signal Processing | 2014

Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis

Shinnosuke Takamichi; Tomoki Toda; Yoshinori Shiga; Sakriani Sakti; Graham Neubig; Satoshi Nakamura

In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve quality of synthetic speech, they usually lose flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is iteratively generated from GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is still the same as the traditional framework, the capability of flexibly modeling acoustic features remains. The experimental results demonstrate: (1) the use of approximation with a single Gaussian component sequence yields better synthetic speech quality than the use of EM algorithm in the proposed parameter generation method, (2) the state-based model selection yields quality improvements at the same level as the frame-based model selection, (3) the use of the initial parameters generated from the over-fitted speech probability distributions is very effective to further improve speech quality, and (4) the proposed methods for spectral and F0 components yields significant improvements in synthetic speech quality compared with the traditional HMM-based speech synthesis.

mobile data management | 2013

Multilingual Speech-to-Speech Translation System: VoiceTra

Shigeki Matsuda; Xinhui Hu; Yoshinori Shiga; Hideki Kashioka; Chiori Hori; Keiji Yasuda; Hideo Okuma; Masao Uchiyama; Eiichiro Sumita; Hisashi Kawai; Satoshi Nakamura

This study presents an overview of VoiceTra, which was developed by NICT and released as the worlds first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, its multilingual translation, and its multilingual speech synthesis in regards to field experiments. We show the effects of system updates using the data collected from field experiments to improve our acoustic and language models.

conference of the international speech communication association | 2016

Model Integration for HMM- and DNN-Based Speech Synthesis Using Product-of-Experts Framework.

Kentaro Tachibana; Tomoki Toda; Yoshinori Shiga; Hisashi Kawai

In this paper, we propose a model integration method for hidden Markov model (HMM) and deep neural network (DNN) based acoustic models using a product-of-experts (PoE) framework in statistical parametric speech synthesis. In speech parameter generation, DNN predicts a mean vector of the probability density function of speech parameters frame by frame while keeping its covariance matrix constant over all frames. On the other hand, HMM predicts the covariance matrix as well as the mean vector but they are fixed within the same HMM state, i.e., they can actually vary state by state. To make it possible to predict a better probability density function by leveraging advantages of individual models, the proposed method integrates DNN and HMM as PoE, generating a new probability density function satisfying conditions of both DNN and HMM. Furthermore, we propose a joint optimization method of DNN and HMM within the PoE framework by effectively using additional latent variables. We conducted objective and subjective evaluations, demonstrating that the proposed method significantly outperforms the DNN-based speech synthesis as well as the HMM-based speech synthesis.

conference of the international speech communication association | 2016

Using Zero-Frequency Resonator to Extract Multilingual Intonation Structure.

Jinfu Ni; Yoshinori Shiga; Hisashi Kawai

Human uses expressive intonation to convey linguistic and paralinguistic meaning, especially making focal prominence to give emphasis that highlights the focus of speech. Automatic extraction of dynamic intonation feature from a speech corpus and representing it in a continuous form are desired in multilingual speech synthesis. This paper presents a method to extract dynamic prosodic structure from speech signal using zerofrequency resonator to detect glottal cycle epoch and filter both voice amplitude and fundamental frequency (F0) contours. We choose stable voice F0 segments free from micro-prosodic effect to recover relevant F0 trajectory of an utterance, taking into consideration of inter-correlation of micro-prosody with phonetic segments and syllable structure of the utterance, and further filter out long-term global pitch movements. The method is evaluated by objective tests upon multilingual speech corpora including Chinese, Japanese, Korean, and Myanmar. Our experiment results show that the extracted intonation contour can match F0 contour by conventional approach in very high accuracy and the estimated long-term pitch movements demonstrate regular characteristics of intonation across languages. The proposed method is language-independent and robust to noisy speech.

Journal of the Acoustical Society of America | 2000