
Publications


Featured research published by Yusuke Ijima.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Generative adversarial network-based postfilter for statistical parametric speech synthesis

Takuhiro Kaneko; Hirokazu Kameoka; Nobukatsu Hojo; Yusuke Ijima; Kaoru Hiramatsu; Kunio Kashino

We propose a postfilter based on a generative adversarial network (GAN) to compensate for the differences between natural speech and speech synthesized by statistical parametric speech synthesis. In particular, we focus on the differences caused by over-smoothing, which makes the synthesized speech sound muffled. Over-smoothing occurs in both the time and frequency directions and is highly correlated across them, and conventional methods based on heuristics are too limited to cover all the factors (e.g., global variance was designed only to recover the dynamic range). To solve this problem, we focus on “spectral texture”, i.e., the details of the time-frequency representation, and propose a learning-based postfilter that captures these structures directly from the data. To estimate the true distribution, we utilize a GAN composed of a generator and a discriminator: the generator is optimized to produce samples that imitate the dataset according to the adversarial discriminator. This adversarial process encourages the generator to fit the true data distribution, i.e., to generate realistic spectral texture. Objective evaluations show that the GAN-based postfilter can compensate for detailed spectral structures, including the modulation spectrum, and subjective evaluations show that the speech it generates is comparable to natural speech.
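The core of the method is a standard adversarial game played over time-frequency patches. The PyTorch sketch below illustrates that setup under assumptions of my own: the residual generator, the patch shapes, and all hyperparameters are illustrative stand-ins, not the paper's architecture.

```python
# Hypothetical sketch: adversarial postfilter that maps over-smoothed
# spectrogram patches toward detailed ones. Shapes and hyperparameters
# are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class Postfilter(nn.Module):            # generator G
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(       # 2-D convs over (time, freq)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x):               # x: (B, 1, T, F) over-smoothed
        return x + self.net(x)          # predict residual "texture"

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )
    def forward(self, x):
        return self.net(x)              # real/fake logit per patch

G, D = Postfilter(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

# Dummy batch standing in for (synthetic, natural) spectrogram pairs.
synth = torch.randn(8, 1, 64, 80)
natural = torch.randn(8, 1, 64, 80)

# Discriminator step: natural patches are "real", postfiltered are "fake".
opt_d.zero_grad()
d_loss = bce(D(natural), torch.ones(8, 1)) + \
         bce(D(G(synth).detach()), torch.zeros(8, 1))
d_loss.backward(); opt_d.step()

# Generator step: update the postfilter so its output fools D.
opt_g.zero_grad()
g_loss = bce(D(G(synth)), torch.ones(8, 1))
g_loss.backward(); opt_g.step()
```

GAN postfilters typically also keep a reconstruction term alongside the adversarial loss so outputs stay anchored to the input; how such terms are weighted is a design choice not shown here.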


Speech Communication | 2014

Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis

Yu Maeno; Takashi Nose; Takao Kobayashi; Tomoki Koriyama; Yusuke Ijima; Hideharu Nakajima; Hideyuki Mizuno; Osamu Yoshioka

This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance the prosodic variation of synthetic speech without degrading naturalness. In the proposed technique, HMMs are first trained using conventional labels containing only linguistic information, and prosodic features are generated from these HMMs. The average difference between the original and generated prosodic features for each accent phrase is then calculated and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The created prosodic context label has a practical meaning, such as high or low relative pitch at the phrase level, so users can modify the prosodic characteristics of synthetic speech intuitively by manually changing the proposed labels. In the experiments, we evaluate the proposed technique under both ideal and practical conditions using sales-talk and fairy-tale speech recorded under realistic conditions. Under the practical condition, we evaluate whether users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence.
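The labeling step itself is simple enough to sketch. Here is a minimal numpy version, assuming per-phrase mean log-F0 values have already been extracted from the natural speech and from speech generated by the linguistically labeled HMMs; the threshold separating the three classes is a made-up stand-in, not the paper's criterion.

```python
# Hypothetical sketch of the unsupervised prosodic context labeling:
# classify each accent phrase by the mean log-F0 difference between
# the original speech and speech generated from linguistically labeled
# HMMs. The threshold is illustrative, not the paper's value.

def label_phrases(orig_logf0, gen_logf0, threshold=0.05):
    """orig_logf0 / gen_logf0: per-phrase mean log-F0 values."""
    labels = []
    for o, g in zip(orig_logf0, gen_logf0):
        diff = o - g                    # residual the HMM failed to model
        if diff > threshold:
            labels.append("high")       # original pitch above prediction
        elif diff < -threshold:
            labels.append("low")
        else:
            labels.append("neutral")
    return labels

# Example: three accent phrases of one utterance.
print(label_phrases([5.21, 5.02, 4.90], [5.10, 5.04, 4.97]))
# -> ['high', 'neutral', 'low']
```

The HMMs are then retrained with these labels added to the context set, so the "high"/"low" contexts absorb the prosodic variation the linguistic labels could not explain.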


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

Emotional speech recognition based on style estimation and adaptation with multiple-regression HMM

Yusuke Ijima; Makoto Tachibana; Takashi Nose; Takao Kobayashi

This paper proposes a technique for emotional speech recognition that extracts paralinguistic as well as linguistic information contained in the speech signal. The technique is based on style estimation and style adaptation using a multiple-regression HMM, and the recognition process consists of two stages. In the first stage, a style vector representing the emotional expression category of the input speech and the intensity of its variation is estimated on a sentence-by-sentence basis. In the second stage, the acoustic models are adapted using the estimated style vector and standard HMM-based speech recognition is performed. We assess the performance of the proposed technique on the recognition of acted emotional speech uttered by both professional narrators and non-professional speakers and show its effectiveness.
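In a multiple-regression HMM, each Gaussian mean is an affine function of a low-dimensional style vector, which is what makes the two-stage scheme workable: stage one inverts the regression to estimate the style vector, stage two plugs the estimate back in to adapt the means. A minimal numpy sketch of that idea follows; the dimensions, values, and single-state least-squares estimator are illustrative simplifications, not the paper's estimator.

```python
# Hypothetical sketch of multiple-regression HMM adaptation: each
# Gaussian mean is an affine function mu(s) = H @ s + b of a
# low-dimensional style vector s (e.g., emotion category/intensity).
# Dimensions and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, style_dim = 39, 2            # e.g., MFCC+deltas; 2-D style space

# Per-state regression matrix H and bias b, learned at training time.
H = rng.normal(size=(feat_dim, style_dim))
b = rng.normal(size=feat_dim)

def adapt_mean(style_vec):
    """Stage 2: plug the estimated style vector into the regression."""
    return H @ style_vec + b

def estimate_style(obs_mean):
    """Stage 1 (crude single-state stand-in): least-squares style
    estimate from the average observation, inverting mu(s) = H s + b."""
    s, *_ = np.linalg.lstsq(H, obs_mean - b, rcond=None)
    return s

true_style = np.array([0.8, -0.3])     # e.g., strong joy, slightly calm
obs = adapt_mean(true_style) + 0.01 * rng.normal(size=feat_dim)
s_hat = estimate_style(obs)
print(s_hat)                            # close to [0.8, -0.3]
adapted_mu = adapt_mean(s_hat)          # means used for recognition
```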


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

An Investigation of DNN-Based Speech Synthesis Using Speaker Codes

Nobukatsu Hojo; Yusuke Ijima; Hideyuki Mizuno

Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains an open problem whether synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple way to improve on the conventional speaker-dependent DNN-based method. To model speaker variation in the DNN, an augmented feature (the speaker code) is fed to one or more hidden layers of the conventional DNN, and the connection weights of the whole DNN are trained on a multi-speaker speech corpus. When synthesizing a speech parameter sequence, a target speaker is chosen from the corpus and the corresponding speaker code is fed to the DNN to generate that speaker's voice. We investigated the relationship between prediction performance and DNN architecture by changing the hidden layer to which the speaker codes are input. Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set optimally for the amount of training data available for the selected target speaker.
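The architectural idea is easy to make concrete: concatenate a speaker code onto the input of one hidden layer and train everything jointly. A PyTorch sketch under my own assumptions; layer sizes, the one-hot code, and the injection point are illustrative, not the paper's configuration.

```python
# Hypothetical sketch: acoustic-model DNN with a one-hot speaker code
# injected at a chosen hidden layer. Sizes and injection point are
# illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class SpeakerCodeDNN(nn.Module):
    def __init__(self, ling_dim=300, code_dim=10, acoust_dim=187,
                 hidden=256, inject_at=1, n_layers=3):
        super().__init__()
        self.inject_at = inject_at
        layers, in_dim = [], ling_dim
        for i in range(n_layers):
            if i == inject_at:
                in_dim += code_dim      # widen to accept the speaker code
            layers.append(nn.Linear(in_dim, hidden))
            in_dim = hidden
        self.hidden_layers = nn.ModuleList(layers)
        self.out = nn.Linear(hidden, acoust_dim)

    def forward(self, ling_feats, speaker_code):
        h = ling_feats
        for i, layer in enumerate(self.hidden_layers):
            if i == self.inject_at:
                h = torch.cat([h, speaker_code], dim=-1)
            h = torch.tanh(layer(h))
        return self.out(h)              # acoustic features per frame

model = SpeakerCodeDNN()
ling = torch.randn(4, 300)              # linguistic features per frame
code = nn.functional.one_hot(torch.tensor([3, 3, 7, 7]),
                             num_classes=10).float()
print(model(ling, code).shape)          # torch.Size([4, 187])
```

Moving `inject_at` toward the input or output changes where speaker identity enters the computation, which is the architectural axis the paper's experiments vary.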


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

HMM-based expressive speech synthesis based on phrase-level F0 context labeling

Yu Maeno; Takashi Nose; Takao Kobayashi; Tomoki Koriyama; Yusuke Ijima; Hideharu Nakajima; Hideyuki Mizuno; Osamu Yoshioka

This paper proposes a technique for adding more prosodic variation to synthetic speech in HMM-based expressive speech synthesis. We create novel phrase-level F0 context labels from the residual F0 information between original and synthetic speech for the training data. Specifically, we classify the difference of the average log F0 values between the original and synthetic speech into three classes with perceptual meanings, i.e., high, neutral, and low relative pitch at the phrase level. We evaluate both ideal and practical cases using appealing and fairy-tale speech recorded under realistic conditions. In the ideal case, we examine the potential of our technique to modify F0 patterns when the original F0 contours of the test sentences are known. In the practical case, we show how users can intuitively modify the pitch by changing the initial F0 context labels obtained from the input text.
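At synthesis time, the practical case boils down to letting a user overwrite the initial F0 context label of an accent phrase before the labels reach the synthesizer. A toy sketch of that interaction; the label structure and example phrases are invented for illustration.

```python
# Hypothetical sketch of the practical use case: a user overrides the
# F0 context label of one accent phrase before synthesis. The label
# structure is invented for illustration.

def modify_phrase_label(phrase_labels, phrase_idx, new_f0_context):
    """phrase_labels: one dict per accent phrase, holding linguistic
    context plus the proposed phrase-level F0 context."""
    assert new_f0_context in ("low", "neutral", "high")
    phrase_labels = [dict(p) for p in phrase_labels]  # copy, don't mutate
    phrase_labels[phrase_idx]["f0_context"] = new_f0_context
    return phrase_labels

# Initial labels predicted from the input text.
labels = [
    {"text": "kono", "f0_context": "neutral"},
    {"text": "shinseihin", "f0_context": "neutral"},
    {"text": "wa", "f0_context": "neutral"},
]
# Raise the relative pitch of the second phrase for emphasis.
edited = modify_phrase_label(labels, 1, "high")
print(edited[1])   # {'text': 'shinseihin', 'f0_context': 'high'}
```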


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Objective Evaluation Using Association Between Dimensions Within Spectral Features for Statistical Parametric Speech Synthesis

Yusuke Ijima; Taichi Asami; Hideyuki Mizuno

This paper presents a novel objective evaluation technique for statistical parametric speech synthesis. Its key feature is that it focuses on the association between dimensions within the spectral features. We first use the maximal information coefficient to analyze the relationship between subjective scores and the associations within spectral features obtained from natural speech and various types of synthesized speech. The analysis indicates that subjective scores improve as the association becomes weaker. We then describe the proposed objective evaluation technique, which uses a voice conversion method to detect the associations within spectral features. We perform subjective and objective experiments to investigate the relationship between subjective and objective scores, comparing the proposed objective scores with mel-cepstral distortion. The results indicate that our objective scores achieve dramatically higher correlation with subjective scores than mel-cepstral distortion.
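One plausible way to compute the association measure, using the maximal information coefficient the abstract names, is via the minepy package's MINE estimator. In this sketch, averaging pairwise MIC over dimension pairs is my reading, and the estimator parameters shown are minepy defaults rather than the paper's settings.

```python
# Hypothetical sketch: quantify association between spectral-feature
# dimensions with the maximal information coefficient (MIC), here via
# the minepy package. Averaging pairwise MIC over dimension pairs as
# the association score is an assumption, not the paper's exact recipe.
import numpy as np
from minepy import MINE   # pip install minepy

def mean_pairwise_mic(feats, alpha=0.6, c=15):
    """feats: (n_frames, n_dims) mel-cepstral coefficients."""
    mine = MINE(alpha=alpha, c=c)       # minepy default parameters
    n_dims = feats.shape[1]
    scores = []
    for i in range(n_dims):
        for j in range(i + 1, n_dims):
            mine.compute_score(feats[:, i], feats[:, j])
            scores.append(mine.mic())
    return float(np.mean(scores))

rng = np.random.default_rng(0)
natural = rng.normal(size=(500, 10))            # weakly associated dims
mixed = natural @ rng.normal(size=(10, 10))     # mixing adds association
print(mean_pairwise_mic(natural), mean_pairwise_mic(mixed))
# Expect the mixed features to score higher, matching the finding
# that weaker inter-dimension association tracks better quality.
```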


Conference of the International Speech Communication Association (INTERSPEECH) | 2011

HMM-Based Emphatic Speech Synthesis Using Unsupervised Context Labeling

Yu Maeno; Takashi Nose; Takao Kobayashi; Yusuke Ijima; Hideharu Nakajima; Hideyuki Mizuno; Osamu Yoshioka


Speech Synthesis Workshop (SSW) | 2013

Statistical model training technique for speech synthesis based on speaker class

Yusuke Ijima; Noboru Miyazaki; Hideyuki Mizuno


Conference of the International Speech Communication Association (INTERSPEECH) | 2011

Correlation Analysis of Acoustic Features with Perceptual Voice Quality Similarity for Similar Speaker Selection

Yusuke Ijima; Mitsuaki Isogai; Hideyuki Mizuno


Archive | 2011

Dialogue-type information transmission device, dialogue-type information transmission method, and program

Narihisa Nomoto; Yusuke Ijima; Osamu Yoshioka; Katsuhiko Ogawa

Collaboration


Yusuke Ijima's top co-authors:

Takao Kobayashi (Tokyo Institute of Technology)
Makoto Tachibana (National Institute for Materials Science)
Taichi Asami (Tokyo Institute of Technology)
Yu Maeno (Tokyo Institute of Technology)
Hideyuki Mizuno (Nippon Telegraph and Telephone)