Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Shinnosuke Takamichi is active.

Publications


Featured research published by Shinnosuke Takamichi.


International Conference on Acoustics, Speech, and Signal Processing | 2014

A postfilter to modify the modulation spectrum in HMM-based speech synthesis

Shinnosuke Takamichi; Tomoki Toda; Graham Neubig; Sakriani Sakti; Satoshi Nakamura

In this paper, we propose a postfilter to compensate the modulation spectrum in HMM-based speech synthesis. To alleviate the over-smoothing effect, which is a main cause of quality degradation in HMM-based speech synthesis, it is necessary to consider features that can capture over-smoothing. The Global Variance (GV) is one well-known example of such a feature, and the effectiveness of a parameter generation algorithm considering the GV has been confirmed. However, the quality gap between natural and synthetic speech is still large. In this paper, we introduce the Modulation Spectrum (MS) of the speech parameter trajectory as a new feature that effectively captures the over-smoothing effect, and we propose a postfilter based on the MS. The MS is represented as the power spectrum of the parameter trajectory. The generated speech parameter sequence is filtered so that its MS exhibits a pattern similar to that of natural speech. Experimental results show quality improvements when the proposed methods are applied to the spectral and F0 components, compared with conventional methods considering the GV.
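
As a rough illustration of the core idea, the MS of a trajectory can be computed directly as the power spectrum of the mean-removed parameter sequence; an over-smoothed trajectory then shows attenuated power at higher modulation frequencies. The minimal Python sketch below uses an assumed FFT length and synthetic toy trajectories, not the paper's implementation.

```python
# A minimal sketch of the modulation spectrum (MS) described above: the
# power spectrum of a speech parameter trajectory. The FFT length and the
# toy trajectories are illustrative assumptions, not the paper's setup.
import numpy as np

def modulation_spectrum(trajectory, fft_len=512):
    """Power spectrum of a 1-D parameter trajectory (e.g., one
    mel-cepstral coefficient over time), zero-padded to fft_len."""
    traj = np.asarray(trajectory, dtype=float)
    traj = traj - traj.mean()   # remove DC so the MS reflects temporal fluctuation
    return np.abs(np.fft.rfft(traj, n=fft_len)) ** 2

# An over-smoothed trajectory loses power at higher modulation frequencies.
t = np.arange(200)
natural = np.sin(0.3 * t) + 0.5 * np.sin(1.2 * t)
smoothed = np.convolve(natural, np.ones(9) / 9, mode="same")  # mimic over-smoothing
print(modulation_spectrum(natural)[90:110].sum())
print(modulation_spectrum(smoothed)[90:110].sum())  # much smaller: "over-smoothed"
```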


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Postfilters to modify the modulation spectrum for statistical parametric speech synthesis

Shinnosuke Takamichi; Tomoki Toda; Alan W. Black; Graham Neubig; Sakriani Sakti; Satoshi Nakamura

This paper presents novel approaches based on the modulation spectrum (MS) for high-quality statistical parametric speech synthesis, including text-to-speech (TTS) and voice conversion (VC). Although statistical parametric speech synthesis offers various advantages over concatenative speech synthesis, the synthetic speech quality is still not as good as that of concatenative synthesis or natural speech. One of the biggest issues causing this quality degradation is the over-smoothing effect often observed in the generated speech parameter trajectories. The global variance (GV) is known as a feature well correlated with the over-smoothing effect, and the effectiveness of keeping the GV of the generated speech parameter trajectories similar to that of natural speech has been confirmed. However, the quality gap between natural and synthetic speech is still large. In this paper, we propose using the MS of the generated speech parameter trajectories as a new feature to effectively quantify the over-smoothing effect. Moreover, we propose postfilters that modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech. The proposed postfilters are applicable to various synthesizers based on statistical parametric speech synthesis. We first evaluate the proposed method in the framework of hidden Markov model (HMM)-based TTS, examining its properties from different perspectives. Furthermore, the effectiveness of the proposed postfilters is also evaluated in Gaussian mixture model (GMM)-based VC and classification and regression tree (CART)-based TTS (a.k.a. CLUSTERGEN). The experimental results demonstrate that 1) the proposed utterance-level postfilter achieves quality comparable to the conventional generation algorithm considering the GV, and yields significant improvements when applied on top of the GV-based generation algorithm in HMM-based TTS, 2) the proposed segment-level postfilter, which is capable of low-delay synthesis, also yields significant improvements in synthetic speech quality, and 3) the proposed postfilters are effective not only in HMM-based TTS but also in GMM-based VC and CLUSTERGEN.
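
The utterance-level postfilter can be sketched as a per-bin rescaling of the generated trajectory's MS toward natural-speech statistics. In the hedged sketch below, `natural_ms_mean` (per-bin natural MS statistics), the emphasis coefficient `k`, and the FFT length are illustrative assumptions; the interpolation is a simple log-domain stand-in for the paper's exact formulation.

```python
# A hedged sketch of the utterance-level MS postfilter: move each
# modulation-frequency bin of the generated trajectory toward natural-speech
# MS statistics via log-domain interpolation. `natural_ms_mean`, k, and the
# FFT length are illustrative assumptions.
import numpy as np

def ms_postfilter(generated, natural_ms_mean, k=0.85, fft_len=512):
    """Filter one generated 1-D parameter trajectory utterance by utterance."""
    traj = np.asarray(generated, dtype=float)
    mean = traj.mean()
    spec = np.fft.rfft(traj - mean, n=fft_len)
    gen_ms = np.abs(spec) ** 2 + 1e-12
    # log MS_filtered = (1 - k) * log MS_generated + k * log MS_natural
    target_ms = gen_ms ** (1.0 - k) * (np.asarray(natural_ms_mean) + 1e-12) ** k
    gain = np.sqrt(target_ms / gen_ms)           # magnitude gain, phase kept
    return np.fft.irfft(spec * gain, n=fft_len)[:len(traj)] + mean
```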


IEEE Journal of Selected Topics in Signal Processing | 2014

Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis

Shinnosuke Takamichi; Tomoki Toda; Yoshinori Shiga; Sakriani Sakti; Graham Neubig; Satoshi Nakamura

In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach, but the speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is then iteratively generated from the GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is the same as the traditional one, the capability of flexibly modeling acoustic features remains. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method, (2) state-based model selection yields quality improvements at the same level as frame-based model selection, (3) using initial parameters generated from the over-fitted probability distributions is very effective in further improving speech quality, and (4) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.
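
The synthesis step relies on standard maximum-likelihood parameter generation (MLPG) with dynamic features. Below is a minimal single-dimension sketch of that generic solve, (W' P W) c = W' P mu with diagonal precision P, not of the rich-context model selection itself; the delta window and array shapes are assumptions.

```python
# A minimal sketch of generic MLPG for one feature dimension: solve the
# normal equations so the smooth static trajectory is consistent with both
# static and delta statistics. The delta window is an assumed choice.
import numpy as np

def mlpg(means, variances):
    """means, variances: (T, 2) static and delta statistics per frame,
    e.g., taken from the selected Gaussian components. Returns the smooth
    static trajectory c that maximizes the likelihood."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))            # stacks [static; delta] rows, o = W c
    for t in range(T):
        W[2 * t, t] = 1.0               # static feature
        if 0 < t < T - 1:               # delta = 0.5 * (c[t+1] - c[t-1])
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)  # diagonal precision entries
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)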


International Conference on Acoustics, Speech, and Signal Processing | 2015

Modulation spectrum-constrained trajectory training algorithm for GMM-based Voice Conversion

Shinnosuke Takamichi; Tomoki Toda; Alan W. Black; Satoshi Nakamura

This paper presents a novel training algorithm for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). One of the advantages of GMM-based VC is its computationally efficient conversion processing, which enables real-time VC applications. On the other hand, the quality of the converted speech is still significantly worse than that of natural speech. To address this problem while preserving computationally efficient conversion, the proposed training method 1) uses a consistent optimization criterion between training and conversion and 2) compensates the Modulation Spectrum (MS) of the converted parameter trajectory, a feature sensitively correlated with the over-smoothing effects that cause quality degradation of the converted speech. The experimental results demonstrate that the proposed algorithm yields significant improvements in terms of both converted speech quality and conversion accuracy for speaker individuality compared to the basic training algorithm.
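
The kind of combined objective the paper motivates can be sketched as a trajectory-error term plus an MS penalty shared by training and conversion. In this hedged sketch, the squared-error stand-ins and the weight `alpha` are illustrative assumptions, not the paper's derivation.

```python
# A sketch of a combined objective: trajectory error plus a penalty on the
# distance between the converted trajectory's log-MS and natural statistics.
import numpy as np

def log_ms(traj, fft_len=512):
    spec = np.fft.rfft(np.asarray(traj, float) - np.mean(traj), n=fft_len)
    return np.log(np.abs(spec) ** 2 + 1e-12)

def ms_constrained_loss(converted, target, natural_log_ms, alpha=0.1):
    trajectory_term = np.mean((np.asarray(converted) - np.asarray(target)) ** 2)
    ms_term = np.mean((log_ms(converted) - natural_log_ms) ** 2)
    return trajectory_term + alpha * ms_term  # one criterion for training and conversion
```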


International Conference on Acoustics, Speech, and Signal Processing | 2015

Parameter generation algorithm considering Modulation Spectrum for HMM-based speech synthesis

Shinnosuke Takamichi; Tomoki Toda; Alan W. Black; Satoshi Nakamura

This paper proposes a novel parameter generation algorithm for high-quality speech generation in Hidden Markov Model (HMM)-based speech synthesis. One of the biggest issues causing significant quality degradation is the over-smoothing effect often observed in generated parameter trajectories. The Global Variance (GV) is known as a feature well correlated with the over-smoothing effect, and a metric on the GV of the generated parameters is effectively used as a penalty term in conventional parameter generation. However, the quality of the synthetic speech is still far from that of natural speech. Recently, we have found that the Modulation Spectrum (MS) of the generated parameters, which can also be regarded as an extension of the GV, is more sensitively correlated with the over-smoothing effect than the GV. This paper incorporates a metric on the MS as a new penalty term in the proposed parameter generation algorithm. The experimental results demonstrate that the proposed parameter generation algorithm considering the MS yields significant improvements in synthetic speech quality compared to the conventional algorithm considering the GV.
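
One way to picture generation with an MS penalty is as penalized optimization: start from the conventional ML-generated trajectory and refine it so its MS approaches natural statistics. The optimizer, the stand-in likelihood term, and the weight `w` below are assumptions; the paper derives its own update rule.

```python
# A minimal sketch of parameter generation with an MS penalty term, framed
# as generic numerical optimization rather than the paper's derivation.
import numpy as np
from scipy.optimize import minimize

def generate_with_ms_penalty(ml_trajectory, precisions, natural_log_ms, w=0.05):
    ml_trajectory = np.asarray(ml_trajectory, dtype=float)

    def log_ms(c):
        spec = np.fft.rfft(c - c.mean(), n=512)
        return np.log(np.abs(spec) ** 2 + 1e-12)

    def objective(c):
        # Likelihood stand-in: stay near the ML solution, weighted by precision.
        lik = np.sum(precisions * (c - ml_trajectory) ** 2)
        pen = np.sum((log_ms(c) - natural_log_ms) ** 2)  # MS penalty term
        return lik + w * pen

    return minimize(objective, ml_trajectory, method="L-BFGS-B").x
```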


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Modulation spectrum-based post-filter for GMM-based Voice Conversion

Shinnosuke Takamichi; Tomoki Toda; Alan W. Black; Satoshi Nakamura

This paper addresses the over-smoothing effect in Gaussian Mixture Model (GMM)-based Voice Conversion (VC). The flexibility of the statistical approach is one of the major reasons why it is widely applied to speech-based systems. However, quality degradation caused by over-smoothed converted speech parameters is an unavoidable problem of statistical modeling. One common approach to this over-smoothing in the conversion step is to compensate features of the generated parameters, such as the Global Variance (GV), that explicitly express the over-smoothing effect. In statistical Text-To-Speech (TTS) synthesis, we have recently introduced the Modulation Spectrum (MS), an extended form of the GV, and have proposed an MS-based Post-Filter (MSPF) for Hidden Markov Model (HMM)-based TTS synthesis. In this paper, we apply the MSPF to GMM-based VC. Because the MS of the speech parameters is degraded through the GMM-based conversion process, the post-filter modifies the MS of the converted parameters. The experimental evaluation demonstrates the quality benefits of the proposed post-filter.
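
The premise that conversion degrades the MS can be checked directly by comparing per-bin log-MS of converted and natural parameters; a large negative gap at high modulation-frequency bins is the over-smoothing the post-filter corrects. Names and the FFT length in this small sketch are illustrative assumptions.

```python
# A small sketch quantifying MS degradation through conversion.
import numpy as np

def log_ms(traj, fft_len=256):
    spec = np.fft.rfft(np.asarray(traj, float) - np.mean(traj), n=fft_len)
    return np.log(np.abs(spec) ** 2 + 1e-12)

def ms_degradation(converted, natural):
    """Per-bin log-MS gap between converted and natural parameters."""
    return log_ms(converted) - log_ms(natural)
```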


IEEE Global Conference on Signal and Information Processing | 2014

Modified post-filter to recover modulation spectrum for HMM-based speech synthesis

Shinnosuke Takamichi; Tomoki Toda; Alan W. Black; Satoshi Nakamura

This paper proposes modified post-filters to recover the Modulation Spectrum (MS) in HMM-based speech synthesis. To alleviate the over-smoothing effect, which is one of the major problems in HMM-based speech synthesis, the MS-based post-filter has been proposed. It recovers the utterance-level MS of the generated speech trajectory, and we have reported its benefit to quality improvement. However, this post-filter is not applicable to speech parameter trajectories of various lengths, such as phrases or segments, which are shorter than an utterance. To address this problem, we propose two modified post-filters: (1) a time-invariant filter with a simplified conversion form and (2) a segment-level post-filter applicable to short-term parameter sequences. Furthermore, we also propose (3) a post-filter to recover the phoneme-level MS of HMM-state durations. Experimental results show that the modified post-filters yield significant quality improvements in synthetic speech, comparable to those of the conventional post-filter.
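
A segment-level variant can be sketched by applying the MS correction to short overlapping windows, so filtering never needs a whole utterance (low delay). The segment length, the half-overlap averaging, the emphasis coefficient `k`, and `natural_seg_ms` (per-bin natural MS statistics for this segment length) are all illustrative assumptions.

```python
# A hedged sketch of segment-level MS post-filtering over short windows.
import numpy as np

def segment_ms_postfilter(traj, natural_seg_ms, seg_len=25, k=0.85):
    traj = np.asarray(traj, dtype=float)
    out = np.zeros_like(traj)
    counts = np.zeros_like(traj)
    for start in range(0, len(traj) - seg_len + 1, seg_len // 2):
        seg = traj[start:start + seg_len]
        mean = seg.mean()
        spec = np.fft.rfft(seg - mean)
        ms = np.abs(spec) ** 2 + 1e-12
        target = ms ** (1 - k) * (np.asarray(natural_seg_ms) + 1e-12) ** k
        filtered = np.fft.irfft(spec * np.sqrt(target / ms), n=seg_len) + mean
        out[start:start + seg_len] += filtered   # overlap-add, averaged below
        counts[start:start + seg_len] += 1.0
    return np.where(counts > 0, out / np.maximum(counts, 1.0), traj)
```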


Conference of the International Speech Communication Association | 2016

The NU-NAIST Voice Conversion System for the Voice Conversion Challenge 2016.

Kazuhiro Kobayashi; Shinnosuke Takamichi; Satoshi Nakamura; Tomoki Toda

This paper presents the NU-NAIST voice conversion (VC) system for the Voice Conversion Challenge 2016 (VCC 2016), developed by a joint team of Nagoya University and the Nara Institute of Science and Technology. Statistical VC based on a Gaussian mixture model makes it possible to convert the speaker identity of a source speaker's voice into that of a target speaker by converting several speech parameters. However, various factors such as parameterization errors and over-smoothing effects usually cause speech quality degradation in the converted voice. To address this issue, we have proposed a direct waveform modification technique based on spectral differential filtering and have successfully applied it to singing voice conversion, where excitation features do not necessarily need to be converted. In this paper, we propose a method to apply this technique to a standard voice conversion task, where excitation feature conversion is needed. The results of VCC 2016 demonstrate that the NU-NAIST VC system developed with the proposed method yields the best conversion accuracy for speaker identity (a correct rate of more than 70%) and quite high naturalness (a mean opinion score above 3). This paper presents a detailed description of the NU-NAIST VC system and additional results of its performance evaluation.
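
The intuition behind direct waveform modification is to filter the source waveform itself rather than vocode converted parameters. The conceptual sketch below applies a per-frame target/source spectral-envelope ratio (a "spectral differential") in the STFT domain; the STFT settings and envelope inputs are assumptions, and the actual system instead filters the waveform with a mel-cepstral differential synthesis filter.

```python
# A conceptual STFT-domain stand-in for spectral differential filtering.
import numpy as np
from scipy.signal import stft, istft

def filter_with_spectral_differential(wave, src_env, tgt_env, fs=16000,
                                      nperseg=1024):
    """src_env, tgt_env: (frames, nperseg // 2 + 1) spectral envelopes
    aligned to the STFT frames of `wave`."""
    _, _, Z = stft(wave, fs=fs, nperseg=nperseg)
    n = min(Z.shape[1], src_env.shape[0])
    diff = tgt_env[:n].T / np.maximum(src_env[:n].T, 1e-12)  # (bins, n)
    Z[:, :n] *= diff                       # reshape source toward target envelope
    _, out = istft(Z, fs=fs, nperseg=nperseg)
    return out
```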


Machine Translation | 2018

An end-to-end model for cross-lingual transformation of paralinguistic information

Takatomo Kano; Shinnosuke Takamichi; Sakriani Sakti; Graham Neubig; Tomoki Toda; Satoshi Nakamura

Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation, and text-to-speech synthesis components, which share information only at the text level. However, spoken communication differs from written communication in that it uses rich acoustic cues, such as prosody, to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is a system that allows users to speak a foreign language with the same expressiveness as if they were speaking their own language. Our method works by reconstructing input acoustic features in the target language. Among the many possible paralinguistic features, in this paper we choose duration and power as a first step, proposing a method that can translate these features from the input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
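
The linear-regression variant can be sketched as ordinary least squares from fixed-length source-language duration/power vectors to target-language ones. The toy shapes and random data below are assumptions purely for illustration; the paper aligns real features through the translation model.

```python
# A minimal least-squares sketch of mapping source-language paralinguistic
# features (duration, power) to the target language.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                  # 100 sentences, 8 source features
Y = X @ rng.normal(size=(8, 6)) + 0.1 * rng.normal(size=(100, 6))  # toy targets

Xb = np.hstack([X, np.ones((100, 1))])         # add a bias column
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)     # closed-form least squares

x_new = rng.normal(size=8)
y_pred = np.append(x_new, 1.0) @ W             # predicted target duration/power
print(y_pred.shape)                            # (6,)
```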


IWSLT | 2012

A Method for Translation of Paralinguistic Information

Takatomo Kano; Sakriani Sakti; Shinnosuke Takamichi; Graham Neubig; Tomoki Toda; Satoshi Nakamura

Collaboration


Dive into Shinnosuke Takamichi's collaboration.

Top Co-Authors

Satoshi Nakamura (Nara Institute of Science and Technology)
Sakriani Sakti (Nara Institute of Science and Technology)
Graham Neubig (Carnegie Mellon University)
Alan W. Black (Carnegie Mellon University)
Takatomo Kano (Nara Institute of Science and Technology)
Daichi Kitamura (Graduate University for Advanced Studies)