Publication


Featured research published by Masami Akamine.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Complex cepstrum as phase information in statistical parametric speech synthesis

Ranniery Maia; Masami Akamine; Mark J. F. Gales

Statistical parametric synthesizers usually rely on a simplified model of speech production where a minimum-phase filter is driven by a zero or random phase excitation signal. However, this procedure does not take into account the natural mixed-phase characteristics of the speech signal. This paper addresses this issue by proposing the use of the complex cepstrum for modeling phase information in statistical parametric speech synthesizers. Here a frame-based complex cepstrum is calculated through the interpolation of pitch-synchronous magnitude and unwrapped phase spectra. The noncausal part of the frame-based complex cepstrum is then modeled as phase features in the statistical parametric synthesizer. At synthesis time, the generated phase parameters are used to derive coefficients of a glottal filter. Experimental results show that the proposed approach effectively embeds phase information in the synthetic speech, resulting in close-to-natural waveforms and better speech quality.
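
As a rough illustration of the analysis step, the sketch below computes a single frame's complex cepstrum from the log-magnitude and unwrapped phase spectra and splits off its noncausal part. It is a minimal stand-in for the paper's pitch-synchronous interpolation procedure, written in Python with numpy; the function name and parameter values are hypothetical.

```python
import numpy as np

def complex_cepstrum_frame(frame, n_fft=1024, n_ceps=40):
    # Sketch only: the paper interpolates pitch-synchronous magnitude and
    # unwrapped phase spectra, whereas this works on one windowed frame.
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    phase = np.unwrap(np.angle(spectrum))
    # Remove the linear phase trend so the complex cepstrum is well defined.
    k = np.arange(n_fft)
    ndelay = round(phase[n_fft // 2] / np.pi)
    phase = phase - np.pi * ndelay * k / (n_fft // 2)
    ceps = np.fft.ifft(log_mag + 1j * phase).real
    causal = ceps[:n_ceps]        # minimum-phase part (spectral envelope)
    noncausal = ceps[-n_ceps:]    # negative quefrencies: the phase features
    return causal, noncausal
```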


International Conference on Acoustics, Speech, and Signal Processing | 2011

Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?

Javier Latorre; Mark J. F. Gales; Sabine Buchholz; Katherine Mary Knill; Masatsune Tamura; Yamato Ohtani; Masami Akamine

Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal which is used for the generation of the source-excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
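
The contrast between a hard switch and the soft decision can be made concrete with a small sketch (Python with numpy; function names are hypothetical): the F0 track is interpolated through unvoiced frames so it never breaks, and the aperiodicity acts as a frame-level mixing weight.

```python
import numpy as np

def continuous_f0(f0, voiced):
    # Interpolate log-F0 through unvoiced frames: no discontinuous track.
    t = np.arange(len(f0))
    return np.exp(np.interp(t, t[voiced], np.log(f0[voiced])))

def soft_mixed_excitation(pulse, noise, aperiodicity):
    # Aperiodicity in [0, 1] is the soft voiced/unvoiced weight; no hard
    # voiced/unvoiced switch is ever taken.
    return np.sqrt(1.0 - aperiodicity) * pulse + np.sqrt(aperiodicity) * noise
```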


Speech Communication | 2013

Complex cepstrum for statistical parametric speech synthesis

Ranniery Maia; Masami Akamine; Mark J. F. Gales

Highlights:
- Complex cepstrum is applied to statistical parametric speech synthesis.
- At synthesis time, phase features derived from the allpass component of the complex cepstrum are used to implement a glottal pulse filter.
- Experimental results show that the addition of the phase features results in better synthetic speech quality.

Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero or random phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters used to control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted on a frame-basis and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature on both female and male voices.
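
On the synthesis side, the generated anti-causal cepstral coefficients define a maximum-phase component that shapes the glottal pulse. A minimal numpy sketch of that reconstruction, with a hypothetical function name and no claim to match the authors' exact filter implementation:

```python
import numpy as np

def glottal_pulse_from_phase_features(noncausal_ceps, n_fft=1024):
    # Place the coefficients at negative quefrencies, map the cepstrum to
    # the log spectrum via an FFT, exponentiate, and invert.
    c = np.zeros(n_fft)
    c[n_fft - len(noncausal_ceps):] = noncausal_ceps
    H = np.exp(np.fft.fft(c))
    return np.fft.ifft(H).real
```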


International Conference on Acoustics, Speech, and Signal Processing | 1999

CELP speech coding based on an adaptive pulse position codebook

Tadashi Amada; Kimio Miseki; Masami Akamine

CELP coders that use pulse codebooks for excitation, such as ACELP, have the advantages of low complexity and high speech quality. At low bit rates, however, the reduced number of pulse position candidates and pulses degrades reconstructed speech quality. This paper describes a method for adaptively allocating pulse position candidates. In the proposed method, N efficient pulse position candidates are selected out of all possible positions in a subframe, using the amplitude envelope of the adaptive code vector: the larger the amplitude, the more pulse positions are assigned. Because the adaptive code vector is also available at the decoder, the adaptation requires no additional bits. Experimental results show that the proposed method increases WSNRseg by 0.3 dB and MOS by 0.15.
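
A compact sketch of the selection rule described above (Python with numpy; names are illustrative): positions where the adaptive code vector's amplitude envelope is largest become the pulse position candidates, and since the decoder holds the same adaptive code vector, the selection costs no side information.

```python
import numpy as np

def adaptive_pulse_positions(adaptive_codevector, n_candidates):
    # Larger envelope amplitude -> position kept as a pulse candidate.
    envelope = np.abs(adaptive_codevector)
    return np.sort(np.argsort(envelope)[-n_candidates:])
```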


Journal of the Acoustical Society of America | 1998

Speech encoding apparatus utilizing stored code data

Masami Akamine; Masahiro Oshikiri; Kimio Miseki

A learning-type speech encoding apparatus comprises an adaptive code book storing driving signal vectors, a minimum distortion searching circuit for searching the adaptive code book for an optimum driving signal vector on the basis of the input speech signal, a synthesizing filter for synthesizing a speech signal using the optimum driving signal vector retrieved, a buffer for storing the optimum driving signal vector retrieved, a training vector creating section for producing a training vector by segmenting the stored driving signal vector in units of a specified length, and a learning section for learning by constantly updating the driving signal vectors in the code book on the basis of the training vector.
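
The patent does not spell out the update rule, but the interplay of the buffer, training vector creating section, and learning section can be sketched as a simple online codebook update (Python with numpy; the names and learning rule are entirely hypothetical):

```python
import numpy as np

def update_codebook(codebook, stored_vector, seg_len, lr=0.05):
    # Segment the buffered optimal driving signal vector into training
    # vectors, then nudge the nearest codebook entry toward each one.
    for i in range(0, len(stored_vector) - seg_len + 1, seg_len):
        train = stored_vector[i:i + seg_len]
        j = np.argmin(np.sum((codebook - train) ** 2, axis=1))
        codebook[j] += lr * (train - codebook[j])
    return codebook
```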


IEEE Signal Processing Letters | 2016

Voice Activity Detection: Merging Source and Filter-based Information

Thomas Drugman; Yannis Stylianou; Yusuke Kida; Masami Akamine

Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density, while others exploit the periodicity of the signal. The goal of this letter is to investigate the joint use of source and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. These features then serve as the input to an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature fusion and decision fusion. Our experiments indicate an absolute reduction of 3% in the equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training, and its efficient information fusion, the proposed system yields a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
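
Of the two merging strategies, decision fusion is the simpler to illustrate: each classifier emits a per-frame speech posterior and the fused posterior is thresholded. A minimal sketch (Python with numpy; the weighting scheme is an assumption, not the letter's exact rule):

```python
import numpy as np

def fuse_decisions(p_source, p_filter, w=0.5, threshold=0.5):
    # Decision fusion: combine the per-frame speech posteriors of the
    # source-based and filter-based classifiers, then threshold.
    p = w * p_source + (1.0 - w) * p_filter
    return p >= threshold
```

Feature fusion would instead concatenate the source and filter feature vectors and train a single classifier on the combined input.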


International Conference on Acoustics, Speech, and Signal Processing | 1990

CELP coding with an adaptive density pulse excitation model

Masami Akamine; Kimio Miseki

An approach to dynamically allocating bits to excitation vectors to improve the performance of a code-excited linear prediction (CELP) coder is proposed. The method is based on an adaptive density pulse (ADP) excitation. By using the ADP, the bit allocation to the excitation vector can be easily varied, and the number of excitation samples can be reduced. The effects of the ADP parameters on the synthetic speech quality are discussed, the ADP-CELP coder is described, and the benefit of introducing the ADP excitation model into the CELP coder is evaluated. The segmental signal-to-noise ratio gains of the ADP-CELP coder over the conventional CELP are about 2 dB at both 8 kbit/s and 4.8 kbit/s.
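
The ADP idea can be pictured with a small sketch (Python with numpy; the parameterization is hypothetical): pulses are placed at a controllable density, so lowering the density shrinks the excitation description and frees bits for other parameters.

```python
import numpy as np

def adp_excitation(subframe_len, density, amplitudes, offset=0):
    # One pulse every `density` samples; fewer excitation samples to code
    # when the density is lowered.
    e = np.zeros(subframe_len)
    positions = np.arange(offset, subframe_len, density)
    e[positions] = amplitudes[:len(positions)]
    return e
```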


International Conference on Acoustics, Speech, and Signal Processing | 2009

Bayesian feature enhancement using a mixture of unscented transformation for uncertainty decoding of noisy speech

Yusuke Shinohara; Masami Akamine

A new parameter estimation method for Model-Based Feature Enhancement (MBFE) is presented. The conventional MBFE uses the vector Taylor series to calculate the parameters of non-linearly transformed distributions, but the linearization degrades performance. We instead use the unscented transformation to estimate the parameters, propagating a minimal number of samples through the nonlinear transformation. By avoiding the linearization, the parameters are estimated more accurately. Experimental results on Aurora2 show that the proposed method yields a relative word error rate reduction of 8.48%, while the computational cost is only modestly higher than that of the conventional MBFE.
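
For reference, a generic unscented transformation looks like the sketch below (Python with numpy, standard 2n+1 sigma points; this is the textbook construction, not necessarily the paper's exact variant): a small, deterministically chosen set of points is pushed through the nonlinearity, and the output mean and covariance are re-estimated from them.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    # Propagate a Gaussian through the nonlinearity f with 2n+1 sigma
    # points instead of a first-order (vector Taylor series) expansion.
    n = len(mean)
    S = np.linalg.cholesky((n + kappa) * cov)
    sigma = np.vstack([mean, mean + S.T, mean - S.T])   # (2n+1, n)
    w = np.full(2 * n + 1, 0.5 / (n + kappa))
    w[0] = kappa / (n + kappa)
    Y = np.array([f(x) for x in sigma])
    mean_y = w @ Y
    d = Y - mean_y
    cov_y = (w[:, None] * d).T @ d
    return mean_y, cov_y
```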


IEEE Journal of Selected Topics in Signal Processing | 2014

Building HMM-TTS Voices on Diverse Data

Vincent Wan; Javier Latorre; Kayoko Yanagisawa; Norbert Braunschweiler; Langzhou Chen; Mark J. F. Gales; Masami Akamine

The statistical models of hidden Markov model based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. It is possible to acquire data from many different sources, but combining them leads to a non-homogeneous, or diverse, dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context-dependent decision trees to create HMM-TTS voices from diverse data: speech recorded in studios mixed with speech obtained from the internet. Training AVM and CAT models on diverse data yields better quality speech than training on high quality studio data alone. Tests show that CAT is able to create a voice for a target speaker with as little as 7 seconds of adaptation data; an AVM would need more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher quality voices than AVMs irrespective of the amount of adaptation data. Lastly, it is shown that it is beneficial to model the data using multiple context clustering decision trees.
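
The reason CAT needs so little adaptation data can be sketched in a few lines (Python with numpy; names are illustrative): each Gaussian mean is an interpolation of cluster means, so only a short weight vector, one weight per cluster, has to be estimated for a new speaker.

```python
import numpy as np

def cat_mean(cluster_means, weights):
    # (dim, K) cluster mean matrix times a (K,) speaker weight vector.
    return cluster_means @ weights

def estimate_weights(cluster_means, observed_mean):
    # Least-squares fit of the weights to the adaptation statistics; a
    # stand-in for the actual EM-style update.
    w, *_ = np.linalg.lstsq(cluster_means, observed_mean, rcond=None)
    return w
```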


International Conference on Acoustics, Speech, and Signal Processing | 2010

Covariance clustering on Riemannian manifolds for acoustic model compression

Yusuke Shinohara; Takashi Masuko; Masami Akamine

A new method of covariance clustering for acoustic model compression is proposed. Since covariance matrices do not form a Euclidean vector space, standard vector clustering algorithms cannot be used effectively for covariance clustering. In this paper, we propose a novel clustering algorithm based on a Riemannian framework, where the covariance space is treated as a Riemannian manifold equipped with the Fisher information metric, and notions of distance and mean are defined on the manifold. The LBG clustering algorithm is naturally extended to the covariance space under this Riemannian framework. Experimental results show the effectiveness of the proposed method, reducing the acoustic model size by nearly half without noticeable loss in recognition performance.
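
A sketch of the two manifold primitives the extended LBG algorithm needs, distance and mean (Python with numpy/scipy; the distance below is the affine-invariant one, which agrees with the Fisher metric for fixed-mean Gaussians up to scale, and the log-Euclidean mean is a common closed-form surrogate for the exact Karcher mean):

```python
import numpy as np
from scipy.linalg import eigvalsh, expm, logm

def riemannian_distance(A, B):
    # ||log(B^{-1/2} A B^{-1/2})||_F, via generalized eigenvalues of (A, B).
    lam = eigvalsh(A, B)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def manifold_mean(mats):
    # Log-Euclidean mean: average the matrix logarithms, then exponentiate.
    return expm(sum(logm(M) for M in mats) / len(mats))
```

An LBG-style loop then alternates between assigning each covariance to its nearest centroid under this distance and recomputing each centroid with this mean.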
