Ranniery Maia
Toshiba
Network
Latest external collaborations at the country level.
Publication
Featured research published by Ranniery Maia.
international conference on acoustics, speech, and signal processing | 2012
Ranniery Maia; Masami Akamine; Mark J. F. Gales
Statistical parametric synthesizers usually rely on a simplified model of speech production where a minimum-phase filter is driven by a zero or random phase excitation signal. However, this procedure does not take into account the natural mixed-phase characteristics of the speech signal. This paper addresses this issue by proposing the use of the complex cepstrum for modeling phase information in statistical parametric speech synthesizers. Here a frame-based complex cepstrum is calculated through the interpolation of pitch-synchronous magnitude and unwrapped phase spectra. The noncausal part of the frame-based complex cepstrum is then modeled as phase features in the statistical parametric synthesizer. At synthesis time, the generated phase parameters are used to derive coefficients of a glottal filter. Experimental results show that the proposed approach effectively embeds phase information in the synthetic speech, resulting in close-to-natural waveforms and better speech quality.
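As a rough illustration of the analysis step described above, a frame-wise complex cepstrum can be computed as the inverse FFT of the log magnitude plus the unwrapped, linearly detrended phase. This is a minimal sketch using a plain FFT rather than the paper's pitch-synchronous interpolation, and all parameter values are illustrative:

```python
import numpy as np

def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum: inverse FFT of log|X(w)| + j*(unwrapped phase),
    with the linear phase trend removed so the cepstrum decays."""
    spec = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spec) + 1e-12)   # guard against log(0)
    phase = np.unwrap(np.angle(spec))
    n = np.arange(n_fft)
    phase = phase - phase[n_fft // 2] / (n_fft // 2) * n  # detrend
    return np.real(np.fft.ifft(log_mag + 1j * phase))

frame = np.hanning(400) * np.random.default_rng(0).standard_normal(400)
c = complex_cepstrum(frame)
causal = c[:512]            # positive quefrencies (minimum-phase side)
noncausal = c[512:][::-1]   # negative quefrencies (phase information)
```

The noncausal half is what the paper models as phase features; the causal half relates to the usual minimum-phase spectral envelope.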
ieee international telecommunications symposium | 2006
Denilson C. Silva; Amaro A. de Lima; Ranniery Maia; Daniela Braga; João Moraes; João Alfredo Moraes; Fernando Gil Resende
This paper presents a grapheme-phone converter and a stress determination algorithm based on rules. The proposed set of rules was implemented and tested on a randomly chosen extract of the CETEN-Folha text database. Computer experiments show that it achieves accuracies of 97.44% and 98.58% for the grapheme-phone converter and the stress determination algorithm, respectively.
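Rule-based conversion of this kind can be sketched as an ordered list of context-sensitive rewrite rules applied left to right. The rules below are a tiny, hypothetical subset for Brazilian Portuguese, chosen for illustration only; they are not the paper's actual rule set:

```python
import re

# Hypothetical toy rules: (regex on graphemes, output phone). Order
# matters: more specific contexts must precede more general ones.
RULES = [
    (r"ch", "S"),          # 'ch' -> /S/ as in "chave"
    (r"lh", "L"),          # palatal lateral
    (r"nh", "J"),          # palatal nasal
    (r"c(?=[ei])", "s"),   # 'c' before e/i -> /s/
    (r"c", "k"),
    (r"ss", "s"),
    (r"s", "s"),
]

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        for pat, phone in RULES:
            m = re.match(pat, word[i:])
            if m:
                phones.append(phone)
                i += max(m.end(), 1)  # lookaheads are zero-width
                break
        else:
            phones.append(word[i])    # fallback: emit grapheme itself
            i += 1
    return phones

print(g2p("chave"))  # ['S', 'a', 'v', 'e']
```

Stress determination would be a second rule pass over the resulting syllable structure.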
Speech Communication | 2013
Ranniery Maia; Masami Akamine; Mark J. F. Gales
Highlights:
- Complex cepstrum is applied to statistical parametric speech synthesis.
- At synthesis time, phase features derived from the all-pass component of the complex cepstrum are used to implement a glottal pulse filter.
- Experimental results show that the addition of the phase features results in better synthetic speech quality.

Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero or random phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters used to control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed-phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted on a frame basis and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature for both female and male voices.
Computer Speech & Language | 2014
Cassia Valentini-Botinhao; Junichi Yamagishi; Simon King; Ranniery Maia
This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1-4kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
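The glimpse proportion that such modifications optimize can be approximated as the fraction of spectro-temporal cells in which the speech level exceeds the noise level by a threshold. The published measure uses a gammatone filterbank and a local SNR criterion of about 3 dB; the sketch below substitutes a plain STFT for brevity, so it is an approximation, not the exact measure:

```python
import numpy as np

def glimpse_proportion(speech, noise, frame=256, thresh_db=3.0):
    """Simplified glimpse proportion: fraction of time-frequency cells
    where the speech magnitude exceeds the noise magnitude by thresh_db.
    (The published measure uses a gammatone filterbank instead.)"""
    win = np.hanning(frame)
    hop = frame // 2
    n_frames = (len(speech) - frame) // hop
    glimpses, total = 0, 0
    for t in range(n_frames):
        s = np.abs(np.fft.rfft(win * speech[t*hop:t*hop+frame]))
        n = np.abs(np.fft.rfft(win * noise[t*hop:t*hop+frame]))
        snr = 20 * np.log10((s + 1e-12) / (n + 1e-12))
        glimpses += np.sum(snr > thresh_db)
        total += snr.size
    return glimpses / total

rng = np.random.default_rng(0)
noise = rng.standard_normal(16000)
speech = 4.0 * rng.standard_normal(16000)  # ~12 dB above the noise
gp = glimpse_proportion(speech, noise)
```

The modification in the paper then adjusts the Mel cepstral coefficients so that this proportion increases under a fixed speech-energy constraint.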
international conference on acoustics, speech, and signal processing | 2012
Cassia Valentini-Botinhao; Ranniery Maia; Junichi Yamagishi; Simon King; Heiga Zen
In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech shaped noise.
international conference on acoustics, speech, and signal processing | 2014
Qiong Hu; Yannis Stylianou; Korin Richmond; Ranniery Maia; Junichi Yamagishi; Javier Latorre
This paper presents a fixed- and low-dimensional, perceptually based dynamic sinusoidal model of speech referred to as PDM (Perceptual Dynamic Model). To decrease and fix the number of sinusoidal components typically used in the standard sinusoidal model, we propose to use only one dynamic sinusoidal component per critical band. For each band, the sinusoid with the maximum spectral amplitude is selected and associated with the centre frequency of that critical band. The model is expanded at low frequencies by incorporating sinusoids at the boundaries of the corresponding bands while at the higher frequencies a modulated noise component is used. A listening test is conducted to compare speech reconstructed with PDM and state-of-the-art models of speech, where all models are constrained to use an equal number of parameters. The results show that PDM is clearly preferred in terms of quality over the other systems.
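The band-wise selection step can be sketched as follows. The critical-band edges below are a coarse, illustrative Bark-like partition, not the exact layout used in the paper:

```python
import numpy as np

# Hypothetical critical-band edges in Hz (coarse Bark-like partition,
# for illustration only).
BAND_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
              6400, 7700, 9500, 12000]

def pdm_sinusoids(spectrum, fs):
    """One sinusoid per critical band: pick the bin with maximum
    amplitude and report it at the band's centre frequency."""
    freqs = np.fft.rfftfreq(2 * (len(spectrum) - 1), 1.0 / fs)
    comps = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        idx = np.where((freqs >= lo) & (freqs < hi))[0]
        if idx.size == 0:          # band above Nyquist: skip
            continue
        peak = idx[np.argmax(spectrum[idx])]
        comps.append(((lo + hi) / 2.0, spectrum[peak]))  # (centre f, amp)
    return comps

spec = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal(1024)))
sines = pdm_sinusoids(spec, fs=16000)
```

This is what fixes the dimensionality: the number of components equals the number of critical bands below the Nyquist frequency, regardless of the frame content.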
international conference on acoustics, speech, and signal processing | 2008
Ranniery Maia; Tomoki Toda; Keiichi Tokuda; Shinsuke Sakai; Satoshi Nakamura
One of the issues with speech synthesizers based on hidden Markov models concerns the vocoded quality of the synthesized speech. Following the principle of analysis-by-synthesis speech coders, a trainable excitation model has been proposed to improve naturalness, in which a set of state-dependent filters is designed so as to minimize the distortion between the residual and the synthetic excitation. Although this approach appears successful, state definition remains an open issue. This paper describes a method for state definition wherein bottom-up clustering is performed on full-context decision trees, using the likelihood of the residual database as the merging criterion. Experiments have shown that improvements in residual modeling can be achieved through better filter design.
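A likelihood-based merging criterion of the kind described above can be sketched with diagonal-covariance Gaussian leaf models: merging two clusters of residual frames changes the total log-likelihood, and acoustically similar clusters cost less to merge than dissimilar ones. The model and the data here are illustrative, not the paper's actual clustering procedure:

```python
import numpy as np

def merge_loglik_gain(a, b):
    """Change in total log-likelihood from merging two clusters of
    residual frames, each modelled by a diagonal-covariance Gaussian
    fitted by maximum likelihood; a, b are (frames x dims) arrays."""
    def ll(x):
        var = x.var(axis=0) + 1e-8
        # ML log-likelihood of data under its own Gaussian fit
        return -0.5 * x.shape[0] * np.sum(np.log(2 * np.pi * var) + 1.0)
    merged = np.vstack([a, b])
    return ll(merged) - (ll(a) + ll(b))

rng = np.random.default_rng(1)
close = merge_loglik_gain(rng.normal(0, 1, (100, 4)),
                          rng.normal(0, 1, (100, 4)))   # similar clusters
far = merge_loglik_gain(rng.normal(0, 1, (100, 4)),
                        rng.normal(5, 1, (100, 4)))     # distant clusters
```

Bottom-up clustering would repeatedly merge the pair with the smallest likelihood loss until the desired number of states is reached.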
international conference on acoustics, speech, and signal processing | 2015
Qiong Hu; Yannis Stylianou; Ranniery Maia; Korin Richmond; Junichi Yamagishi
Sinusoidal vocoders can generate high quality speech, but they have not been extensively applied to statistical parametric speech synthesis. This paper presents two ways of using dynamic sinusoidal models for statistical speech synthesis, enabling the sinusoid parameters to be modelled in HMM-based synthesis. In the first method, features extracted from a fixed- and low-dimensional, perception-based dynamic sinusoidal model (PDM) are statistically modelled directly. In the second method, we convert both static amplitude and dynamic slope from all the harmonics of a signal, which we term the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients) for modelling. During synthesis, HDM is then used to reconstruct speech. We have compared the voice quality of these two methods to the STRAIGHT cepstrum-based vocoder with mixed excitation in formal listening tests. Our results show that HDM with intermediate parameters can generate quality comparable to STRAIGHT, while PDM direct modelling seems promising in terms of producing good speech quality without resorting to intermediate parameters such as cepstra.
international conference on acoustics, speech, and signal processing | 2014
Ranniery Maia; Yannis Stylianou
This paper presents a study on complex cepstrum-based speech factorization for acoustic modeling in statistical parametric synthesizers. The factorization is conducted assuming that both the vocal tract resonances and the glottal flow effect are fully represented by the complex cepstrum. We investigated four different representations of the complex cepstrum in the acoustic models and compared their performance in terms of objective measures between reconstructed and natural waveforms and the final quality of the synthesized speech. According to the experimental results, the all-pass/minimum-phase and real cepstrum/phase cepstrum decompositions are best at preserving the complex cepstrum information after the parameter generation process.
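The all-pass/minimum-phase decomposition mentioned above can be sketched directly on the cepstral coefficients: folding the anti-causal coefficients onto the causal side gives the minimum-phase cepstrum, and the anti-symmetric remainder is the all-pass (phase) cepstrum. The FFT-style indexing convention below is an assumption made for illustration:

```python
import numpy as np

def decompose(ccep):
    """Split a complex cepstrum (FFT-ordered: index 0 is quefrency 0,
    the last half holds negative quefrencies) into minimum-phase and
    all-pass cepstra, so that c_min + c_ap == ccep."""
    n = len(ccep)
    half = n // 2
    c_min = np.zeros(n)
    c_ap = np.zeros(n)
    c_min[0] = ccep[0]
    c_min[half] = ccep[half]       # Nyquist quefrency -> min-phase part
    for k in range(1, half):
        c_min[k] = ccep[k] + ccep[n - k]   # fold anti-causal forward
        c_ap[k] = -ccep[n - k]             # anti-symmetric remainder
        c_ap[n - k] = ccep[n - k]
    return c_min, c_ap

ccep = np.random.default_rng(0).standard_normal(16)
c_min, c_ap = decompose(ccep)
```

By construction the two parts sum back to the original complex cepstrum, which is the property the paper relies on when recombining generated parameters.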
international conference on acoustics, speech, and signal processing | 2014
Vassilios Tsiaras; Ranniery Maia; Vassilios Diakoloukas; Yannis Stylianou; Vassilios Digalakis
Hidden Markov models (HMMs) are becoming the dominant approach for text-to-speech synthesis (TTS). HMMs provide an attractive acoustic modeling scheme which has been exhaustively investigated and developed for many years. Modern HMM-based speech synthesizers have approached the quality of the best state-of-the-art unit selection systems. However, we believe that statistical parametric speech synthesis has not reached its potential, since HMMs are limited by several assumptions that do not hold for speech. We therefore propose in this paper to use Linear Dynamical Models (LDMs) instead of HMMs. LDMs can better model the dynamics of speech and can produce a naturally smoother trajectory for the synthesized speech. We perform a series of experiments using different system configurations to assess the performance of LDMs for speech synthesis. We show that LDM-based synthesizers can outperform HMM-based ones in terms of cepstral distance and are a very promising acoustic modeling alternative for statistical parametric TTS.
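A minimal LDM sketch shows why the generated trajectories are naturally smooth: the hidden state evolves linearly, so the noiseless observation sequence is a smooth function of time rather than a sequence of piecewise-constant state means. The matrices below are illustrative toy values, not trained parameters from the paper:

```python
import numpy as np

# Toy LDM: x[t+1] = A x[t] + w, y[t] = C x[t] + v.
A = np.array([[0.98, 0.05], [-0.05, 0.98]])  # slowly rotating dynamics
C = np.array([[1.0, 0.0]])                   # observe first state dim

def generate_trajectory(x0, steps, noise_std=0.0, seed=0):
    """Roll the LDM forward; with zero noise this yields the smooth
    mean trajectory that motivates LDMs for parameter generation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    ys = []
    for _ in range(steps):
        ys.append(float(C @ x))
        x = A @ x + noise_std * rng.standard_normal(2)
    return np.array(ys)

traj = generate_trajectory([1.0, 0.0], steps=50)
```

In a full synthesizer the state and observation parameters would be estimated per linguistic context (e.g. with an EM/Kalman smoother), and synthesis would roll the model forward as above.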
Collaboration
National Institute of Information and Communications Technology