Publication


Featured research published by Javier Latorre.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Heiga Zen; Norbert Braunschweiler; Sabine Buchholz; Mark J. F. Gales; Katherine Mary Knill; Sacha Krstulovic; Javier Latorre

An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
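The two transform types named in the abstract can be illustrated in a few lines. Below is a minimal numpy sketch, not the paper's implementation: the dimensions, weights, and variable names are all invented for the example. Language factors enter as interpolation weights over cluster means; speaker factors enter as a CMLLR-style affine transform.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 40, 4                      # hypothetical: feature dimension, number of clusters

# Cluster mean vectors for one decision-tree leaf (cluster 1 acts as the bias).
cluster_means = rng.normal(size=(P, D))

# Language-specific interpolation weights; the bias-cluster weight stays fixed at 1.
lam = np.array([1.0, 0.3, -0.2, 0.5])
mu_lang = lam @ cluster_means     # language-dependent Gaussian mean, shape (D,)

# Speaker-specific CMLLR transform (A, b) acting on the observations o' = A o + b;
# equivalently, the model mean becomes A^{-1} (mu_lang - b).
A = np.eye(D) + 0.01 * rng.normal(size=(D, D))
b = 0.1 * rng.normal(size=D)
mu_speaker = np.linalg.solve(A, mu_lang - b)
```

The key point of the factorization is that `lam` can be estimated on one language and `A, b` on one speaker, then recombined freely, which is what allows synthesizing a language the target speaker never recorded.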


Speech Communication | 2006

New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer

Javier Latorre; Koji Iwano; Sadaoki Furui

In this paper we present a new method for synthesizing multiple languages with the same voice, using HMM-based speech synthesis. Our approach, which we call HMM-based polyglot synthesis, consists of mixing speech data from several speakers in different languages to create a speaker- and language-independent (SI) acoustic model. We then adapt the resulting SI model to a specific speaker in order to create a speaker-dependent (SD) acoustic model. Using the SD model it is possible to synthesize any of the languages used to train the SI model with the voice of the speaker, regardless of the speaker's language. We show that the performance obtained with our method is better than that of methods based on phone mapping for both adaptation and synthesis. Furthermore, for languages not included during training, the performance of our approach also equals or surpasses that of any monolingual synthesizer based on the languages used to train the multilingual one. This means that our method can be used to create synthesizers for languages for which no speech resources are available.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2005

Polyglot synthesis using a mixture of monolingual corpora

Javier Latorre; Koji Iwano; Sadaoki Furui

The paper proposes a new approach to multilingual synthesis based on an HMM synthesis technique. The idea consists of combining data from different monolingual speakers in different languages to create a single polyglot average voice. This average voice is then transformed into the voice of any real speaker of one of these languages. The speech synthesized in this way has the same intelligibility and retains the same individuality across all the languages mixed to create the average voice, regardless of the target speaker's own language.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?

Javier Latorre; Mark J. F. Gales; Sabine Buchholz; Katherine Mary Knill; Masatsune Tamura; Yamato Ohtani; Masami Akamine

Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal which is used for the generation of the source-excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
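To make the "continuous F0 plus soft decision" idea concrete, here is a deliberately simplified single-band sketch of a mixed excitation generator. Real systems use band aperiodicities and proper pulse shaping; the function name, frame parameters, and the single-band weighting are assumptions for illustration only.

```python
import numpy as np

def mixed_excitation(f0, aperiodicity, fs=16000, frame_shift=80, seed=0):
    """Single-band sketch of a mixed source excitation.

    f0 is a *continuous* contour (no zeros in unvoiced regions); the frame-level
    aperiodicity in [0, 1] acts as the soft voiced/unvoiced decision, weighting
    noise against a pulse train instead of applying a hard V/UV switch."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0) * frame_shift)
    phase = 0.0
    for t, (hz, ap) in enumerate(zip(f0, np.clip(aperiodicity, 0.0, 1.0))):
        for n in range(frame_shift):
            phase += hz / fs
            pulse = 1.0 if phase >= 1.0 else 0.0   # emit a pulse at each cycle boundary
            phase -= int(phase)
            out[t * frame_shift + n] = (1.0 - ap) * pulse + ap * rng.standard_normal()
    return out
```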


IEEE Journal of Selected Topics in Signal Processing | 2014

Building HMM-TTS Voices on Diverse Data

Vincent Wan; Javier Latorre; Kayoko Yanagisawa; Norbert Braunschweiler; Langzhou Chen; Mark J. F. Gales; Masami Akamine

The statistical models of hidden Markov model based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. It is possible to acquire data from many different sources but combining them leads to a non-homogeneous or diverse dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context dependent decision trees to create HMM-TTS voices using diverse data: speech data recorded in studios mixed with speech data obtained from the internet. Training AVM and CAT models on diverse data yields better quality speech than training on high quality studio data alone. Tests show that CAT is able to create a voice for a target speaker with as little as 7 seconds of speech; an AVM would need more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher quality voices than AVMs irrespective of the amount of adaptation data. Lastly, it is shown that it is beneficial to model the data using multiple context clustering decision trees.
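The core of CAT is that a speaker is a point in a low-dimensional weight space, so adaptation only has to estimate that point. A minimal numpy sketch of this idea follows; it ignores the multiple decision trees the paper uses, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D, P = 40, 8                     # hypothetical: feature dimension, clusters incl. bias

M = rng.normal(size=(D, P))      # cluster mean matrix for one decision-tree leaf
lam = np.concatenate(([1.0], rng.normal(size=P - 1)))  # speaker = point in (P-1)-dim space

mu_speaker = M @ lam             # speaker-specific Gaussian mean

# Adapting to a new speaker reduces to estimating lam: a handful of numbers
# shared across all leaves, which is why a few seconds of speech can suffice.
```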


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

A fixed dimension and perceptually based dynamic sinusoidal model of speech

Qiong Hu; Yannis Stylianou; Korin Richmond; Ranniery Maia; Junichi Yamagishi; Javier Latorre

This paper presents a fixed- and low-dimensional, perceptually based dynamic sinusoidal model of speech referred to as PDM (Perceptual Dynamic Model). To decrease and fix the number of sinusoidal components typically used in the standard sinusoidal model, we propose to use only one dynamic sinusoidal component per critical band. For each band, the sinusoid with the maximum spectral amplitude is selected and associated with the centre frequency of that critical band. The model is expanded at low frequencies by incorporating sinusoids at the boundaries of the corresponding bands while at the higher frequencies a modulated noise component is used. A listening test is conducted to compare speech reconstructed with PDM and state-of-the-art models of speech, where all models are constrained to use an equal number of parameters. The results show that PDM is clearly preferred in terms of quality over the other systems.
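The selection rule stated in the abstract (one peak-amplitude sinusoid per critical band, tied to the band centre) is easy to sketch. The band edges below are generic Bark-like values, not the paper's exact layout, and the low-frequency boundary sinusoids and high-frequency modulated noise are omitted.

```python
import numpy as np

# Hypothetical Bark-like critical-band edges in Hz.
BAND_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
              1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400]

def pdm_select(freqs, amps):
    """Keep one sinusoid per critical band: the maximum-amplitude component,
    re-assigned to the band's centre frequency, so the parameter count is fixed."""
    selected = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        centre = 0.5 * (lo + hi)
        idx = np.where((freqs >= lo) & (freqs < hi))[0]
        amp = amps[idx].max() if idx.size else 0.0
        selected.append((centre, amp))
    return selected
```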


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2007

Combining Gaussian Mixture Model with Global Variance Term to Improve the Quality of an HMM-Based Polyglot Speech Synthesizer

Javier Latorre; Koji Iwano; Sadaoki Furui

This paper proposes a new method to calculate the cepstral coefficients for an HMM-based synthesizer. It consists of a direct maximization of the log-likelihood function of a Gaussian mixture model using a gradient ascent algorithm. The method permits efficient integration of the global variance term with a Gaussian mixture acoustic model. Perceptual experiments confirmed that these two factors produce significant improvements in speech quality that are independent of each other; by using the proposed method, it is possible to get the benefits of both. This paper also proposes a 2-class model for the global variance that discriminates between consonants and vowels. Such a 2-class global variance model produces more stable cepstral coefficients than the single-class one.
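A sketch of the gradient-ascent idea: combine the acoustic-model log-likelihood gradient with the gradient of a Gaussian global-variance (GV) term and step uphill. The acoustic-model gradient is left as an assumed callback (`gmm_log_grad`), since its form depends on the model; everything else below is generic and the weights and step sizes are invented.

```python
import numpy as np

def gv_log_grad(c, gv_mean, gv_var):
    """Gradient of the Gaussian global-variance log-likelihood w.r.t. the
    coefficient trajectory c (one cepstral dimension over T frames)."""
    T = len(c)
    v = np.var(c)
    dv_dc = 2.0 * (c - c.mean()) / T        # d var(c) / dc
    return -(v - gv_mean) / gv_var * dv_dc

def ascend(c0, gmm_log_grad, gv_mean, gv_var, w_gv=1.0, lr=1e-2, steps=200):
    """Gradient ascent on log p_GMM(c) + w_gv * log p_GV(var(c)).
    gmm_log_grad is assumed to return d log p_GMM / dc for the acoustic model."""
    c = c0.copy()
    for _ in range(steps):
        c = c + lr * (gmm_log_grad(c) + w_gv * gv_log_grad(c, gv_mean, gv_var))
    return c
```

The 2-class variant the abstract mentions would simply use separate `gv_mean`/`gv_var` statistics for consonant and vowel frames.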


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Cluster Adaptive Training of Average Voice Models

Vincent Wan; Javier Latorre; Kayoko Yanagisawa; Mark J. F. Gales; Yannis Stylianou

Hidden Markov model based text-to-speech systems may be adapted so that the synthesised speech sounds like a particular person. The average voice model (AVM) approach uses linear transforms to achieve this, while multiple decision tree cluster adaptive training (CAT) represents different speakers as points in a low-dimensional space. This paper describes a novel combination of CAT and AVM for modelling speakers. CAT yields higher quality synthetic speech than AVMs, but AVMs model the target speaker better. The resulting combination may be interpreted as a more powerful version of the AVM. Results show that the combination achieves better target-speaker similarity than either AVM or CAT alone, while the speech quality lies between the two.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Training a supra-segmental parametric F0 model without interpolating F0

Javier Latorre; Mark J. F. Gales; Katherine Mary Knill; Masami Akamine

Combining multiple intonation models at different linguistic levels is an effective way to improve the naturalness of the predicted F0. In many of these approaches, the intonation models for suprasegmental levels are based on a parametrization of the log-F0 contours over the units of that level. However, many of these parametrizations are not stable when applied to discontinuous signals, so the F0 signal has to be interpolated. These interpolated values introduce a distortion in the coefficients that degrades the quality of the model. This paper proposes two methods that eliminate the need for such interpolation, one based on regularization and the other on factor analysis. Subjective evaluations show that, for a discrete cosine transform (DCT) syllable-level model, both approaches result in a significant improvement with respect to a baseline using interpolated F0. The approach based on regularization yields the best results.
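One way to picture the regularization-based route: fit the syllable's DCT coefficients by least squares over the voiced frames only, with a ridge penalty keeping the problem well-posed where unvoiced gaps leave basis functions unobserved. This is a sketch of the idea under that assumption, not the paper's exact estimator; the function name and hyperparameters are invented.

```python
import numpy as np

def dct_fit_voiced(logf0, voiced, n_coef=7, reg=1e-2):
    """Fit syllable-level DCT coefficients to log-F0 using voiced frames only,
    so no interpolation of the unvoiced regions is needed."""
    T = len(logf0)
    t, k = np.arange(T), np.arange(n_coef)
    B = np.cos(np.pi * np.outer(t + 0.5, k) / T)   # DCT-II basis over the syllable
    Bv, yv = B[voiced], logf0[voiced]              # drop unvoiced frames
    # Regularized normal equations: (Bv^T Bv + reg * I) c = Bv^T yv
    return np.linalg.solve(Bv.T @ Bv + reg * np.eye(n_coef), Bv.T @ yv)
```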


Journal of the Acoustical Society of America | 2006

Clustering strategies for hidden Markov model-based polyglot synthesis

Javier Latorre; Koji Iwano; Sadaoki Furui

A speaker-adaptable polyglot synthesizer is a system that can synthesize multiple languages with different voices. In our approach, we mix data from multiple speakers in different languages to create an HMM-based multilingual average voice. This average voice is then adapted to a target speaker with a small amount of speech data from that speaker. Using this adapted voice, our system is able to synthesize any of the languages for which it was trained, with a voice that mimics that of the target speaker. The main technical difficulty of this approach is how to cluster the acoustic models to create a voice that sounds the same across all the languages. In this paper we analyze different possible strategies for clustering the HMMs and compare them in a perceptual test. The results showed that questions referring only to phonetic features, ignoring the original language of the sounds, produced voices with a stronger foreign accent but which sounded clearly more consistent across languages. Surprisingly, this foreign ac...

Collaboration


Dive into Javier Latorre's collaboration.

Top Co-Authors

Sadaoki Furui

Tokyo Institute of Technology


Heiga Zen

Nagoya Institute of Technology
