Damien Lolive
University of Rennes
Publications
Featured research published by Damien Lolive.
Text, Speech and Dialogue | 2014
David Guennec; Damien Lolive
Speech synthesis systems usually use the Viterbi algorithm as a basis for unit selection, although it is not the only possible choice. In this paper, we study a speech synthesis system relying on the A* algorithm, which is a general pathfinding strategy developing a graph rather than a lattice. Using state-of-the-art techniques, we propose and analyze different selection strategies and evaluate them using a subjective evaluation on the N-best paths returned. The best strategy achieves a MOS score of 3.29 (±0.18). More interestingly, the proposed system enables an in-depth analysis of unit selection.
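As an illustration of the search strategy described above, the following is a minimal sketch of A* unit selection over a graph of candidate units. The cost functions and the simple heuristic are placeholders, not the costs actually used in the paper; continuing to pop goal states instead of returning the first one would yield the N-best paths analysed there.

```python
import heapq
import itertools

def a_star_unit_selection(candidates, target_cost, concat_cost):
    """A* search over a graph of candidate units.

    candidates[i] is the list of speech units available for target
    position i; target_cost(i, u) and concat_cost(u, v) are
    user-supplied cost functions (placeholders).
    """
    n = len(candidates)
    # Admissible heuristic: cheapest possible target cost for the remaining
    # positions (concatenation costs are assumed non-negative and ignored).
    h = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h[i] = h[i + 1] + min(target_cost(i, u) for u in candidates[i])

    tie = itertools.count()                 # breaks ties between equal-cost states
    heap = [(h[0], next(tie), 0.0, 0, [])]  # (f, tie, g, position, path)
    while heap:
        _, _, g, i, path = heapq.heappop(heap)
        if i == n:                          # every target position is covered
            return path, g
        for u in candidates[i]:
            step = target_cost(i, u)
            if path:
                step += concat_cost(path[-1], u)
            heapq.heappush(heap,
                           (g + step + h[i + 1], next(tie), g + step, i + 1, path + [u]))
    return None, float("inf")
```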
Conference of the International Speech Communication Association | 2016
Marie Tahon; Raheel Qader; Gwénolé Lecorvé; Damien Lolive
Text-to-speech (TTS) systems are built on speech corpora which are labeled with carefully checked and segmented phonemes. However, phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are usually inconsistent with those from the corpus, thus leading to poor-quality synthetic speech signals. To solve this problem, the present work aims at adapting automatically generated pronunciations to the corpus. The main idea is to train corpus-specific phoneme-to-phoneme conditional random fields with a large set of linguistic, phonological, articulatory and acoustic-prosodic features. Features are first selected under cross-validation conditions, then combined to produce the final best feature set. Pronunciation models are evaluated in terms of phoneme error rate and through perceptual tests. Experiments carried out on a French speech corpus show an improvement in the quality of speech synthesis when pronunciation models are included in the phonetization process. Apart from improving TTS quality, the presented pronunciation adaptation method also brings interesting perspectives in terms of expressive speech synthesis.
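A minimal sketch of such a phoneme-to-phoneme CRF is shown below, assuming the sklearn-crfsuite package and a one-to-one alignment between canonical and realised phonemes; the feature names and the toy sequences are purely illustrative, not the feature set or data of the paper.

```python
import sklearn_crfsuite

def phoneme_features(seq, i):
    """Features for the i-th canonical phoneme of an utterance (illustrative)."""
    return {
        "phon": seq[i],
        "prev": seq[i - 1] if i > 0 else "<s>",
        "next": seq[i + 1] if i < len(seq) - 1 else "</s>",
        # Hypothetical extra features: stress, syllable position, duration, ...
    }

def to_features(sequences):
    return [[phoneme_features(s, i) for i in range(len(s))] for s in sequences]

# canonical_seqs: output of the grapheme-to-phoneme converter;
# realised_seqs: phonemes labeled in the speech corpus (toy example).
canonical_seqs = [["b", "o~", "Z", "u", "R"]]
realised_seqs  = [["b", "o~", "Z", "u", "R"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(to_features(canonical_seqs), realised_seqs)
adapted = crf.predict(to_features(canonical_seqs))
```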
International Conference on Acoustics, Speech, and Signal Processing | 2015
Gwénolé Lecorvé; Damien Lolive
Traditional utterance phonetization methods concatenate pronunciations of uncontextualized constituent words. This approach is too weak for some languages, like French, where transitions between words imply pronunciation modifications. Moreover, it makes it difficult to consider global pronunciation strategies, for instance to model a specific speaker or a specific accent. To overcome these problems, this paper presents a new phonetization approach for French that generates pronunciation variants of utterances. This approach offers a statistical and highly adaptive framework by relying on conditional random fields and weighted finite-state transducers. The approach is evaluated on a corpus of isolated words and a corpus of spoken utterances.
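The sketch below gives a toy stand-in for the transducer stage: per-position phoneme posteriors (e.g. from the CRF) are combined with a transition score and the best pronunciation variant is extracted by dynamic programming. All probabilities and the bigram function are invented for illustration; they are not the models of the paper.

```python
import math

def best_variant(lattice, bigram):
    """Viterbi over the pronunciation lattice: for each phoneme p, keep the
    best (neg-log cost, sequence) among partial paths ending in p."""
    paths = {p: (-math.log(pr), [p]) for p, pr in lattice[0].items()}
    for arcs in lattice[1:]:
        new_paths = {}
        for cur, pr in arcs.items():
            new_paths[cur] = min(
                (cost - math.log(bigram(seq[-1], cur)) - math.log(pr), seq + [cur])
                for cost, seq in paths.values())
        paths = new_paths
    return min(paths.values())

# Per-position phoneme posteriors (invented numbers); "" marks a deletion.
lattice = [{"z": 0.6, "s": 0.4},
           {"e": 0.7, "E": 0.3},
           {"t": 0.55, "": 0.45}]

def bigram(prev, cur):
    # Hypothetical transition score standing in for the weighted transducer.
    return 0.2 if (prev, cur) == ("s", "E") else 0.9

cost, phonemes = best_variant(lattice, bigram)
```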
Third International Conference on Statistical Language and Speech Processing (SLSP) | 2015
Raheel Qader; Gwénolé Lecorvé; Damien Lolive; Pascale Sébillot
Pronunciation adaptation consists in predicting pronunciation variants of words and utterances based on their standard pronunciation and a target style. This is a key issue in text-to-speech as those variants bring expressiveness to synthetic speech, especially when considering a spontaneous style. This paper presents a new pronunciation adaptation method which adapts standard pronunciations to the style of individual speakers in a context of spontaneous speech. Its originality and strength are to rely solely on linguistic features and to consider a probabilistic machine learning framework, namely conditional random fields, to produce the adapted pronunciations. Features are first selected in a series of experiments, then combined to produce the final adaptation method. Experiments on the Buckeye conversational English speech corpus show that adapted pronunciations reflect spontaneous speech significantly better than standard ones, and that even better results could be achieved by considering alternative predictions.
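The feature-selection step could look like the greedy forward procedure sketched below: feature groups are added one at a time as long as the cross-validated phoneme error rate improves. The group names are illustrative and cross_val_per() is a hypothetical helper wrapping CRF training and scoring, not code from the paper.

```python
def select_features(all_groups, cross_val_per):
    """Greedy forward selection of feature groups by phoneme error rate (lower is better)."""
    selected, best_per = [], cross_val_per([])   # baseline with no extra features
    improved = True
    while improved and len(selected) < len(all_groups):
        improved = False
        scores = {g: cross_val_per(selected + [g])
                  for g in all_groups if g not in selected}
        group, per = min(scores.items(), key=lambda kv: kv[1])
        if per < best_per:
            selected.append(group)
            best_per, improved = per, True
    return selected, best_per

# Example feature groups (names are illustrative):
groups = ["linguistic", "phonological", "articulatory", "prosodic"]
```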
Text, Speech and Dialogue | 2006
Damien Lolive; Nelly Barbot; Olivier Boëffard
This article describes a new approach to estimate F0 curves using B-spline and spline models characterized by a knot sequence and associated control points. The free parameters of the model are the number of knots and their locations. The free-knot placement, which is an NP-hard problem, is done using a global MLE (maximum likelihood estimation) within a simulated-annealing strategy. Experiments are conducted in a speech processing context on a 7,000-syllable French corpus. We estimate the two competing models for increasing numbers of free parameters. We show that the B-spline model performs slightly better than the spline model in terms of RMS error.
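A minimal sketch of free-knot B-spline fitting with simulated annealing over the interior knot positions is given below, using SciPy's least-squares spline. The cooling schedule, move proposal and error criterion are simple placeholders rather than the paper's exact MLE setup.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def fit_rmse(x, y, knots):
    spline = LSQUnivariateSpline(x, y, knots, k=3)
    return np.sqrt(np.mean((spline(x) - y) ** 2))

def anneal_knots(x, y, n_knots, iters=2000, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = x[0], x[-1]
    knots = np.linspace(lo, hi, n_knots + 2)[1:-1]       # evenly spaced start
    best_err = cur_err = fit_rmse(x, y, knots)
    best_knots = knots.copy()
    eps = 1e-6 * (hi - lo)
    for i in range(iters):
        temp = t0 * (1 - i / iters) + 1e-6               # linear cooling
        cand = np.sort(knots + rng.normal(0, 0.01 * (hi - lo), n_knots))
        cand = np.clip(cand, lo + eps, hi - eps)
        try:
            err = fit_rmse(x, y, cand)
        except ValueError:                               # invalid knot placement
            continue
        if err < cur_err or rng.random() < np.exp((cur_err - err) / temp):
            knots, cur_err = cand, err                   # accept the move
            if err < best_err:
                best_err, best_knots = err, cand.copy()
    return best_knots, best_err

# Toy F0-like curve in Hz on a normalised time axis.
x = np.linspace(0, 1, 200)
y = 120 + 30 * np.sin(6 * x) + np.random.default_rng(1).normal(0, 2, 200)
knots, rmse = anneal_knots(x, y, n_knots=6)
```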
Text, Speech and Dialogue | 2017
Raheel Qader; Gwénolé Lecorvé; Damien Lolive; Marie Tahon; Pascale Sébillot
To bring more expressiveness into text-to-speech systems, this paper presents a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and prosodic features, and in using a probabilistic machine learning framework, namely conditional random fields and phoneme-based n-gram models. Extensive experiments on the Buckeye corpus of English conversational speech demonstrate the effectiveness of the approach through objective and perceptual evaluations.
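The phoneme n-gram component could, for instance, rescore candidate pronunciations produced by the CRF, as in the small sketch below: a bigram model with add-alpha smoothing is trained on realised (spontaneous) pronunciations and interpolated with the CRF scores. The toy sequences, scores and interpolation weights are illustrative assumptions.

```python
from collections import Counter
import math

def train_bigram(sequences, alpha=1.0):
    """Add-alpha smoothed phoneme bigram model; returns a log-probability function."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {p for s in sequences for p in s} | {"<s>", "</s>"}
    for s in sequences:
        padded = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))

    def logprob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + alpha)
                            / (unigrams[a] + alpha * len(vocab)))
                   for a, b in zip(padded[:-1], padded[1:]))
    return logprob

lm = train_bigram([["g", "o~", "n", "@"], ["g", "@", "n", "@"]])        # toy data
nbest = [(["g", "o~", "n", "@"], -1.2), (["g", "@", "n", "@"], -1.5)]   # (variant, CRF score)
best_variant, _ = max(nbest, key=lambda c: 0.7 * c[1] + 0.3 * lm(c[0]))
```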
Conference of the International Speech Communication Association | 2016
David Guennec; Damien Lolive
Unit selection speech synthesis systems generally rely on target and concatenation costs for selecting the best unit sequence. The role of the concatenation cost is to ensure that joining two voice segments will not cause any acoustic artefact. For this task, acoustic distances (MFCC, F0) are typically used, but in many cases this is not enough to prevent concatenation artefacts. Among other strategies, improving corpus coverage by favouring units that naturally lend themselves well to joining (vocalic sandwiches) has proved effective for TTS. In this paper, we investigate whether vocalic sandwiches can be used directly in the unit selection engine when the corpus was not created using that principle. First, the sandwich approach is directly transposed into the unit selection engine with a penalty that greatly favours concatenation on sandwich boundaries. Second, a derived fuzzy version is proposed that relaxes the penalty based on the concatenation cost, with respect to the cost distribution. We show that the sandwich approach, very effective at the corpus creation step, seems to be ineffective when directly transposed into the unit selection engine. However, we observe that the fuzzy approach enhances synthesis quality, especially on sentences with high concatenation costs.
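The two penalty strategies could be sketched as follows: a hard penalty that strongly favours joins on sandwich boundaries, and a fuzzy variant that only penalises joins whose raw cost is already high relative to the corpus cost distribution. The penalty weight, the 95th-percentile threshold and the helper functions are illustrative assumptions, not the paper's exact formulation.

```python
def concat_cost(u, v, acoustic_dist, is_sandwich_boundary,
                penalty=1000.0, fuzzy=False, cost_quantiles=None):
    """Concatenation cost between units u and v with a sandwich penalty.

    acoustic_dist(u, v): raw acoustic distance (e.g. MFCC + F0).
    is_sandwich_boundary(u, v): True if the join falls on a vocalic-sandwich boundary.
    cost_quantiles[p]: p-th percentile of concatenation costs observed on the
    corpus, e.g. numpy.percentile(all_costs, range(101)).
    """
    cost = acoustic_dist(u, v)
    if is_sandwich_boundary(u, v):
        return cost                        # joining here is considered safe
    if not fuzzy:
        return cost + penalty              # hard penalty outside sandwich boundaries
    # Fuzzy version: penalise only costs above the 95th percentile, scaled by
    # how far above the threshold the raw cost lies.
    q95 = cost_quantiles[95]
    if cost <= q95:
        return cost
    return cost + penalty * min(1.0, (cost - q95) / q95)
```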
International Conference on Statistical Language and Speech Processing (SLSP) | 2016
Marie Tahon; Raheel Qader; Gwénolé Lecorvé; Damien Lolive
Text-to-speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. Hence, the TTS quality drops when phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is desired. To solve this problem, the present work aims at automatically adapting generated pronunciations to a given style by training a phoneme-to-phoneme conditional random field (CRF). Precisely, our work investigates (i) the choice of optimal features among acoustic, articulatory, phonological and linguistic ones, and (ii) the selection of a minimal data size to train the CRF. As a case study, adaptation to a TTS-dedicated speech corpus is performed. Cross-validation experiments show that small training corpora can be used without degrading performance much. Apart from improving TTS quality, these results bring interesting perspectives for more complex adaptation scenarios towards expressive speech synthesis.
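The training-size study could be framed as a simple learning curve, as sketched below: the adaptation CRF is trained on growing fractions of the corpus and the held-out phoneme error rate is recorded until it stops improving. train_and_score() is a hypothetical helper wrapping CRF training and evaluation.

```python
def learning_curve(train_set, dev_set, train_and_score,
                   fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Phoneme error rate on dev_set as a function of training-set size."""
    results = []
    for frac in fractions:
        subset = train_set[: int(frac * len(train_set))]
        per = train_and_score(subset, dev_set)   # hypothetical: returns PER
        results.append((len(subset), per))
    return results
```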
European Signal Processing Conference | 2015
Jonathan Chevelu; Damien Lolive
TTS voice building generally relies on a script extracted from a large text corpus while optimizing the coverage of linguistic and phonological events supposedly related to the acoustic quality of the voice. Previous works have shown differences in objective measures between smartly reduced and random corpora, but not when subjective evaluations are performed. In our view, these results do not show that corpus reduction is useless but rather that the evaluations smooth out the differences. In this article, we highlight those differences in a subjective test by clustering test corpora according to a distance between signals, so as to focus on stimuli that are synthesized differently. The results show that covering appropriate features has a real impact on the perceived quality.
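The coverage-driven script extraction mentioned above is commonly approached as a greedy set-cover problem, as in the short sketch below: at each step, pick the sentence that adds the most not-yet-covered units (diphones, sandwiches, ...) until the budget is reached. The unit extractor and budget are illustrative, not the paper's reduction algorithm.

```python
def greedy_cover(sentences, units_of, budget):
    """Greedy coverage-based script selection.

    sentences: candidate sentences from the large text corpus.
    units_of(s): set of phonological events (e.g. diphones) found in sentence s.
    budget: maximum number of sentences in the recording script.
    """
    covered, script = set(), []
    remaining = list(sentences)
    while remaining and len(script) < budget:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        if not units_of(best) - covered:      # nothing new left to cover
            break
        script.append(best)
        covered |= units_of(best)
        remaining.remove(best)
    return script, covered
```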
IEEE Journal of Selected Topics in Signal Processing | 2010
Damien Lolive; Nelly Barbot; Olivier Boëffard
In the speech processing field, the stylization of the fundamental frequency F0 has been the subject of numerous works. Models proposed in the literature rely on knowledge stemming from phonology and linguistics. We propose an approach that deals with the issue of F0 curve stylization requiring as few linguistic assumptions as possible, within the framework of B-spline models. A B-spline model, characterized by a sequence of knots with which control points are associated, enables the formalization of discontinuities in the derivatives of the observed value sequence. Beyond the implementation of a B-spline model to stylize an open curve sampled with a constant step, we address the problem of choosing the optimal model order. We propose to use a parsimony criterion based on a minimum description length (MDL) approach in order to optimize the number of knots. We derive several criteria relying on bounds estimated from parameter values and demonstrate the optimality of these choices in the theoretical MDL framework. We introduce a notion of variable parameter precision which enables a good compromise between the modeling precision and the degrees of freedom of the estimated models. Experiments are performed on a French speech corpus and compare three MDL criteria. The combination of the B-spline model and the MDL methodology enables an efficient modeling of F0 curves and provides an RMS error of around 1 Hz while allowing a relatively high compression rate of about 40%.
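As a rough illustration of MDL-based model-order selection, the sketch below scores each candidate number of knots with a two-part description length combining a residual (data) term and a parameter (model) term, and keeps the minimizer. The paper's criteria are more refined (bounds, variable parameter precision); this simple form and the fit_f0() helper are simplifying assumptions.

```python
import numpy as np

def description_length(y, y_hat, n_params):
    """Two-part MDL score: code length of residuals plus code length of parameters."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    data_bits = 0.5 * n * np.log2(rss / n)
    model_bits = 0.5 * n_params * np.log2(n)
    return data_bits + model_bits

def select_order(x, y, fit_f0, max_knots=20):
    """fit_f0(x, y, k) is assumed to return (stylized curve, number of free
    parameters) for a B-spline model with k interior knots."""
    return min(range(1, max_knots + 1),
               key=lambda k: description_length(y, *fit_f0(x, y, k)))
```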