Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier
Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
LIG, UGA, G-INP, CNRS, INRIA, Grenoble, France
NAVER LABS Europe, France
[email protected], [email protected], [email protected], [email protected]
Abstract
The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown, which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare several test conditions for the next future word: (a) unknown (zero-word), (b) language model predicted, (c) randomly predicted and (d) ground-truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over a random-word lookahead. We confirm these results with a perceptive test.
Index Terms: incremental text-to-speech, prosody, language models
1. Introduction
In incremental text-to-speech synthesis (iTTS), the system starts to output chunks of synthetic audio before the full text input is known [1, 2, 3, 4]. The missing input information often hinders the ability to produce a natural-sounding speech sequence, mostly because prosodic features that will be determined by the future context (i.e. the remaining words in the sentence) have not yet been specified. Fortunately, the future input is not completely random; human language is characterized by several lexical and syntactic patterns, which can be statistically learnt and then predicted.

Recent advances in language modelling techniques, namely the use of transformer models such as BERT [5] and GPT-2 [6], give us accurate representations of the probability distribution of future words. If this information can be mobilized to fill in the missing data for an iTTS system, it may be possible to retain naturalness while minimizing latency.

Recent research in iTTS has focused on end-to-end neural models and has investigated the trade-off between speech quality and synthesis speed by testing the effects of different degrees of lookahead [7, 8, 9, 10, 11]. In [12], a language model was used to predict the next five-word sequence in an iTTS system; ground-truth and pseudo lookahead representations were learnt using a context encoder. Other studies in conventional (i.e. non-incremental) TTS have incorporated language model representations into the training of neural TTS systems and found that they could help speed up training time [13, 14]. In the field of simultaneous translation, where future context is also unknown, [15] hallucinated future words to balance the latency/quality trade-off.

In the present work, similar to [12], we propose an iTTS system that incorporates a language model. Our approach (described in Figure 1), however, only predicts the next word, which allows for finer control and evaluation of the effects of correct and incorrect predictions. We evaluate our system by contrasting different future word contexts: (a) unknown, (b) language model predicted, (c) randomly predicted (a control group) and (d) ground-truth (see Table 1). Differences are measured at the TTS encoder level and from the generated speech signal through a listening test.
2. Method
For each token in our corpus, we prepare different sequences which are used as input to the TTS model, FastSpeech 2 [16].

• x_{1:n} = x_1, x_2, ..., x_n is the sequence of tokens up to n. In the proposed iTTS system, the tokenization policy is to split the sentence on space characters, and synthesis is triggered when a space character is encountered.
• k is the lookahead parameter (the number of future tokens available when synthesizing token x_n).
• s_{n+k} = {x_1, x_2, ..., x_n, x̂_{n+1}, ..., x̂_{n+k}} = {x_{1:n}, x̂_{n+1:n+k}} is the sequence used for the synthesis of token x_n, where for the ground-truth condition (GT) x̂_{n+1:n+k} = x_{n+1:n+k}, for the prediction condition (Pred) x̂_{n+1:n+k} is given by the language model, and for the random condition (Rand) x̂_{n+1:n+k} is random. The random tokens are generated using the procedure outlined in Section 3.
• s_n is the input prompt used to generate language model predictions.
• In practice, near the end of the sequence, we replace n + k with min(n + k, N), where N is the length of the full utterance.

We use the GPT-2 language model for our study. This is an auto-regressive model trained to predict the next word given a sequence of past words (causal language modeling task), based on a Transformer architecture. The original GPT-2 [6] is large (1.5B parameters) and, since our intended use requires fast predictions, we opted to use a smaller version of GPT-2 called "distilled GPT-2" [17] (https://huggingface.co/distilgpt2). This model has been trained to produce the same output probability distribution as the original GPT-2 but using fewer layers/parameters.

Figure 1: Utilizing language model predictions to improve incremental TTS quality while keeping limited latency.
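As an illustration (not the code used in this work), the following minimal Python sketch shows how one-word lookahead candidates x̂_{n+1} could be sampled from the Hugging Face distilgpt2 checkpoint. The function name, sampling budget and retry cap are our own assumptions; the top-30 constraint and the space-prefix filter anticipate the procedure detailed in Section 3.

```python
# Illustrative sketch: sampling one-word lookahead candidates with distilGPT-2.
# Names and parameters are ours, not taken from the paper's implementation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def predict_next_word(prompt, num_samples=5, top_k=30, max_tries=50):
    """Sample candidate next words for the prompt s_n (the words seen so far)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    candidates, tries = [], 0
    while len(candidates) < num_samples and tries < max_tries:
        tries += 1
        with torch.no_grad():
            out = model.generate(
                **inputs,
                max_new_tokens=3,                     # enough sub-word units for one short word
                do_sample=True,
                top_k=top_k,                          # keep only the most likely continuations
                pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
            )
        continuation = tokenizer.decode(out[0, prompt_len:])
        # GPT-2 uses byte-pair encoding: a continuation that does not start with a
        # space would extend the last prompt word (e.g. "previous" -> "previously"),
        # so we resample until the prediction starts a new word.
        words = continuation.strip().split() if continuation.startswith(" ") else []
        if words:
            candidates.append(words[0])
    return candidates

# Example: candidate continuations of the prefix "Do you"
print(predict_next_word("Do you"))
```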
Table 1: Examples of input sequences with unknown, ground-truth, predicted and random future context. In each sequence, the word in bold is the word which is synthesized from the sequence.

Input Type | Lookahead | Input Sequences
Ground Truth | Full sentence, k = N − n | Do you think that you could manage, Tidy?
Unknown (future) | k = 0 words | s^GT_{n+0} = Do, Do you, Do you think, ...
Ground Truth | k = 1 word | s^GT_{n+1} = Do you, Do you think, Do you think that, ...
GPT-2 prediction | k = 1 word | s^Pred_{n+1} = Do they, Do you agree, Do you think this, ...
Random | k = 1 word | s^Rand_{n+1} = Do dance, Do you until, Do you think art, ...
TTS model. For TTS we select a fast and high-quality end-to-end model: FastSpeech 2. The implementation we use [18] (https://github.com/espnet/espnet), trained on the LJ Speech Dataset [19], takes characters as input and converts them to phonemes. Phoneme embeddings are passed through several self-attention layers before the model makes duration, pitch and energy predictions for each phoneme. These predictions are then passed to the decoder (more self-attention layers), which produces a Mel-spectrogram. The Mel-spectrogram is then input into a Parallel WaveGAN vocoder [20] for waveform generation. This model is well suited to iTTS because (1) it is fast, which is desirable when the objective is to reduce latency (the speed is achieved by predicting all Mel-spectrogram frames in parallel), and (2) it makes explicit duration predictions for each phoneme, which makes it possible to segment words and only synthesize the word(s) of interest.

We implement an incremental synthesis procedure where each token x_n is synthesized from the input sequence s_{n+k}. Mel-spectrogram frames corresponding to individual tokens are identified using the internal duration predictions made by FastSpeech 2. Successive word-level Mel-spectrograms are input into the Parallel WaveGAN vocoder on a word-by-word basis. The resulting waveforms are concatenated together using a 1-ms crossfade to eliminate glitches (audio samples in supplementary materials).
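To illustrate this last step, here is a small sketch (our own, not the exact implementation) of the crossfade-based concatenation, assuming the word-level waveforms are available as NumPy arrays at the 22.05 kHz sampling rate of LJ Speech; the vocoder call itself is omitted.

```python
# Sketch: concatenating successive word-level waveforms with a short linear
# crossfade to avoid glitches at the word junctions (1 ms overlap, as above).
import numpy as np

def crossfade_concat(waveforms, sr=22050, fade_ms=1.0):
    """Concatenate word-level waveforms with a linear crossfade of fade_ms milliseconds."""
    n_fade = max(1, int(sr * fade_ms / 1000.0))   # ~22 samples at 22.05 kHz
    fade_in = np.linspace(0.0, 1.0, n_fade)
    fade_out = 1.0 - fade_in
    out = waveforms[0].astype(np.float64)
    for w in waveforms[1:]:
        w = w.astype(np.float64)
        # Overlap-add the tail of the running signal with the head of the next word.
        head, tail = out[:-n_fade], out[-n_fade:]
        blended = tail * fade_out + w[:n_fade] * fade_in
        out = np.concatenate([head, blended, w[n_fade:]])
    return out

# Usage: word_waves would come from the vocoder, one array per word.
# full_utterance = crossfade_concat(word_waves, sr=22050)
```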
3. Experiments
The English corpus we use for analysis consists of 1,000 sentences from LibriTTS [21]. Sentence length ranges from … to … words, and there are a total of 16,965 tokens and 62,556 phonemes.

For each token x_n in the corpus, we sampled five GPT-2 and five random next-word predictions (x̂_{n+1}). The GPT-2 predictions are constrained to the 30 most likely next words. The random words were selected from a list of 1,266 of the most common words in English [22]. Importantly, we force GPT-2 predictions and random predictions to have comparable lengths in terms of characters/phonemes because (1) GPT-2 tends to predict shorter words because they are more frequent, (2) [8] found that longer future words have more influence on the current token's internal representation (in a seq-to-seq model) than shorter ones, and (3) otherwise, our results may be biased by the fact that the random condition simply has more future context. To control for word length in the random condition, we (1) took the word-length distribution of the GPT-2 predictions, (2) randomly sampled a word-length category from this distribution (e.g. 2–4 characters), (3) limited our most-common-word list to only words in this category, and (4) randomly sampled a word from this list using a uniform distribution (a sketch of this procedure is given below).

GPT-2 uses byte pair encoding, which breaks words down into subword units to better handle out-of-vocabulary tokens. As such, some of its predictions extend the final prompt word rather than predicting a new token (e.g. previous → previously). To avoid such distortions to our input text, we sample until the first character in the predicted text is a space. This also prevents erroneous punctuation marks from being predicted.

We aim at evaluating the prosody obtained in the different test conditions: no context (k = 0), ground-truth context (GT), predicted context (Pred) and random context (Rand). To this end, we compare the pitch, duration and energy values produced in those conditions with the values produced in the reference condition (Ref), where the full context (full sentence input) is used. In the present paper, we concentrate on the case k = 1 (one-word lookahead).

Duration and energy are first computed at the phoneme level, using the FastSpeech 2 internal predictions (see Figure 2 for a plot of duration values from an example sentence). A phoneme duration is defined as (the log of) the number of Mel-spectrogram frames of that phoneme. The energy is the squared magnitude of the short-time Fourier transform (STFT), averaged across all frequency bins and across the duration of the phoneme.

Figure 2: Duration prediction from FastSpeech 2 (in number of Mel-spectrogram frames on a log scale) for each phoneme in the sentence "Are there bears up here?" and for the different tested prediction conditions.
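The sketch below (illustrative only; the length bins, list names and helper functions are our own assumptions) spells out the four-step length-matched sampling procedure used for the random condition.

```python
# Sketch of length-matched random-word sampling: draw random words whose
# length distribution matches that of the GPT-2 predictions.
import random
from collections import Counter

def sample_random_words(gpt2_predictions, common_words,
                        bins=((1, 2), (3, 4), (5, 7), (8, 20))):
    def bin_of(word):
        for lo, hi in bins:
            if lo <= len(word) <= hi:
                return (lo, hi)
        return bins[-1]

    # 1) word-length distribution of the GPT-2 predictions
    length_dist = Counter(bin_of(w) for w in gpt2_predictions)
    categories, weights = zip(*length_dist.items())

    samples = []
    for _ in gpt2_predictions:
        # 2) sample a word-length category from that distribution
        lo, hi = random.choices(categories, weights=weights, k=1)[0]
        # 3) restrict the common-word list to words in that category ...
        pool = [w for w in common_words if lo <= len(w) <= hi]
        if not pool:                     # fallback if the category is empty
            pool = list(common_words)
        # 4) ... and sample uniformly from it
        samples.append(random.choice(pool))
    return samples

# Example with toy inputs:
# sample_random_words(["think", "agree", "this"], most_common_1266_words)
```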
Table 2: MAE (and standard deviation across phonemes) between the duration (resp. energy) obtained with full context and with limited context. *Unit: number of Mel-spectrogram frames on a log scale. **Arbitrary unit: the signal is digital, normalized and averaged.

Input type | Duration* | Energy**
k = 0 | ·.·· ± 0.30 | 0.·· ± ·.··
GT k = 1 | ·.·· ± 0.13 | 0.·· ± ·.··
Pred k = 1 (×5) | ·.·· ± 0.20 | 0.·· ± ·.··
Rand k = 1 (×5) | ·.·· ± 0.21 | 0.·· ± ·.··
Correct (38,274) | ·.·· ± 0.13 | 0.·· ± ·.··
Incorrect (274,506) | ·.·· ± 0.20 | 0.·· ± ·.··
The mean absolute error (MAE) is then computed by averaging the absolute value of the difference between the duration values obtained in each test condition and in the reference condition, across all phonemes of the dataset, and the same is done for the energy feature. The results are reported in Table 2.

Pitch is evaluated at the sentence level. We first align the Mel-spectrograms obtained in the test and reference conditions with Dynamic Time Warping, using the Librosa library [23]. Then we extract the pitch curves from the concatenated audio (see Section 2) using Praat/Parselmouth [24, 25], and we compute the MAE in cents between the aligned f0 trajectories:

MAE = \frac{1200}{T} \sum_{t=1}^{T} \left| \log_2\!\left( f_0^{\mathrm{Test}}(t) / f_0^{\mathrm{Ref}}(t) \right) \right|   (1)

The sentence-level MAEs are then averaged across all sentences of the dataset. The results are reported in Table 3. (We did not evaluate the error in the internal FastSpeech 2 pitch predictions because we observed a few extreme prediction values which did not materialize in the resulting audio.)
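For illustration, the following sketch computes a pitch MAE in the spirit of Eq. (1) with the libraries cited above (Librosa for DTW, Parselmouth for pitch extraction). For simplicity it aligns the f0 curves directly rather than the Mel-spectrograms and skips unvoiced frames, so it is an approximation of the procedure described above rather than a faithful reproduction.

```python
# Sketch: MAE in cents between aligned f0 trajectories of a test and a
# reference waveform (both NumPy arrays at the same sampling rate).
import numpy as np
import librosa
import parselmouth

def f0_curve(wav, sr):
    """f0 trajectory in Hz (0 for unvoiced frames), via Praat/Parselmouth."""
    snd = parselmouth.Sound(wav, sampling_frequency=sr)
    return snd.to_pitch().selected_array["frequency"]

def pitch_mae_cents(wav_test, wav_ref, sr=22050):
    f_test, f_ref = f0_curve(wav_test, sr), f0_curve(wav_ref, sr)
    # DTW alignment between the two (1 x T) feature sequences.
    _, path = librosa.sequence.dtw(X=f_test[np.newaxis, :], Y=f_ref[np.newaxis, :])
    errors = []
    for i, j in path:
        if f_test[i] > 0 and f_ref[j] > 0:   # keep voiced-voiced frame pairs only
            errors.append(abs(1200.0 * np.log2(f_test[i] / f_ref[j])))
    return float(np.mean(errors))

# Usage: pitch_mae_cents(test_waveform, reference_waveform, sr=22050)
```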
Table 3: MAE between the pitch curves obtained with the full context and with limited context (none, true, predicted or random).

Input type | Pitch MAE (cents)
k = 0 | ·.·· ± ·.··
GT k = 1 | ·.·· ± ·.··
Pred k = 1 | ·.·· ± ·.··
Rand k = 1 | ·.·· ± ·.··

Perceptive test. Finally, we evaluate the global quality using 40 native English-speaking human evaluators and a MUSHRA test [26]. We selected 20 sentences from our corpus and, for each sentence, we presented the listeners with a reference audio clip (generated with the full sentence context) and then asked them to assign a similarity score to five test clips: the hidden reference (identical to the reference and used as the MUSHRA high anchor), k = 0 (used as the low anchor), ground-truth k = 1, GPT-2 prediction k = 1 and random prediction k = 1. We then compare the distributions of the similarity scores. The responses from four of the participants were removed because these listeners consistently failed to assign a high similarity score to the high anchor. See Figure 3 for results.
4. Discussion
For all metrics, with regard to the mean, we see a clear ranking in the similarity to the full-sentence reference: k = 0 is farthest away, GT k = 1 is the closest, and Pred and Rand are in between, the former being slightly closer to full context than the latter. Statistical tests (a t-test for the pitch, duration and energy measures and a Wilcoxon test for the listening test) confirmed that Pred and Rand do not belong to the same distribution and that Pred is better by a small but significant margin.

We notice that duration predictions for k = 0 are almost always longer than in the other conditions (Figure 2). We also observe pitch drops for k = 0 words. This is because all words are interpreted as the end of a sentence (as they are the final word in the FastSpeech 2 input, hence sentence-final characteristics are predicted by the model). Both the prediction and the random conditions reduce this effect thanks to the additional padding words.

When we separate the correct from the incorrect GPT-2 next-word predictions (Table 2, Correct and Incorrect), we see that the MAE for the incorrect predictions is almost identical to the MAE for the random condition. This suggests that the improved syntactic accuracy gained from the GPT-2 predictions (the POS of the predicted token matches that of the GT next token 43.5% of the time vs. 18.0% for random) does not translate into improved prosodic features. Since we only see improvement when the exact next word is predicted, it is clear that the minor difference between GPT-2 and random is explained by the low exact-word prediction rate. We observe that 76% of the GPT-2 sequences have a prediction rate lower than 10%, and 97% have a rate lower than 21% (mean: GPT-2 = 6.8%; random = 0.09%). It is likely that as language models continue to improve [27], we will see greater gains in naturalness from the proposed method.
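As an indication of how such a comparison can be run with SciPy (the exact pairing of samples and test variants are not detailed above, so the paired set-up below is an assumption), a minimal sketch:

```python
# Sketch: paired t-test on per-item MAE values (Pred vs. Rand) and a Wilcoxon
# signed-rank test on paired MUSHRA similarity scores. Variable names are ours.
from scipy import stats

def compare_conditions(mae_pred, mae_rand, mushra_pred, mushra_rand):
    # Objective features: paired t-test over items measured in both conditions.
    t_stat, p_t = stats.ttest_rel(mae_pred, mae_rand)
    # Listening test: Wilcoxon signed-rank test over paired similarity scores.
    w_stat, p_w = stats.wilcoxon(mushra_pred, mushra_rand)
    return {"t-test p": p_t, "wilcoxon p": p_w}
```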
5. Conclusion
The results from all metrics show that language-model-predicted text does improve prosody when compared to the k = 0 condition. Slight improvements over the random-text condition are also observed. In future work, we propose to extract more abstract information from language models (e.g. predictions of the distance to the phrase and sentence end) and to condition iTTS models on these predictions.

Figure 3: Violin plots of the distribution of similarity scores between signals generated with full context and signals generated with limited context (none, true, predicted or random) for the 20 sentences in the MUSHRA test. The middle bars show the mean value.
6. References

[1] T. Baumann and D. Schlangen, "Evaluating prosodic processing for incremental speech synthesis," in Proc. of Interspeech, Portland, OR, USA, 2012, pp. 438–441.
[2] T. Baumann, "Partial representations improve the prosody of incremental speech synthesis," in Proc. of Interspeech, Singapore, 2014, pp. 2932–2936.
[3] M. Pouget, T. Hueber, G. Bailly, and T. Baumann, "HMM training strategy for incremental speech synthesis," in Proc. of Interspeech, Dresden, Germany, 2015, pp. 1201–1205.
[4] M. Pouget, O. Nahorna, T. Hueber, and G. Bailly, "Adaptive latency for part-of-speech tagging in incremental text-to-speech synthesis," in Proc. of Interspeech, San Francisco, CA, USA, 2016, pp. 2846–2850.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. of NAACL-HLT, Minneapolis, MN, USA, 2019, pp. 4171–4186.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, 2019.
[7] T. Yanagita, S. Sakti, and S. Nakamura, "Neural iTTS: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework," in Proc. of SSW, Vienna, Austria, 2019, pp. 183–188.
[8] B. Stephenson, L. Besacier, L. Girin, and T. Hueber, "What the future brings: Investigating the impact of lookahead for incremental neural TTS," in Proc. of Interspeech, Shanghai, China, 2020.
[9] M. Ma, B. Zheng, K. Liu, R. Zheng, H. Liu, K. Peng, K. Church, and L. Huang, "Incremental text-to-speech synthesis with prefix-to-prefix framework," in Findings of ACL: EMNLP 2020.
[10] …, in Proc. of Interspeech, Shanghai, China, 2020, pp. 3186–3190.
[11] K. Sudoh, T. Kano, S. Novitasari, T. Yanagita, S. Sakti, and S. Nakamura, "Simultaneous speech-to-speech translation system with neural incremental ASR, MT, and TTS," arXiv preprint arXiv:2011.04845, 2020.
[12] T. Saeki, S. Takamichi, and H. Saruwatari, "Incremental text-to-speech synthesis using pseudo lookahead with large pretrained language model," arXiv preprint arXiv:2012.12612, 2020.
[13] W. Fang, Y.-A. Chung, and J. Glass, "Towards transfer learning for end-to-end speech synthesis from deep pre-trained language models," arXiv preprint arXiv:1906.07307, 2019.
[14] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, "Semi-supervised training for improving data efficiency in end-to-end speech synthesis," in Proc. of ICASSP, Brighton, United Kingdom, 2019, pp. 6940–6944.
[15] R. Zheng, M. Ma, B. Zheng, and L. Huang, "Speculative beam search for simultaneous translation," in Proc. of EMNLP-IJCNLP, Hong Kong, China, 2019.
[16] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, and colleagues, "Transformers: State-of-the-art natural language processing," in Proc. of EMNLP: System Demonstrations, 2020.
[18] …, in Proc. of ICASSP, Barcelona, Spain, 2020, pp. 7654–7658.
[19] K. Ito, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[20] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. of ICASSP, Barcelona, Spain, 2020, pp. 6199–6203.
[21] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. of Interspeech, Graz, Austria, 2019, pp. 1526–1530.
[22] R. Speer, J. Chin, A. Lin, S. Jewett, and L. Nathan, "Luminosoinsight/wordfreq: v2.2 [Computer software]," Oct. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1443582
[23] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proc. of the Python in Science Conference, 2015.
[24] P. Boersma and D. Weenink, "Praat: Doing phonetics by computer [Computer program]."
[25] Y. Jadoul, B. Thompson, and B. de Boer, "Introducing Parselmouth: A Python interface to Praat," Journal of Phonetics, vol. 71, pp. 1–15, 2018.
[26] ITU-R, "Recommendations BS.1534 and BS.1116: Methods for the subjective assessment of small impairments in audio systems."
[27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, and colleagues, "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.