MIXTURE DENSITY NETWORK FOR PHONE-LEVEL PROSODY MODELLING IN SPEECH SYNTHESIS
Chenpeng Du and Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering,
AI Institute, Shanghai Jiao Tong University, Shanghai, China
{duchenpeng, kai.yu}@sjtu.edu.cn

ABSTRACT
Recent research on both utterance-level and phone-level prosody modelling has successfully improved the voice quality and naturalness of text-to-speech synthesis. However, most of these works model the prosody with a unimodal distribution such as a single Gaussian, which is an oversimplification. In this work, we focus on phone-level prosody modelling, where we introduce a Gaussian mixture model (GMM) based mixture density network. Our experiments on the LJSpeech dataset demonstrate that a GMM can model the phone-level prosody better than a single Gaussian. The subjective evaluations suggest that our method not only significantly improves the prosody diversity of the synthetic speech without the need for manual control, but also achieves better naturalness. We also find that the additional mixture density network has only a very limited influence on inference speed.
Index Terms — Mixture Density Network, Prosody Modelling, Speech Synthesis
1. INTRODUCTION
Neural text-to-speech (TTS) synthesis models with sequence-to-sequence architectures [1, 2, 3] can generate natural-sounding speech. Recently, non-autoregressive TTS models such as FastSpeech [4] and FastSpeech2 [5] have been proposed to achieve fast generation without frame-by-frame synthesis.

Besides the progress in acoustic modelling, prosody modelling has also been widely investigated. Utterance-level prosody modelling in TTS is proposed in [6], in which a global (utterance-level) prosody embedding is extracted from a reference speech for controlling the prosody of the TTS output. [7] factorizes the prosody embedding with several global style tokens (GST). A variational auto-encoder (VAE) is used for prosody modelling in [8], which enables sampling various prosody embeddings from the standard Gaussian prior at inference. In addition to global prosody modelling, fine-grained prosody has also been analyzed in recent works. [9] extracts frame-level prosody information and uses an attention module to align it with each phoneme encoding. [10] directly models phone-level (PL) prosody with a VAE, thus improving the stability compared with [9]. Hierarchical and quantized versions of VAE for PL prosody modelling are also investigated in [11, 12, 13], which improve the interpretability and naturalness of synthetic speech. However, all the prior works on phone-level prosody modelling assume that the prior distribution of prosody embeddings is a single standard Gaussian, which is an oversimplification.

Mapping a phoneme sequence to its corresponding mel-spectrogram is a one-to-many mapping, so it is natural to use a multimodal distribution. In traditional ASR systems, one of the most dominant techniques is HMM-GMM [14, 15, 16], in which the distribution of acoustic features for each HMM state is modelled with a GMM. Similarly, GMMs are also used to model acoustic features in traditional statistical parametric speech synthesis (SPSS) [17, 18], thus improving the voice quality.

Inspired by the works above, in this paper we use a GMM to model the PL prosody, whose parameters are predicted by a mixture density network (MDN) [19]. We use a prosody extractor to extract PL prosody embeddings from ground-truth mel-spectrograms and use a prosody predictor as the MDN to predict the GMM distribution of the embeddings. At inference, the prosody of each phoneme is randomly sampled from the predicted GMM distribution, thus generating speech with diverse prosodies. Our experiments on the LJSpeech [20] dataset demonstrate that a GMM can model the phone-level prosody better than a single Gaussian. The subjective evaluations suggest that our method not only significantly improves the prosody diversity of the synthetic speech without the need for manual control, but also achieves better naturalness. We also find that the additional mixture density network has only a very limited influence on inference speed.

In the rest of this paper, we first review the MDN in Section 2 and introduce the proposed model in Section 3. Section 4 gives the experimental comparison and result analysis, and Section 5 concludes the paper.

Fig. 1. Model architectures: (a) overall architecture based on FastSpeech2, (b) prosody extractor, (c) prosody predictor. "SG" represents the stop gradient operation. "OR" selects the extracted "ground-truth" e in training and the sampled ê at inference. Red lines are used in loss calculation and dashed lines at inference.
2. MIXTURE DENSITY NETWORK
In this section, we briefly review the mixture density network [19], which is defined as the combination of a neural network and a mixture model. We focus on a GMM-based MDN in this work, which predicts the parameters of the GMM distribution: the means µ_i, variances σ_i, and mixture weights α_i. It should be noted that the mixture weights are constrained to sum to 1, which is achieved by applying a Softmax function:

    \alpha_i = \frac{\exp(z_i^{\alpha})}{\sum_{j=1}^{M} \exp(z_j^{\alpha})}    (1)

where M is the number of Gaussian components and z_i^{\alpha} is the corresponding neural network output. The mean and variance of the Gaussian components are given by

    \mu_i = z_i^{\mu}, \qquad \sigma_i = \exp(z_i^{\sigma})    (2)

where z_i^{\mu} and z_i^{\sigma} are the neural network outputs corresponding to the mean and variance of the i-th Gaussian component. Equation (2) constrains σ_i to be positive.

The criterion for training the MDN in this work is the negative log-likelihood of the observation e_k given its input h and e_{k-1}. We detail these variables in Section 3. The loss function can be formulated as

    \mathcal{L}_{\mathrm{MDN}} = -\log p(e_k; h, e_{k-1}) = -\log \left( \sum_{i=1}^{M} \alpha_i \cdot \mathcal{N}(e_k; \mu_i, \sigma_i) \right)    (3)

Therefore, the mixture density network is optimized to predict GMM parameters that maximize the likelihood of e_k.
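For concreteness, Equations (1)–(3) can be evaluated directly on the raw network outputs. The following is a minimal PyTorch sketch, not the authors' implementation, assuming diagonal-covariance Gaussian components over d-dimensional prosody embeddings; the tensor names z_alpha, z_mu and z_sigma mirror z^α, z^µ and z^σ.

```python
import math
import torch
import torch.nn.functional as F

def mdn_nll(z_alpha, z_mu, z_sigma, e_k):
    """Negative log-likelihood of e_k under the GMM parameterized by the raw
    network outputs, following Eqs. (1)-(3) with diagonal covariances.

    z_alpha: (B, M)     raw mixture-weight logits
    z_mu:    (B, M, d)  raw means
    z_sigma: (B, M, d)  raw log standard deviations
    e_k:     (B, d)     observed prosody embedding of the k-th phone
    """
    log_alpha = F.log_softmax(z_alpha, dim=-1)        # Eq. (1) in the log domain
    sigma = z_sigma.exp()                             # Eq. (2): sigma_i = exp(z_i^sigma)
    # log N(e_k; mu_i, sigma_i^2) for each of the M components
    log_prob = -0.5 * (((e_k.unsqueeze(1) - z_mu) / sigma) ** 2
                       + 2.0 * z_sigma
                       + math.log(2.0 * math.pi)).sum(dim=-1)   # (B, M)
    # Eq. (3): -log sum_i alpha_i * N(...), computed stably with logsumexp
    return -torch.logsumexp(log_alpha + log_prob, dim=-1).mean()
```

Using log_softmax and logsumexp keeps the mixture likelihood numerically stable when some mixture weights become very small.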
3. GMM-BASED PHONE-LEVEL PROSODY MODELLING

3.1. Overall architecture
The TTS model in this paper is based on the recently proposed FastSpeech2 [5], where the input phoneme sequence is first converted into a hidden state sequence h by the encoder and then passed through a variance adaptor and a decoder to predict the output mel-spectrogram. Compared with the original FastSpeech [4], FastSpeech2 is optimized to minimize the mean square error (MSE) L_MEL between the predicted and the ground-truth mel-spectrograms, instead of applying teacher-student training. Moreover, the duration target is not extracted from the attention map of an autoregressive teacher model, but from the forced alignment of speech and text. Additionally, [5] conditions the prediction of the mel-spectrogram on variance information such as pitch and energy with a variance adaptor. The adaptor is trained to predict the variance information with an MSE loss L_VAR.

In this work, we introduce a prosody extractor and a prosody predictor, as shown in Figure 1(a), both jointly trained with the FastSpeech2 architecture. Phone-level prosody embeddings e are extracted from the ground-truth mel-spectrogram segments with the prosody extractor, and then projected and added to the hidden state sequence h. The prosody extractor is therefore optimized to extract effective prosody information in e in order to better reconstruct the mel-spectrogram. Similar prior works [10, 11, 12] model the distribution of e with a single Gaussian in a VAE. In this work, we model the distribution of e with a GMM whose parameters are predicted by an MDN. Here, the MDN is the prosody predictor, which takes the hidden state sequence h as input and predicts z^α, z^µ and z^σ for each phoneme. A GRU is included in it to condition the prediction of the current prosody distribution on the previous prosodies. During inference, we autoregressively predict the GMM distributions and sample the prosody embedding ê_k for each phoneme. The sampled embedding sequence ê is then projected and added to the corresponding hidden state sequence h.

The overall architecture is optimized with the loss function

    \mathcal{L} = \beta \cdot \mathcal{L}_{\mathrm{MDN}} + \mathcal{L}_{\mathrm{FastSpeech2}} = \beta \cdot \mathcal{L}_{\mathrm{MDN}} + (\mathcal{L}_{\mathrm{MEL}} + \mathcal{L}_{\mathrm{VAR}})    (4)

where L_MDN is the negative log-likelihood of e defined in Equation (3), L_FastSpeech2 is the loss function of FastSpeech2, which is the sum of the variance prediction loss L_VAR and the mel-spectrogram reconstruction loss L_MEL as described in [5], and β is the relative weight between the two terms. It should be noted that we apply a stop gradient operation on e when calculating L_MDN, so the prosody extractor is not optimized by L_MDN directly.
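As a rough illustration of how Equation (4) and the stop gradient on e fit together in a training step, consider the hypothetical fragment below; mdn_nll is the sketch from Section 2, while l_mel, l_var and the MDN outputs are assumed to be produced elsewhere in the training loop.

```python
# Hypothetical training-step fragment illustrating Eq. (4); beta = 0.02 (see Section 4.1).
beta = 0.02
# Stop gradient on the extracted embeddings e: the prosody extractor is trained
# only through the mel-spectrogram reconstruction, not through L_MDN.
l_mdn = mdn_nll(z_alpha, z_mu, z_sigma, e.detach())
loss = beta * l_mdn + (l_mel + l_var)   # L = beta * L_MDN + L_FastSpeech2
loss.backward()
```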
3.2. Prosody extractor

The detailed architecture of the prosody extractor is shown in Figure 1(b). It contains 2 layers of 2D convolution with a kernel size of 3 × 3, each followed by a batch normalization layer and a ReLU activation function. A bidirectional GRU with a hidden size of 32 follows these modules. The concatenation of the final forward and backward states of the GRU layer is the output of the prosody extractor, which is referred to as the prosody embedding of the phoneme.
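A minimal PyTorch sketch of such an extractor is shown below. The 3 × 3 kernels and the bidirectional GRU with hidden size 32 follow the description above; the convolution channel count and the mel dimension are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ProsodyExtractor(nn.Module):
    """Sketch of the prosody extractor of Fig. 1(b): two 3x3 Conv2d + BatchNorm +
    ReLU blocks followed by a bidirectional GRU with hidden size 32."""

    def __init__(self, n_mels=80, channels=32, gru_hidden=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.gru = nn.GRU(channels * n_mels, gru_hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, mel_segment):
        # mel_segment: (B, T, n_mels), the ground-truth frames aligned to one phoneme
        x = self.convs(mel_segment.unsqueeze(1))    # (B, C, T, n_mels)
        x = x.transpose(1, 2).flatten(2)            # (B, T, C * n_mels)
        _, h_n = self.gru(x)                        # h_n: (2, B, gru_hidden)
        # concatenate the final forward and backward states -> prosody embedding
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (B, 2 * gru_hidden)
```

Taking only the final forward and backward GRU states yields one fixed-size embedding per phoneme regardless of how many frames the phoneme spans.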
3.3. Prosody predictor

Figure 1(c) shows the detailed architecture of the prosody predictor. The hidden state sequence h is passed through 2 layers of 1D convolution with a kernel size of 3, each followed by a ReLU activation, layer normalization and dropout. The output of these layers is then concatenated with the previous prosody embedding e_{k-1} and fed to a GRU with a hidden size of 384. The GRU output is finally projected to obtain z^α, z^µ and z^σ.
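A corresponding sketch of the predictor is given below. The two 1D convolutions, the GRU hidden size of 384 and the three output projections follow the description above, while the encoder dimension, the prosody embedding dimension and the dropout rate are assumptions. The GRU is written as a GRUCell so that the autoregressive step over phones is explicit.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Sketch of the prosody predictor (MDN) of Fig. 1(c)."""

    def __init__(self, d_hidden=256, d_prosody=64, n_mix=10,
                 gru_hidden=384, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.gru_cell = nn.GRUCell(d_hidden + d_prosody, gru_hidden)
        # projections of the GRU state to z_alpha, z_mu, z_sigma (M components)
        self.to_alpha = nn.Linear(gru_hidden, n_mix)
        self.to_mu = nn.Linear(gru_hidden, n_mix * d_prosody)
        self.to_sigma = nn.Linear(gru_hidden, n_mix * d_prosody)

    def encode(self, h):
        # h: (B, K, d_hidden) phoneme-level hidden states from the encoder
        x = self.conv1(h.transpose(1, 2)).transpose(1, 2)
        x = self.dropout(self.norm1(self.relu(x)))
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = self.dropout(self.norm2(self.relu(x)))
        return x                                      # (B, K, d_hidden)

    def step(self, x_k, e_prev, state=None):
        # one autoregressive step: GMM parameters for the k-th phone,
        # conditioned on the previous prosody embedding e_{k-1}
        state = self.gru_cell(torch.cat([x_k, e_prev], dim=-1), state)
        return self.to_alpha(state), self.to_mu(state), self.to_sigma(state), state
```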
4. EXPERIMENTS AND RESULTS

4.1. Experimental setup
LJSpeech [20] is a single-speaker English dataset containing about 24 hours of speech and 13,100 utterances. We select 50 utterances for validation, another 50 utterances for testing, and use the remaining utterances for training. The speech is resampled to 16 kHz for simplicity. Before training the TTS models, we compute the phoneme alignment of the training data with an HMM-GMM ASR model trained on LibriSpeech [21], and then extract the duration of each phoneme from the alignment for FastSpeech2 training.

All the FastSpeech2-based TTS models in this work take a phoneme sequence as input and the corresponding 320-dimensional mel-spectrogram as output. The frame shift is set to 12.5 ms and the frame length to 50 ms. The β in Equation (4) is set to 0.02. WaveNet [22] is used as the vocoder to reconstruct the waveform from the mel-spectrogram.

4.2. Reconstruction performance

In this section, we verify whether using the extracted PL prosody embeddings e is better than using a global VAE [8] for reconstruction. In the global VAE system, a 256-dimensional global prosody embedding is sampled from the VAE latent posterior for each utterance, and then broadcast and added to the encoder output of FastSpeech2 for reconstructing the mel-spectrogram. In our PL model, the number of Gaussian components in the prosody predictor is 10 and the extracted e is used as described in Section 3.1. The mel-cepstral distortion (MCD) [23] on the test set is computed with an open-source tool (https://github.com/MattShannon/mcd) to measure the distance between the reconstructed speech and the ground-truth speech. The results are shown in Table 1, where a lower MCD is better. We can see that using the extracted phone-level prosody e improves the reconstruction performance.

Table 1. Reconstruction performance on the test set
Prosody information    MCD
Global                 5.16
PL                     3.64
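For reference, the MCD used in Table 1 averages a weighted Euclidean distance between mel-cepstral coefficient vectors [23]. Below is a minimal NumPy sketch of the standard formula, not the open-source tool cited above, assuming the two sequences are already time-aligned and the 0th (energy) coefficient has been excluded.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstrum sequences of
    shape (T, D); coefficient 0 (energy) is assumed to be excluded already."""
    diff = mcep_ref - mcep_syn                                            # (T, D)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```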
4.3. The number of Gaussian components

In this section, we try to figure out how many Gaussian components are needed to model the distribution of the extracted e. We plot the log-likelihood curves on both the training set and the validation set in Figure 2 for several different numbers of Gaussian components. It can be observed that the gap between the training and validation curves for the single Gaussian is larger than that for the GMMs. Moreover, increasing the number of components provides a higher log-likelihood, thus improving the PL prosody modelling. Therefore, we use 10 components in all the following GMM experiments.

Fig. 2. Log-likelihood curves of the extracted "ground-truth" PL prosody embeddings e with different numbers of Gaussian components.

4.4. Subjective evaluations

We perform subjective evaluations on three FastSpeech2-based TTS systems with different prosody modelling: 1) Global, the global VAE as described in Section 4.2; 2) PL1, PL prosody modelling with a single Gaussian; 3) PL10, PL prosody modelling with 10 Gaussian components. In order to provide better voice quality in the synthetic speech, we scale the predicted standard deviations of the Gaussians by a factor of 0.2 when sampling, following [12].
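A minimal sketch of this inference-time sampling is shown below, assuming the raw MDN outputs from Section 2: a component is first drawn according to the mixture weights, and the embedding is then sampled from that Gaussian with its standard deviation scaled by 0.2.

```python
import torch
import torch.nn.functional as F

def sample_prosody(z_alpha, z_mu, z_sigma, std_scale=0.2):
    """Draw one prosody embedding per batch item from the predicted GMM.

    z_alpha: (B, M), z_mu / z_sigma: (B, M, d) raw prosody-predictor outputs.
    """
    alpha = F.softmax(z_alpha, dim=-1)                       # mixture weights, Eq. (1)
    comp = torch.multinomial(alpha, num_samples=1)           # (B, 1) chosen component
    idx = comp.unsqueeze(-1).expand(-1, -1, z_mu.size(-1))   # (B, 1, d)
    mu = torch.gather(z_mu, 1, idx).squeeze(1)               # mean of chosen component
    sigma = torch.gather(z_sigma, 1, idx).squeeze(1).exp()   # std, Eq. (2)
    return mu + std_scale * sigma * torch.randn_like(mu)     # sample from N(mu, (0.2*sigma)^2)
```

Scaling only the standard deviation trades some diversity for stability while keeping the component means, and hence the multimodal structure, unchanged.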
We synthesize the speech of the test set 3 times for each utterance with different sampled prosodies ê. We perform AB preference tests in which two groups of synthetic speech from two different TTS models are presented, and 20 listeners are asked to select the better one in terms of prosody diversity. The results in Figure 3 show that PL10 provides better prosody diversity in the synthetic speech than both PL1 and the global VAE.
Fig. 3. AB preference test in terms of prosody diversity.

Audio examples are available at https://cpdu.github.io/gmm_prosody_examples.

We also evaluate the naturalness of the synthetic speech with a Mean Opinion Score (MOS) test, in which the listeners are asked to rate each utterance on a 5-point scale. The speech converted back from the ground-truth mel-spectrogram with the WaveNet vocoder is also rated and presented as "ground-truth". The results are reported in Table 2. Similar to the observation in [12], autoregressively sampling PL prosody from a single Gaussian sometimes generates very unnatural speech, leading to a lower MOS for PL1. We find that the naturalness of PL10 is better than that of PL1, which demonstrates that GMM can better model the PL prosody than a single Gaussian. The global VAE system also achieves good naturalness, very close to the result of PL10.
Table 2. Evaluation of TTS systems in terms of naturalness and inference speed. The confidence interval of the MOS is 95%.
Prosody Modelling    MOS       Time Cost
Ground-truth         4.54 ±    –
Global               3.95 ±    ×
PL1                  3.22 ±    ×
PL10                 4.05 ±    ×

4.5. Inference speed

FastSpeech2 is proposed as a non-autoregressive TTS model to avoid frame-by-frame generation and speed up inference. In this work, we only autoregressively predict the distributions of the PL prosody embeddings, in order to retain the fast inference speed. We evaluate all our systems on the test set with an Intel Xeon Gold 6240 CPU. As shown in Table 2, the time cost of the proposed model is only 1.11 times more than the baseline. Therefore, autoregressive PL prosody prediction has very limited influence on inference speed.
5. CONCLUSION
In this work, we have proposed a novel approach that uses a GMM-based mixture density network to model the phone-level prosody, denoted as e. Our experiments first show that the extracted e provides effective information for reconstruction and outperforms a global VAE. We then find that the log-likelihood of e increases when more Gaussian components are used, indicating that GMM can better model the PL prosody than a single Gaussian. Subjective evaluations suggest that our method not only significantly improves the prosody diversity of the synthetic speech without the need for manual control, but also achieves better naturalness. We also find that the additional mixture density network has only a very limited influence on inference speed.

6. REFERENCES

[1] Y. Wang, R. J. Skerry-Ryan, D. Stanton et al., "Tacotron: Towards end-to-end speech synthesis," in Proc. ISCA Interspeech, 2017, pp. 4006–4010.

[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE ICASSP, 2018, pp. 4779–4783.

[3] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Close to human quality TTS with transformer," arXiv preprint arXiv:1809.08895, 2018.

[4] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Proc. NIPS, 2019, pp. 3165–3174.

[5] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text-to-speech," arXiv preprint arXiv:2006.04558, 2020.

[6] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in Proc. ICML, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 4700–4709.

[7] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. ICML, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 5167–5176.

[8] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, "Expressive speech synthesis via modeling expressions with variational autoencoder," in Proc. ISCA Interspeech, 2018, pp. 3067–3071.

[9] Y. Lee and T. Kim, "Robust and fine-grained prosody control of end-to-end speech synthesis," in Proc. IEEE ICASSP, 2019, pp. 5911–5915.

[10] V. Klimkov, S. Ronanki, J. Rohnke, and T. Drugman, "Fine-grained robust prosody transfer for single-speaker neural text-to-speech," in Proc. ISCA Interspeech, 2019, pp. 4440–4444.

[11] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, "Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis," in Proc. IEEE ICASSP, 2020, pp. 6264–6268.

[12] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu, "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior," in Proc. IEEE ICASSP, 2020, pp. 6699–6703.

[13] Y. Hono, K. Tsuboi, K. Sawada, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Hierarchical multi-grained generative model for expressive speech synthesis," in Proc. ISCA Interspeech, 2020, pp. 3441–3445.

[14] P. C. Woodland, C. Leggetter, J. Odell, V. Valtchev, and S. Young, "The development of the 1994 HTK large vocabulary speech recognition system," in Proceedings of the ARPA Workshop on Spoken Language Systems Technology, 1995, pp. 104–109.

[15] J. Gauvain and C. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, 1994.

[16] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.

[17] H. Zen and A. W. Senior, "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," in Proc. IEEE ICASSP, 2014, pp. 3844–3848.

[18] X. Wang, S. Takaki, and J. Yamagishi, "An autoregressive recurrent mixture density network for parametric speech synthesis," in Proc. IEEE ICASSP, 2017, pp. 4895–4899.

[19] C. M. Bishop, "Mixture density networks," Aston University, 1994.

[20] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.

[22] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 2016, p. 125.

[23] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1993, pp. 125–128.