Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari
The University of Tokyo, Japan [kentaro mitsui, tomoki koriyama, hiroshi saruwatari]@ipc.i.u-tokyo.ac.jp
Abstract
Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and is thus robust to overfitting. In this framework, speaker information is fed to the duration/acoustic models using speaker codes. We also examine the use of deep Gaussian process latent variable models (DGPLVMs). In this approach, the representation of each speaker is learned simultaneously with the other model parameters, so the similarity or dissimilarity of speakers is considered efficiently. We experimentally evaluated two situations to investigate the effectiveness of the proposed methods. In one situation, the amount of data from each speaker is balanced (speaker-balanced), and in the other, the data from certain speakers are limited (speaker-imbalanced). Subjective and objective evaluation results showed that both the DGP and DGPLVM synthesize multi-speaker speech more effectively than a DNN in the speaker-balanced situation. We also found that the DGPLVM significantly outperforms the DGP in the speaker-imbalanced situation.
Index Terms: deep Gaussian process, statistical speech synthesis, multi-speaker modeling, latent variable model
1. Introduction
With the development of machine learning in recent years, text-to-speech (TTS) synthesis has a greater variety of applications than ever before. Recent studies have shown that multi-speaker modeling, a technique that models the voices of multiple speakers with a single model, is effective for synthesizing multiple speakers' voices. Multi-speaker modeling can benefit from multi-task learning [1], which means this technique requires less training data to achieve high-quality speech synthesis.

Statistical parametric speech synthesis (SPSS) is one possible method for multi-speaker speech synthesis. Hidden Markov model (HMM)-based methods such as the average voice model [2] were widely used until the emergence of deep neural network (DNN)-based speech synthesis [3]. For multi-speaker modeling in DNN-based speech synthesis, Fan et al. introduced a shared hidden-layer structure, which shares the hidden-layer parameters of a DNN among different speakers, and reported that this structure improved the quality of synthetic speech relative to speaker-dependent DNNs [4]. Another successful approach to multi-speaker modeling is based on speaker codes, which represent speakers in a form such as a one-hot vector or a randomly assigned vector. Luong et al. investigated the optimal form for speaker codes [5]. The method proposed by Hojo et al. outperformed the shared hidden-layer structure by feeding one-hot speaker codes to the hidden layers of a DNN [6]. Speaker representations have also recently been applied to end-to-end speech synthesis frameworks, achieving high speech quality [7, 8]. However, most DNN-based methods only consider data fitting during training, and thus overfitting often becomes a problem.

In this paper, we focus on the SPSS framework using deep Gaussian processes (DGPs) [9]. In this framework, the relationship between linguistic features and phoneme durations or acoustic features is modeled using DGPs [10]. A DGP is a deep architecture of Bayesian kernel regressions, so it can express complicated non-linear transformations with a small number of hyperparameters. Both data fitting and model complexity are considered in the training of a DGP, which makes the model less vulnerable to overfitting than a DNN. Previous work has shown that DGP-based TTS performs better than a feed-forward DNN for single-speaker modeling [9]. However, the DGP's effectiveness for multi-speaker TTS is yet to be verified.

Therefore, we propose multi-speaker TTS based on DGPs. We introduce two methods: one using a general DGP and feeding one-hot speaker codes to its hidden layers, similarly to the DNN-based method [6]; and the other learning latent representations of speakers using deep Gaussian process latent variable models (DGPLVMs) [10]. The second method incorporates a GPLVM [11], a Bayesian generative model shown to be effective in prosody modeling [12], into the general DGP to obtain speaker representations. The difference between the DGP and DGPLVM lies in the representation of speaker similarity used for kernel regression. A DGPLVM can explicitly express this similarity using the latent representation, whereas the speaker codes used in a general DGP cannot. In addition, the use of a DGPLVM enables an analysis of speakers in the latent space.

In the experimental evaluations, we investigate the performance of our methods in speaker-balanced and speaker-imbalanced situations.
In the speaker-imbalanced situation, we first selected target speakers and used limited data from those speakers during training. We conducted objective and subjective evaluations in both situations to evaluate the effectiveness of the proposed methods. Experimental results showed that in the speaker-balanced situation, both proposed methods improved the speech quality relative to the DNN-based method; and in the speaker-imbalanced situation, where only five training utterances were used for the target speakers, the DGPLVM improved the naturalness and speaker similarity of synthetic speech.
2. Conventional methods
2.1. DNN-based multi-speaker TTS using speaker codes

We give an overview of DNN-based multi-speaker TTS using speaker codes [6], a simple yet highly effective method within the SPSS framework. Single-speaker models use only contextual factors as the inputs of the duration/acoustic models, but this method uses speaker codes as auxiliary inputs to model speaker variation. Here, the speaker code $\mathbf{S}$ is a one-hot vector representation of the speaker ID. We apply a linear transformation to this vector and add the result to the hidden layers:

$$\mathbf{h}^{\ell+1} = \phi\left(\mathbf{W}^{\ell+1}\left(\mathbf{h}^{\ell} + \mathbf{W}_{S}^{\ell}\mathbf{S}\right) + \mathbf{b}^{\ell+1}\right) \tag{1}$$

where $\phi(\cdot)$ is an activation function, $\mathbf{h}^{\ell}$ is the output of the $\ell$-th hidden layer, $\mathbf{W}^{\ell}$ and $\mathbf{W}_{S}^{\ell}$ are the connection weights of the hidden layers and speaker codes, respectively, and $\mathbf{b}^{\ell}$ is the bias. Training is conducted by minimizing the mean squared error between the natural and generated acoustic features.

2.2. DGP-based speech synthesis

In the DGP-based speech synthesis framework [9], a DGP model takes linguistic features as inputs and predicts phoneme durations or acoustic features. A DGP is a model defined as a cascade of Gaussian process regressions (GPRs). GPRs model the relation between input $\mathbf{x}$ and output $y$ as:

$$y = f(\mathbf{x}) + \epsilon \tag{2}$$
$$f \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right) \tag{3}$$

and infer the posterior distribution $p(y^{*} \mid \mathbf{x}^{*}, \mathbf{X}, \mathbf{y})$ for a new input $\mathbf{x}^{*}$ by using the training data $(\mathbf{X}, \mathbf{y})$. Here $\epsilon$ is random noise, and $m(\mathbf{x})$ and $k(\mathbf{x}, \mathbf{x}')$ are the mean and kernel functions, respectively. We consider multiple GPRs when the output is multidimensional.

Although a single GPR can represent complicated non-linear functions, its expressiveness is limited by the kernel function. A DGP overcomes this limitation by stacking multiple GPRs; this method is based on the assumption that the overall function $f$ can be decomposed into multiple functions in the following manner:

$$f = f^{L+1} \circ f^{L} \circ \cdots \circ f^{1} \tag{4}$$

where $L$ is the number of hidden layers, and each function $f^{\ell}$ is a sample of a Gaussian process. An approximation technique called doubly stochastic variational inference [13] is used in this framework, so training is conducted by maximizing the evidence lower bound (ELBO) of the log marginal likelihood:

$$\log p(\mathbf{Y}) \geq \frac{1}{N_s} \sum_{j=1}^{N_s} \sum_{i=1}^{N} \sum_{d=1}^{D_{L+1}} \mathbb{E}_{q(f_{i,j}^{d})}\left[\log p\left(y_{i}^{d} \mid f_{i,j}^{d}\right)\right] - \sum_{\ell=1}^{L+1} \mathrm{KL}\left[q(\mathbf{U}^{\ell}) \,\middle\|\, p(\mathbf{U}^{\ell} \mid \mathbf{Z}^{\ell})\right] \triangleq \mathcal{L}_1 \tag{5}$$

where $N$ and $N_s$ are the numbers of training data and Monte Carlo samples, respectively, and $D_{\ell}$ is the dimensionality of the output of the $\ell$-th GPR. $y_{i}^{d}$ is the $d$-th dimension of the $i$-th observed output $\mathbf{y}_i$, and $f_{i,j}^{d}$ represents the corresponding latent function value predicted from the $j$-th sample point. $\mathbf{Z}^{\ell}$ and $\mathbf{U}^{\ell}$ denote the inducing inputs and outputs, respectively, which are sparse representations of the input and output data. While $\mathbf{Z}^{\ell}$ is a model parameter by itself, $\mathbf{U}^{\ell}$ is not a parameter but a random variable, on which we impose $q(\mathbf{U}^{\ell}) = \prod_{d=1}^{D_{\ell}} q(\mathbf{u}^{\ell,d}) = \prod_{d=1}^{D_{\ell}} \mathcal{N}(\mathbf{u}^{\ell,d}; \mathbf{m}^{\ell,d}, \mathbf{S}^{\ell,d})$, and we regard the mean $\mathbf{m}^{\ell,d}$ and variance $\mathbf{S}^{\ell,d}$ as model parameters for each layer $\ell$ and dimension $d$.
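For concreteness, the following is a minimal NumPy sketch of the GPR prediction in Eqs. (2)-(3), assuming a zero mean function and a squared-exponential kernel for illustration (the paper itself uses ArcCos kernels, and the DGP framework replaces this exact inference with the sparse variational approximation behind Eq. (5)); all names here are ours, not the authors' implementation.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel k(x, x') between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gpr_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean/variance of p(y* | x*, X, y) for a zero-mean GP, Eqs. (2)-(3)."""
    K = rbf(X, X) + noise * np.eye(len(X))   # K(X, X) + noise
    Ks = rbf(X_star, X)                      # K(X*, X)
    mean = Ks @ np.linalg.solve(K, y)        # K* K^{-1} y
    var = (rbf(X_star, X_star).diagonal() + noise
           - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T)))
    return mean, var

# toy 1-D usage
X = np.linspace(0, 1, 10)[:, None]
y = np.sin(4 * X[:, 0])
mu, var = gpr_posterior(X, y, np.array([[0.5]]))
```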
3. DGP-based multi-speaker TTS using speaker codes
We introduce the model architecture shown in Fig. 1 to apply the DGP-based speech synthesis framework [9] to multi-speaker TTS. In this architecture, speaker IDs are represented using one-hot speaker codes in a manner similar to the DNN-based method described in Section 2.1. We apply a single-layer GPR to these speaker codes before feeding them to the hidden layers. Therefore, the values of the $\ell$-th hidden layer $\mathbf{h}^{\ell}$ can be written as:

$$\mathbf{h}^{\ell} = f^{\ell}(\mathbf{h}^{\ell-1}) + f_{S}^{\ell}(\mathbf{S}) \tag{6}$$

where $\mathbf{S}$ denotes the speaker code, $f^{\ell}$ is the $\ell$-th GPR in the DGP (hereinafter called the hidden GP), and $f_{S}^{\ell}$ is the $\ell$-th GPR that transforms the speaker codes (hereinafter called the speaker GP). Like the hidden GPs, the speaker GPs have inducing inputs $\mathbf{Z}_{S}^{\ell}$ and corresponding outputs $\mathbf{U}_{S}^{\ell}$, so we must optimize these parameters jointly with the other model parameters. This can be done by maximizing the new ELBO:

$$\mathcal{L}_2 = \mathcal{L}_1 - \sum_{\ell=1}^{L} \mathrm{KL}\left[q(\mathbf{U}_{S}^{\ell}) \,\middle\|\, p(\mathbf{U}_{S}^{\ell} \mid \mathbf{Z}_{S}^{\ell})\right]. \tag{7}$$

Figure 1: Architecture of DGP-based acoustic model for multi-speaker TTS with three hidden layers.
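As a schematic illustration of the layer update in Eq. (6), the sketch below reduces each GP to its sparse posterior mean $k(\cdot, \mathbf{Z})\mathbf{K}_{ZZ}^{-1}\mathbf{M}$ rather than a random draw, and substitutes an RBF kernel for the paper's ArcCos kernel; all dimensions and names are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def sparse_gp_mean(X, Z, M, jitter=1e-6):
    """Sparse-GP posterior mean k(X, Z) K_ZZ^{-1} M, with inducing inputs Z and mean outputs M."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    return rbf(X, Z) @ np.linalg.solve(Kzz, M)

def dgp_layer(h_prev, S, Z_h, M_h, Z_s, M_s):
    """Eq. (6): h^l = f^l(h^{l-1}) + f_S^l(S), a hidden GP plus a speaker GP."""
    return sparse_gp_mean(h_prev, Z_h, M_h) + sparse_gp_mean(S, Z_s, M_s)

# toy usage: 5 frames, 16-dim input layer, 4 speakers, 3/8 inducing points
rng = np.random.default_rng(0)
h = dgp_layer(rng.standard_normal((5, 16)),            # previous hidden layer
              np.eye(4)[rng.integers(0, 4, size=5)],   # one-hot speaker codes per frame
              rng.standard_normal((3, 16)), rng.standard_normal((3, 32)),
              rng.standard_normal((8, 4)), rng.standard_normal((8, 32)))
```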
4. DGPLVM for multi-speaker TTS
In this section, we propose another approach for multi-speaker TTS using a DGPLVM [10]. The DGP-based approach illustrated in the previous section is straightforward, but because one-hot speaker codes are orthogonal to each other between speakers, we cannot fully make use of the similarity or dissimilarity of speakers. In the DGPLVM-based approach, we aim to utilize speaker similarity for multi-speaker TTS.

We express $K$ speakers by using latent variables $\mathbf{R} = (\mathbf{r}_1, \ldots, \mathbf{r}_K)$, and use the latent variable as an input of the function $f^{\ell}$ as follows:

$$f^{\ell} \sim \mathcal{GP}\left(m(\mathbf{x}, \mathbf{r}_k), k\left([\mathbf{x}^{\top}, \mathbf{r}_k^{\top}]^{\top}, [\mathbf{x}'^{\top}, \mathbf{r}_{k'}^{\top}]^{\top}\right)\right). \tag{8}$$

From Bayes' theorem, the distribution of $\mathbf{r}_k$ conditioned on input $\mathbf{x}$ and output $\mathbf{y}$ can be written as:

$$p(\mathbf{r}_k \mid \mathbf{x}, \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x}, \mathbf{r}_k)\, p(\mathbf{r}_k). \tag{9}$$

When we consider acoustic modeling, the left-hand side of (9) is conditioned not only on the linguistic feature $\mathbf{x}$ but also on the acoustic feature $\mathbf{y}$. Since the kernel function uses the latent variable $\mathbf{r}_k$ as input, $\mathbf{r}_k$ is learned to express the similarity of acoustic features among different speakers. We assign a prior given by the standard normal distribution to $\mathbf{r}_k$:

$$p(\mathbf{r}_k) = \mathcal{N}(\mathbf{r}_k; \mathbf{0}, \mathbf{I}). \tag{10}$$

Also, we consider the latent variable for the $k$-th speaker $\mathbf{r}_k$ to have a variational distribution

$$q(\mathbf{r}_k) = \mathcal{N}(\mathbf{r}_k; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{11}$$

where $\boldsymbol{\mu}_k$ is a mean vector and $\boldsymbol{\Sigma}_k$ is a diagonal covariance matrix. This latent variable is fed to an arbitrary hidden layer of the DGP. In this case the ELBO of $\log \int p(\mathbf{Y} \mid \mathbf{R})\, p(\mathbf{R})\, d\mathbf{R}$ is written as:

$$\mathcal{L}_3 = \mathcal{L}_1 - \sum_{k=1}^{K} \mathrm{KL}\left[q(\mathbf{r}_k) \,\middle\|\, p(\mathbf{r}_k)\right]. \tag{12}$$

Figure 2: Architecture of DGPLVM-based acoustic model for multi-speaker TTS with three hidden layers.
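The per-speaker penalty summed in Eq. (12) has a closed form because both $q(\mathbf{r}_k)$ (Eq. (11), diagonal covariance) and the prior (Eq. (10)) are Gaussian. Below is a small sketch, with hypothetical names, of this KL term and of forming the augmented kernel input $[\mathbf{x}^{\top}, \mathbf{r}_k^{\top}]^{\top}$ of Eq. (8) from a sample of $q(\mathbf{r}_k)$.

```python
import numpy as np

def kl_q_r_to_prior(mu_k, sigma2_k):
    """Closed-form KL[ q(r_k) || p(r_k) ] for a diagonal Gaussian q against N(0, I):
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2), the per-speaker term in Eq. (12)."""
    return 0.5 * np.sum(sigma2_k + mu_k ** 2 - 1.0 - np.log(sigma2_k))

def kernel_input(x, mu_k, sigma2_k, rng):
    """Augmented kernel input [x^T, r_k^T]^T of Eq. (8), with r_k sampled from q(r_k)."""
    r_k = mu_k + np.sqrt(sigma2_k) * rng.standard_normal(mu_k.shape)
    return np.concatenate([x, r_k])

# toy usage: a 3-dimensional latent speaker variable
rng = np.random.default_rng(0)
mu, s2 = np.zeros(3), 0.1 * np.ones(3)
penalty = kl_q_r_to_prior(mu, s2)
xr = kernel_input(np.ones(8), mu, s2, rng)
```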
5. Experiments
5.1. Experimental conditions

We used the JVS corpus [14], which is comprised of speech data from 100 Japanese speakers, 49 male and 51 female. Speech waveforms were downsampled to 16 kHz. This corpus contains 100 parallel utterances (parallel100) and 30 non-parallel utterances (nonpara30) from each speaker. For the speaker-balanced situation, the training set consisted of all the non-parallel utterances and 85 of the 100 parallel utterances from each speaker, and the test set consisted of the remaining 15 parallel utterances from each speaker. For the speaker-imbalanced situation, four speakers, two male and two female, were selected as target speakers; for these speakers, only five non-parallel utterances were used in training. To avoid low speech quality for the target speakers, we used an oversampling technique [15] and sampled each utterance of each target speaker 20 times. The target speakers were selected on the basis of subjective speaker similarity [16]. Specifically, we defined the speaker who had the largest median similarity score with the other speakers, in other words the one with many similar speakers, as male/female similar (MS/FS), and the opposite ones as male/female dissimilar (MD/FD). The test set consisted of 15 parallel utterances from the four target speakers.

The input linguistic features of the duration model were 531-dimensional vectors containing contextual factors such as phoneme, accent, and part of speech, which were automatically estimated from the texts using Open JTalk [17]. We added a four-dimensional frame index to these linguistic features and used them as the input of the acoustic model. The output of the duration model was a one-dimensional phoneme duration. The acoustic features, i.e., the output of the acoustic model, were 187-dimensional vectors comprised of the 0th-59th mel-cepstra, log $f_o$, coded aperiodicity, and their $\Delta$ and $\Delta^2$, followed by voiced/unvoiced flags. These acoustic features were extracted every 5 ms using WORLD [18] (D4C edition [19]). We normalized the input features to the range $[0.0, 1.0]$ and the output features to zero mean and unit variance.

The DGP duration model had 2 hidden layers, with the dimensionality of each layer set to 32. The acoustic model had 5 hidden layers, and the dimensionality of each layer was 128. The number of inducing points was set to 1024 for the hidden GPs and 8 for the speaker GPs. We used the ArcCos kernel [20] as the kernel function of the GPs. The inducing inputs of each GP were initialized randomly from the standard normal distribution. The variational distributions of the inducing outputs $q(\mathbf{u}^{\ell,d})$ of all GPs except the last hidden GP $f^{L+1}$ were initialized with a Gaussian distribution with zero mean and variance $10^{-}$, while that of $f^{L+1}$ had unit variance.

The DGPLVM had settings similar to those of the DGP model. However, it does not have speaker GPs, and thus the total number of model parameters was reduced. The variational distributions of the latent variables $q(\mathbf{r}_k)$ were initialized randomly with a Gaussian distribution with zero mean and variance $10^{-}$.

We trained the models by mini-batch optimization with the batch size set to 1024, using Adam [21] with a learning rate of 0.01. For the conventional DNN model, we followed the previous work [6] and set the numbers of hidden layers to 2 and 5 for the duration and acoustic models, respectively, the number of hidden units to 1024, and the learning rate of Adam to $10^{-}$. Training was conducted for up to 50 epochs for the DGP/DGPLVM and 100 epochs for the DNN.
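For reference, the sketch below implements what we take the ArcCos kernel [20] above to be, namely the order-1 arc-cosine kernel of Cho and Saul; the choice of order and the absence of extra scaling are assumptions on our part.

```python
import numpy as np

def arccos_kernel(x, y, eps=1e-12):
    """Order-1 arc-cosine kernel (Cho & Saul, 2009):
    k(x, y) = (1/pi) * |x| * |y| * (sin(t) + (pi - t) * cos(t)),
    where t is the angle between x and y."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny + eps), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (nx * ny / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))
```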
5.2. Objective evaluation

We compared the quality of synthetic speech in terms of distortions between the original and synthetic speech parameters. As evaluation metrics, we used the root mean squared error (RMSE) of phoneme durations (DUR) for the duration models, and the mel-cepstral distance (MCD) and the RMSE of log $f_o$ (F0) for the acoustic models.

We first focused on the speaker-balanced situation and investigated the effect of the model architecture on the performance of acoustic modeling. For the DGP, we fed the speaker code $\mathbf{S}$ to a certain layer (the first, second, third, fourth, or fifth layer) or to all hidden layers of the acoustic model. In the same way, for the DGPLVM, we fed the latent speaker variable $\mathbf{r}_k$ to different layers. Here the dimensionality of $\mathbf{r}_k$ was set to three. The results are shown in Fig. 3. Although feeding speaker information only to the last hidden layer increased the acoustic distortion, the differences among the other settings were relatively small. In the following experiments, we adopted the all-layers setting for both the DGP and DGPLVM.

Figure 3: Objective evaluation results for DGP and DGPLVM with different layers to feed speaker information.

Next, we investigated the performance of the DGPLVM with different dimensionalities of $\mathbf{r}_k$. We set the dimensionality of $\mathbf{r}_k$ to 2, 3, 16, and 64. The results are shown in Table 1. While higher dimensionality led to smaller distortions in the speaker-balanced situation, the results in the speaker-imbalanced situation were the opposite; lower dimensionality led to better results, and a dimensionality of three was optimal. This is possibly because the latent speaker space becomes dense with a low-dimensional speaker representation, and the voice models of similar speakers are efficiently accounted for when synthesizing the target speaker's voice. We set the dimensionality of $\mathbf{r}_k$ to 64 for the speaker-balanced situation and to 3 for the speaker-imbalanced situation in the following experiments.

Table 1: Objective evaluation results for DGPLVM with different dimensionality of the latent speaker variable $\mathbf{r}_k$. MCD: mel-cepstral distance [dB], F0: RMSE of log $f_o$ [cent].

                  Speaker-balanced     Speaker-imbalanced
  Dimensionality  MCD      F0          MCD      F0
  2               5.72     235         6.24     280
  3               5.71     236         –        –
  16              –        233         6.28     285
  64              –        –           –        –

Finally, we compared the performance of the conventional DNN, the proposed DGP, and the DGPLVM; the results are shown in Table 2. In the speaker-balanced situation, all models yielded similar MCD, while the proposed DGP/DGPLVM showed better F0 and DUR than the DNN. In the speaker-imbalanced situation, the DNN was the best in terms of MCD and the DGPLVM was the best in terms of F0 and DUR.

Table 2: Comparison of DNN, DGP, and DGPLVM in terms of MCD: mel-cepstral distance [dB], F0: RMSE of log $f_o$ [cent], and DUR: RMSE of phoneme duration [ms].

           Speaker-balanced          Speaker-imbalanced
  Method   MCD     F0      DUR       MCD     F0      DUR
  DNN      5.66    239     25.6      –       271     28.0
  DGP      5.66    –       –         –       264     27.6
  DGPLVM   –       –       –         –       –       –
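The metrics above follow common definitions; here is a sketch of MCD and of F0 RMSE in cents, assuming the 0th mel-cepstral coefficient is excluded from MCD and that F0 errors are evaluated over frames voiced in both signals (both conventions are assumptions, not necessarily the paper's exact scripts).

```python
import numpy as np

def mel_cepstral_distance(c_ref, c_syn):
    """Frame-averaged MCD [dB]: (10/ln10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    over (frames, coeffs) arrays with the 0th (power) coefficient excluded."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    return (10.0 / np.log(10)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def f0_rmse_cents(f0_ref, f0_syn):
    """RMSE of log f0 in cents, over frames voiced (f0 > 0) in both signals."""
    v = (f0_ref > 0) & (f0_syn > 0)
    return np.sqrt(np.mean((1200.0 * np.log2(f0_ref[v] / f0_syn[v])) ** 2))
```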
5.3. Subjective evaluation

We conducted listening tests to subjectively evaluate the speech quality in terms of naturalness and speaker similarity. The naturalness of synthetic speech was evaluated by a preference A/B test, and speaker similarity was evaluated by an XAB test. We compared two pairs, DNN-DGP and DGP-DGPLVM, in the speaker-balanced/imbalanced situations. Thirty crowdsourced listeners participated in each of the evaluations, and each listener evaluated ten speech samples. The original speech of the target speaker was used as the reference X in the XAB tests.

The results are shown in Figs. 4 and 5. In the speaker-balanced situation, the scores of both naturalness and speaker similarity were higher for all speakers for the DGP than for the DNN. Although both scores of FS were lower for the DGPLVM than for the DGP due to duration errors, the scores of the remaining three speakers were comparable between the DGP and DGPLVM. Collating these results with those of the objective evaluation, $f_o$ seems to have the greatest effect on naturalness and speaker similarity.

In the speaker-imbalanced situation, there was no significant difference between the DNN and DGP in total, though we observed larger acoustic feature distortions for the DGP in the objective evaluation. The naturalness of the DGPLVM for MS and FS was significantly higher than that of the DGP. In addition, the speaker similarity of those speakers was slightly higher than that of the other speakers in the DGPLVM. From these results, we infer that the DGPLVM can beneficially utilize similar speakers through the learned latent speaker representation.

The latent speaker representation after training the DGPLVM is shown in Fig. 6. Here, the dimensionality of $\mathbf{r}_k$ is set to two for ease of visualization. We found that male and female speakers were clearly separated, that the similar speakers (MS: 022 and FS: 063) were embedded inside their cluster while the dissimilar speakers (MD: 006 and FD: 010) were embedded outside, and that speakers embedded closely in the speaker-balanced situation were also embedded closely in the speaker-imbalanced situation. These results indicate that the learned latent speaker representation expresses the similarity or dissimilarity of speakers as expected.

Synthetic speech samples are available at https://kentaro321.github.io/demo_DGP_MS_TTS/.

Figure 4: Subjective evaluation results with 95% confidence intervals in the speaker-balanced situation.

Figure 5: Subjective evaluation results with 95% confidence intervals in the speaker-imbalanced situation.

Figure 6: Latent speaker representation learned by DGPLVM in (a) the speaker-balanced situation and (b) the speaker-imbalanced situation. Red and blue numbers indicate female and male speakers, respectively. Orange and black circles indicate the similar and dissimilar speakers, respectively.
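The paper does not state how the 95% confidence intervals in Figs. 4 and 5 were computed; below is a minimal sketch under the assumption of a normal approximation to the binomial preference score.

```python
import numpy as np

def preference_ci(n_wins, n_total, z=1.96):
    """Normal-approximation 95% CI for a preference score p = wins / total."""
    p = n_wins / n_total
    half = z * np.sqrt(p * (1.0 - p) / n_total)
    return p - half, p + half

# e.g. 30 listeners x 10 samples = 300 judgments per pair
lo, hi = preference_ci(180, 300)
```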
6. Conclusions
We have proposed multi-speaker TTS based on the DGP. We found that with one-hot speaker codes, the use of the DGP can improve the naturalness and speaker similarity of multi-speaker speech relative to the DNN. We also introduced the DGPLVM-based multi-speaker TTS framework, in which the speaker representation is treated as a latent variable and jointly learned with the other model parameters. The experimental results showed that the DGPLVM-based approach is especially effective when the amount of training data from a certain speaker is highly limited. For future work, we will compare our DGPLVM-based method with other latent-space-based methods such as the variational autoencoder [22]. We also plan to compare the performance of the proposed methods with recent end-to-end approaches.
7. Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP19K20292.

8. References

[1] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[2] J. Yamagishi and T. Kobayashi, "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training," IEICE Transactions on Information and Systems, vol. 90, no. 2, pp. 533-543, 2007.
[3] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, Vancouver, Canada, May 2013, pp. 7962-7966.
[4] Y. Fan, Y. Qian, F. Soong, and L. He, "Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis," in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 4475-4479.
[5] H.-T. Luong, S. Takaki, G. Henter, and J. Yamagishi, "Adapting and controlling DNN-based speech synthesis using input codes," in Proc. ICASSP, New Orleans, U.S.A., Mar. 2017, pp. 4905-4909.
[6] N. Hojo, Y. Ijima, and H. Mizuno, "DNN-based speech synthesis using speaker codes," IEICE Transactions on Information and Systems, vol. 101, no. 2, pp. 462-472, 2018.
[7] W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," in Proc. ICLR, Vancouver, Canada, May 2018.
[8] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Moreno, Y. Wu et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Proc. NIPS, Montreal, Canada, Dec. 2018, pp. 4480-4490.
[9] T. Koriyama and T. Kobayashi, "Statistical parametric speech synthesis using deep Gaussian processes," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 5, pp. 948-959, 2019.
[10] A. Damianou and N. Lawrence, "Deep Gaussian processes," in Proc. AISTATS, Scottsdale, U.S.A., Apr. 2013, pp. 207-215.
[11] M. Titsias and N. Lawrence, "Bayesian Gaussian process latent variable model," in Proc. AISTATS, Sardinia, Italy, May 2010, pp. 844-851.
[12] T. Koriyama and T. Kobayashi, "Semi-supervised prosody modeling using deep Gaussian process latent variable model," in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 4450-4454.
[13] H. Salimbeni and M. Deisenroth, "Doubly stochastic variational inference for deep Gaussian processes," in Proc. NIPS, California, U.S.A., Dec. 2017, pp. 4588-4599.
[14] S. Takamichi, K. Mitsui, Y. Saito, T. Koriyama, N. Tanji, and H. Saruwatari, "JVS corpus: free Japanese multi-speaker voice corpus," arXiv preprint arXiv:1908.06248, 2019.
[15] H.-T. Luong, X. Wang, J. Yamagishi, and N. Nishizawa, "Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora," in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 1303-1307.
[16] Y. Saito, S. Takamichi, and H. Saruwatari, "DNN-based speaker embedding using subjective inter-speaker similarity for multi-speaker modeling in speech synthesis," in Proc. 10th ISCA Speech Synthesis Workshop, Vienna, Austria, Sep. 2019, pp. 51-56.
[17] Open JTalk, http://open-jtalk.sourceforge.net/.
[18] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[19] M. Morise, "D4C, a band-aperiodicity estimator for high-quality speech synthesis," Speech Communication, vol. 84, pp. 57-65, 2016.
[20] Y. Cho and L. Saul, "Kernel methods for deep learning," in Proc. NIPS, Vancouver, Canada, Dec. 2009, pp. 342-350.
[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, San Diego, U.S.A., May 2015.
[22] D. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. ICLR, Banff, Canada, Apr. 2014.