Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan
Microsoft Development Co., Ltd., Tokyo, Japan
[email protected], {ktsuboi, kesawada}@microsoft.com, {bonanza, uratec, nankaku, tokuda}@sp.nitech.ac.jp

Abstract
This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis to enable fine control of the prosody and speaking style of synthesized speech. However, the naturalness of speech degrades when these latent variables are obtained by sampling from the standard Gaussian prior. To solve this problem, we propose a novel framework for modeling the fine-grained latent variables that takes into account the dependence on the input text, the hierarchical linguistic structure, and the temporal structure of the latent variables. The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level auto-regressive latent converter, which together extract latent variables at different time resolutions and sample finer-level latent variables from coarser-level ones while conditioning on the input text. Experimental results indicate an appropriate method of sampling fine-grained latent variables without a reference signal at the synthesis stage. The proposed framework also provides control over the speaking style of an entire utterance.
Index Terms: speech synthesis, multi-grained VAE, hierarchical modeling, temporal modeling, speaking style
1. Introduction
Deep neural network (DNN)-based approaches have become mainstream in statistical parametric text-to-speech (TTS) synthesis in recent years [1, 2, 3]. A DNN-based acoustic model represents the mapping function between linguistic feature sequences and acoustic feature sequences. Recently, end-to-end neural network-based approaches have also made significant progress [4, 5, 6, 7], and the quality of synthesized speech has been greatly improved. Most data-driven TTS synthesis approaches aim to achieve adequate neutral prosody, so the synthesized speech is less expressive. Although such an averaged voice is acceptable for short assistant-like utterances, listeners find it unpleasant in conversation. Opportunities for humans to interact with computers are increasing due to the spread of practical applications such as intelligent conversational agents and assistants; therefore, interest in expressive and controllable speech synthesis is growing in the speech-synthesis research field.

A simple method of controlling the speaking style of synthesized speech is to use an additional vector, such as an emotion ID, as an input of the acoustic model [8]. Such a conditional vector often comes from corpus annotation, so the variation of the synthesized speech depends on the annotation. Since speaking-style annotation is difficult and often subjective, and the number of classes is limited, the variation and quality of the synthesized speech are inadequate.

Recent approaches learn a latent feature from a reference speech in an unsupervised manner [9, 10]. In these approaches, an additional network, referred to as the reference encoder, extracts a single latent feature for an entire utterance to capture the global speech attributes of each utterance. The latent feature can represent many speech attributes, such as speaking style, prosody, channel characteristics, and noise levels. Variational inference [11] has also been incorporated into TTS synthesis systems [12, 13, 14]. This offers various advantages, for example, the ability to obtain the latent variable by direct sampling from an accompanying prior without a reference signal and to interpolate smoothly in the latent space. However, this utterance-level variational autoencoder (VAE), in which a single latent variable is extracted for each utterance, has a limited ability to capture the prosody or speaking style at a specific moment. A fine-grained latent variable, which is a variable-length sequence such as a word- or phone-level sequence, has been introduced into TTS to support sequential control of prosody and speaking style [15]. A previous study [16] introduced a multi-level fine-grained VAE and a dimension-wise autoregressive (AR) decomposition of the posterior: the original VAE is replaced with a conditional VAE structure that conditions on the projection of the latent variables and extracts latent dimensions one at a time.

The fine-grained VAE naturally enables precise control of the speaking style of synthesized speech. However, speech generated with latent variables sampled from the standard VAE prior is unnatural and discontinuous; for example, the speaking style changes dramatically between units such as words and phones. This is because the fine-grained latent variables are sampled from the Gaussian prior independently for each unit, such as a word or phone, whereas the approximate posterior is derived from a reference speech and has temporal dependence.
A recent study [17] used an AR prior network conditioned on phone embeddings, trained to fit the VAE posterior distribution. Fine-grained latent variables have hierarchical linguistic and temporal dependence, since the speech signal is time-series data with a hierarchical linguistic structure. Furthermore, the latent variables and the text content should be strongly correlated. Thus, fine-grained latent variables should be modeled and sampled by taking such dependence into account.

We propose a hierarchical multi-grained framework for modeling the fine-grained latent variables in an expressive TTS synthesis system. Our framework consists of a multi-grained VAE, a conditional prior, and a multi-level AR latent converter. The three different-resolution VAEs extract the utterance-, phrase-, and word-level latent variables, and the two latent converters represent the relations between the levels of this hierarchy. We also introduce a residual structure for the encoders and the latent converters, and decoder parameter sharing, which help in learning the hierarchical structure of the latent variables. In the synthesis phase, the word-level latent variables are predicted with the latent converters from a coarser-level latent variable that is sampled from the conditional prior, so we can obtain proper word-level latent variables without reference speech. The proposed system also provides controllability: since the word-level latent variables are predicted with the latent converters, we can specify the speaking style as easily as with the utterance-level VAE-based system, while the generated speech is more expressive. This is an advantage of our framework for certain applications.
2. Variational autoencoder-based text-to-speech synthesis
A VAE [11] is a kind of deep generative model for learning a complicated data distribution in an unsupervised manner. We use a conditional VAE, which extracts latent random variables z to capture the variations in the observed dataset X conditioned on the auxiliary features Y. The VAE is optimized with the evidence lower bound (ELBO):

\mathcal{L}(p, q) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X}, \mathbf{Y})}\left[\log p(\mathbf{X} \mid \mathbf{Y}, \mathbf{z})\right] - D_{\mathrm{KL}}\left(q(\mathbf{z} \mid \mathbf{X}, \mathbf{Y}) \,\|\, p(\mathbf{z})\right), \quad (1)

where the first term is the reconstruction loss and the second term is the Kullback-Leibler divergence between the posterior and the prior. Generally, the prior p(z) is chosen to be a centered isotropic multivariate Gaussian N(0, I), and the approximate posterior q(z | X, Y) is a Gaussian N(mu, sigma^2) whose mean mu and variance sigma^2 are the outputs of a neural network called an encoder. During training, z is sampled from the approximate posterior; during inference, z is sampled from the prior p(z).

A VAE is applied to a TTS synthesis system by regarding X and Y as the sequence of acoustic features and the linguistic features, respectively [12, 13, 14]. Note that the experiments in this paper are based on the basic TTS framework of [1, 2] in order to focus on how to extract and model the latent representation for expressive speech synthesis; thus, the acoustic feature sequence and the linguistic feature sequence are time-aligned in advance. The encoder maps a variable-length acoustic feature sequence to two utterance-level fixed vectors, corresponding to the posterior mean and log variance. The decoder works as an acoustic model, predicting acoustic features from linguistic features and latent variables.

We build the encoder and decoder with long short-term memory (LSTM) networks [18]. In the encoder, the acoustic features and the linguistic features are first passed through fully-connected (FC) layers so that their dimensions match. These features are added and then fed into a stack of two bidirectional LSTM layers. A mean pooling layer summarizes the LSTM outputs across time, followed by two separate FC layers with linear activation that predict the mean and log variance of the posterior distribution. The decoder has a similar architecture to the encoder, consisting of two FC layers to merge the linguistic features and latent variables, a stack of two bidirectional LSTM layers, and one FC layer to output the acoustic features.
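To make the encoder structure and the ELBO objective concrete, the following is a minimal PyTorch-style sketch of the utterance-level conditional VAE encoder and loss described above. The hidden size, variable names, and the use of a simple L2 reconstruction term are illustrative assumptions rather than the exact configuration used in the paper; the acoustic, linguistic, and latent dimensions match those reported in Section 4.

```python
import torch
import torch.nn as nn

class UtteranceVAEEncoder(nn.Module):
    """Utterance-level encoder: FC projections -> 2 BiLSTM layers -> mean pooling -> mu/logvar.
    Hidden size is an illustrative assumption, not the paper's exact setting."""
    def __init__(self, acoustic_dim=107, linguistic_dim=718, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.acoustic_fc = nn.Linear(acoustic_dim, hidden_dim)
        self.linguistic_fc = nn.Linear(linguistic_dim, hidden_dim)
        self.blstm = nn.LSTM(hidden_dim, hidden_dim // 2, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.mu_fc = nn.Linear(hidden_dim, latent_dim)
        self.logvar_fc = nn.Linear(hidden_dim, latent_dim)

    def forward(self, acoustic, linguistic):
        # Project both streams to a common dimension and add them frame-wise.
        h = self.acoustic_fc(acoustic) + self.linguistic_fc(linguistic)
        h, _ = self.blstm(h)
        h = h.mean(dim=1)  # mean pooling over time -> one vector per utterance
        return self.mu_fc(h), self.logvar_fc(h)

def elbo_loss(x_pred, x, mu, logvar):
    """Negative ELBO as in Eq. (1): a reconstruction term (here L2) plus the KL to N(0, I)."""
    recon = ((x_pred - x) ** 2).sum()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return recon + kl
```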
Figure 1: Graphical model of the latent variables. The utterance-, phrase-, and word-level latent variables are connected by hierarchical, temporal, and text dependence, and all levels depend on the linguistic features.

The fine-grained VAE is a modified version of the utterance-level VAE described above that obtains variable-length latent variables, such as phone-, word-, and phrase-level latent variable sequences. The only difference between the fine-grained VAE and the utterance-level VAE is the time interval over which the mean pooling layer summarizes the encoder LSTM outputs. The fine-grained VAE can be trained in the same fashion as the utterance-level VAE.
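As a rough illustration of how the fine-grained encoder differs only in its pooling interval, the sketch below pools frame-level encoder outputs over word segments given forced-alignment boundaries. The segment-boundary representation is an assumption introduced for illustration; the paper does not specify the alignment format at this level of detail.

```python
import torch

def segment_mean_pool(encoder_outputs, segment_lengths):
    """Summarize frame-level encoder outputs into one vector per segment (e.g., per word).

    encoder_outputs: (T, hidden_dim) frame-level BiLSTM outputs for one utterance.
    segment_lengths: list of frame counts per word, summing to T (from forced alignment).
    Returns a (num_words, hidden_dim) tensor; utterance-level pooling is the special
    case of a single segment covering the whole utterance.
    """
    pooled = []
    start = 0
    for length in segment_lengths:
        pooled.append(encoder_outputs[start:start + length].mean(dim=0))
        start += length
    return torch.stack(pooled, dim=0)
```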
3. Hierarchical multi-grained generative model for expressive text-to-speech synthesis
The fine-grained VAE can capture the speaking style and prosody at a specific moment, but speech synthesized with latent variables sampled directly from the prior may be unnatural. In this section, we reconsider fine-grained latent variables and propose a novel framework that finely controls the speaking style and synthesizes more expressive speech.
The speaking style can be factorized into multiple temporal-resolution representations, such as utterance-, phrase-, and word-level representations. These representations have a hierarchical linguistic dependency and correlate with the content of the text. The fine-grained representations also have temporal coherency. To incorporate these dependencies explicitly into the modeling of the fine-grained latent variables, we assume the graphical model of the latent variables shown in Fig. 1. Here, z^u is the utterance-level latent variable, and Z^p = (z^p_1, z^p_2, ..., z^p_N) and Z^w = (z^w_1, z^w_2, ..., z^w_M) are the sequences of phrase-level and word-level latent variables, respectively, where N and M are the numbers of phrases and words in an utterance. As shown in Fig. 1, each latent variable depends on the content of the text, the hierarchical linguistic structure, and the temporal structure, and the fine-grained latent variables should be sampled by considering these dependencies.

An overview of the proposed framework is shown in Fig. 2. The framework consists of a multi-grained VAE, a conditional prior, and a multi-level latent converter. The multi-grained VAE is layered into three different time-resolution VAEs (utterance-level, phrase-level, and word-level) to extract the latent representation at each level. The conditional prior is a distribution conditioned on the text, which enables sampling the latent variables while taking the content of the text into account. The multi-level latent converter consists of two latent converters with an AR structure: an utterance-to-phrase and a phrase-to-word latent converter. Each latent converter predicts the finer-level latent variables from the coarser-level latent variables. These converters can also be viewed as conditional priors conditioned not only on the text but also on the coarser-level latent variables.
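As an illustration of how word-level latent variables can be obtained at synthesis time without a reference signal, the following is a minimal sketch of the coarse-to-fine sampling path through the conditional prior and the two latent converters: p(z^u | Y), then p(Z^p | z^u, Y), then p(Z^w | Z^p, Y). The module interfaces, names, and the choice to sample from converter-predicted Gaussians are assumptions introduced for illustration; the training and network details are given later in this section.

```python
import torch

def sample_word_level_latents(cond_prior, utt2phrase, phrase2word,
                              utt_emb, phrase_embs, word_embs):
    """Coarse-to-fine sampling: utterance prior -> phrase converter -> word converter.

    cond_prior, utt2phrase, phrase2word: trained modules returning Gaussian parameters.
    utt_emb, phrase_embs, word_embs: linguistic embeddings produced by the rate converters
    (one vector for the utterance, one per phrase, one per word). All names are illustrative.
    """
    # Utterance level: sample z^u from the text-conditional prior.
    mu_u, logvar_u = cond_prior(utt_emb)
    z_u = mu_u + torch.randn_like(mu_u) * (0.5 * logvar_u).exp()

    # Phrase level: predict phrase-level latents from z^u and the phrase embeddings.
    mu_p, logvar_p = utt2phrase(z_u, phrase_embs)
    z_p = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()

    # Word level: predict word-level latents from Z^p and the word embeddings.
    mu_w, logvar_w = phrase2word(z_p, word_embs)
    z_w = mu_w + torch.randn_like(mu_w) * (0.5 * logvar_w).exp()
    return z_u, z_p, z_w
```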
Figure 2: Overview of the proposed framework. The utterance-, phrase-, and word-level VAEs share decoder parameters, rate converters produce the linguistic embeddings at each level, and the conditional prior and latent converters (with feedback connections) link the levels.

In the proposed framework, the distributions of the conditional prior, the utterance-to-phrase latent converter, and the phrase-to-word latent converter are denoted as p(z^u | Y), p(Z^p | z^u, Y), and p(Z^w | Z^p, Y), respectively. The joint probability of the latent variables can be written as

p(\mathbf{Z}^{w}, \mathbf{Z}^{p}, \mathbf{z}^{u} \mid \mathbf{Y}) = p(\mathbf{Z}^{w} \mid \mathbf{Z}^{p}, \mathbf{Y})\, p(\mathbf{Z}^{p} \mid \mathbf{z}^{u}, \mathbf{Y})\, p(\mathbf{z}^{u} \mid \mathbf{Y}). \quad (2)

Therefore, the proposed framework can properly model the three dependencies of the latent variables shown in Fig. 1. Word-level latent variables are sampled from these distributions in a step-by-step manner, so fine-grained latent variables can be sampled by considering the hierarchical linguistic structure, the temporal coherency, and the text content without any reference signal. In addition, the encoders and the converters use residual connections to model the finer-level latent variables as variations from the coarser-level ones. We also apply parameter sharing to all decoders so that all latent variables can share the same latent space. We found that these choices help the latent converters predict the finer-level latent variables from the coarser-level ones.

The encoders and decoders have the same architecture as described in Sec. 2.2. The conditional prior has three FC layers including the output layer. Each AR latent converter has one bidirectional LSTM layer, one unidirectional LSTM layer with an AR structure that takes the previous output of the latent converter as input, and an FC output layer. The conditional prior and the latent converters take linguistic embeddings of each level; to make these embeddings, we use rate converters consisting of a stack of two bidirectional LSTM layers and a mean pooling layer. We train the framework in two steps. The first step trains all encoders and decoders to learn each latent variable and to predict the acoustic features from the latent variables and linguistic features. The second step trains the conditional prior and the AR latent converters. Scheduled sampling [19] is used for training the AR latent converters.
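The following is a minimal sketch of one AR latent converter as described above: a bidirectional LSTM over the finer-level linguistic embeddings, a unidirectional LSTM that feeds back the previously generated latent variable, and an FC output layer predicting the parameters of the finer-level latents. Layer sizes and the choice to predict Gaussian mean/log-variance pairs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ARLatentConverter(nn.Module):
    """Predicts finer-level latents (e.g., word level) from a coarser-level latent
    and finer-level linguistic embeddings, with autoregressive feedback."""
    def __init__(self, latent_dim=2, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim + latent_dim, hidden_dim // 2,
                             bidirectional=True, batch_first=True)
        # Unidirectional AR LSTM: input is the BiLSTM context plus the previous latent.
        self.ar_lstm = nn.LSTM(hidden_dim + latent_dim, hidden_dim, batch_first=True)
        self.out_fc = nn.Linear(hidden_dim, 2 * latent_dim)  # mean and log-variance

    def forward(self, coarse_latent, fine_embs):
        # coarse_latent: (B, latent_dim); fine_embs: (B, M, emb_dim) for M finer-level units.
        B, M, _ = fine_embs.shape
        coarse = coarse_latent.unsqueeze(1).expand(B, M, -1)
        ctx, _ = self.blstm(torch.cat([fine_embs, coarse], dim=-1))

        latents = []
        prev = torch.zeros_like(coarse_latent)  # feedback of the previous finer-level latent
        state = None
        for t in range(M):  # step through the finer-level units autoregressively
            inp = torch.cat([ctx[:, t], prev], dim=-1).unsqueeze(1)
            h, state = self.ar_lstm(inp, state)
            mu, logvar = self.out_fc(h.squeeze(1)).chunk(2, dim=-1)
            # Residual connection: the converter predicts a variation from the coarser latent.
            z_t = coarse_latent + mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            latents.append(z_t)
            prev = z_t
        return torch.stack(latents, dim=1)  # (B, M, latent_dim)
```

At training time, scheduled sampling would mix ground-truth posterior latents and the converter's own outputs for the feedback input `prev`, in line with [19].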
4. Experiments
In this experiment, a ten-hour Japanese single-female-speaker corpus was used. This corpus contains four speaking styles: normal (13,520 utterances), happy (3,861 utterances), sad (1,716 utterances), and radio (1,816 utterances). Radio style is colloquial speech spoken in a radio program and includes different speaking styles within an utterance. We used 20,073 utterances for training, 417 for validation, and 417 for testing.
Table 1: Objective evaluation of reconstruction performance

Method                                  MCD    GVD    F0ER
Utterance-level VAE (oracle z^u)        5.023  0.541  0.113
Word-level VAE (oracle Z^w)             4.915  0.527  0.065
Multi-grained VAE (predicted Z^w):
  M1, w/o residual & dec. sharing       5.048  0.538  0.117
  M2, w/ residual & dec. sharing        5.021  0.530  0.114

The speech signals were sampled at 48 kHz, and each sample was quantized to 16 bits. Feature vectors were extracted with a 5-ms shift; each feature vector consisted of the log F0 value obtained by voting among three F0 estimators, the 0th through 69th WORLD mel-cepstral coefficients, and the 0th through 34th mel-cepstral analysis aperiodicity measures [20, 21]. Five-state, left-to-right, no-skip hidden semi-Markov models were used to obtain phoneme alignments [22]. The phoneme durations were modeled using a style-dependent mixture density network [23]. The linguistic feature vector is a 718-dimensional vector consisting of a full-context feature vector extracted by an external text analyzer [24] and a duration feature vector including the duration of the current phoneme and the position of the current frame. The acoustic feature vector is a 107-dimensional vector consisting of 70-dimensional WORLD mel-cepstral coefficients, a log F0 value obtained by linear interpolation in unvoiced parts, a voiced/unvoiced binary value, and 35-dimensional mel-cepstral analysis aperiodicity measures. All latent variables in the VAEs are 2-dimensional, so that the values of the latent variables can be specified easily.

We calculated objective scores to measure reconstruction performance: mel-cepstral distortion (MCD) [dB], global variance distance (GVD) for mel-cepstral coefficients [25], and the root mean squared error of log F0 (F0ER) [log Hz]. The utterance-level VAE, the word-level VAE, and two multi-grained VAE systems were compared: M1 denotes a multi-grained model with neither residual connections nor decoder parameter sharing, and M2 denotes a multi-grained model with the residual structure and decoder parameter sharing. In the utterance-level VAE and the word-level VAE, the latent variables are oracle ones extracted from natural acoustic features. In M1 and M2, the utterance-level latent variables are oracle ones, and the other-level latent variables are predicted using the AR latent converters.

Table 1 shows the results of the objective evaluation. Introducing the fine-grained latent variables improves the prediction of the acoustic features, since these latent variables can capture the variations at a specific moment of speech. Although M1 was worse than the utterance-level VAE except in terms of GVD, M2 improved on it and outperformed the utterance-level VAE in terms of MCD and GVD. This result indicates that the residual connections and decoder parameter sharing help predict the finer-level latent variables. However, M2 did not reach the performance of the word-level VAE, indicating that it is still challenging to predict the word-level latent variables.
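As a reference for how such objective scores are typically computed, the sketch below evaluates mel-cepstral distortion and log-F0 RMSE from time-aligned frame-level features. The exact conventions (e.g., excluding the 0th coefficient from MCD, restricting F0 error to frames voiced in both signals) are common practice but are assumptions here, since the paper does not spell them out.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """MCD [dB] between aligned mel-cepstra of shape (T, D); the 0th (energy)
    coefficient is excluded, which is a common convention (assumption)."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def log_f0_rmse(lf0_ref, lf0_syn, voiced_mask):
    """RMSE of log F0 [log Hz], computed only over frames voiced in both signals."""
    d = lf0_ref[voiced_mask] - lf0_syn[voiced_mask]
    return np.sqrt(np.mean(d ** 2))
```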
To evaluate the naturalness and expressiveness of the synthesized speech, we conducted subjective listening tests in which both attributes were assessed with mean opinion score (MOS) tests.

Table 2: MOSs for naturalness and expressiveness

The opinion score in the MOS test for naturalness was based on a five-point scale (5: natural – 1: poor in naturalness). In the MOS test for expressiveness, a different five-point scale was used (5: very expressive – 1: poor in expressiveness), and participants rated each sample considering the content of the text. The participants were 19 Japanese students in our research group, and 10 utterances were chosen at random per method from the test set. For proper evaluation, we excluded in advance audio samples shorter than 1 second, not counting the silence at both ends.

The following six systems were compared:
• FG: The fine-grained VAE-based system. The word-level latent variables were sampled from the standard Gaussian prior at the synthesis stage.
• FG+AR: The system in which the prior in FG is replaced by an AR prior. The AR prior consisted of one unidirectional LSTM layer and an FC output layer.
• FG+CP: The system in which the prior in FG is replaced by the conditional prior. This prior is a word-level conditional prior with the same architecture as the utterance-level conditional prior described in Sec. 3.2.
• FG+CP+AR: The system in which the prior in FG is replaced by a conditional AR prior. This prior is the network obtained by adding a feedback connection to the prior in FG+CP.
• MG+CP: The proposed system without the AR structure in the latent converters. At the synthesis stage, the utterance-level latent variable was sampled from the conditional prior, and the word-level latent variables were predicted by the non-AR latent converters.
• MG+CP+AR: The proposed system in which the latent converters in MG+CP are replaced with the AR latent converters.

Table 2 shows the results of the subjective evaluation. FG+AR had a worse expressiveness score than FG: sampling from the AR prior is unstable because the sampled latent variables are fed back to the prior network as the previous outputs. FG+CP had a slightly better score than FG, but the difference in expressiveness was small. We used the full-context feature vector as the linguistic feature vector, and the conditional prior took this vector as input; this result suggests that it is difficult for the conditional prior to capture the meaning of the text from the full-context features and to sample latent variables that match the content. This may be improved by introducing a text-embedding model such as BERT [26, 27]. Unlike the comparison between FG and FG+CP, FG+CP+AR outperformed FG+CP, which indicates that linguistic features help stabilize the sampling of latent variables from the AR prior, since full-context features include information about the linguistic structure. Regarding the proposed systems, MG+CP and MG+CP+AR performed better than the others in terms of expressiveness. This suggests that modeling the hierarchical linguistic structure of the latent variables is effective for expressive speech synthesis. In addition, MG+CP+AR had the best expressiveness score, so explicit temporal modeling of the latent variables is also useful; however, since the difference between the two was small, modeling the hierarchical structure appears more important than modeling the temporal coherency of the latent variables. On the other hand, the naturalness of MG+CP and MG+CP+AR was degraded compared with that of FG+CP+AR. This may be caused by the multiple sampling steps needed to obtain the word-level latent variables in the proposed system.

Figure 3: Latent space of the multi-grained VAE
Our proposed multi-grained system provides controllability of the speaking style at the synthesis stage. The word-level latent variables are predicted from the utterance-level latent variable, so we can control the speaking style of the entire utterance despite using fine-grained latent variables. In this experiment, since the latent variables were 2-dimensional, we could visualize the utterance-level latent variables directly and manually specify their values at the synthesis stage. Figure 3 shows the latent space of the multi-grained VAE, visualized by plotting the utterance-level latent variables of 1,500 randomly selected utterances per speaking style. The utterances of normal, happy, and sad style were separated without annotation, and the utterances of radio style, which includes various speaking styles, were distributed so as to mix into the three styles. Thus, we could control the speaking style of synthesized speech intuitively while viewing the latent space. To demonstrate the controllability, we provide synthesized speech samples on a web page.
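The sketch below illustrates the kind of inspection and control this enables: plotting the 2-dimensional utterance-level latents colored by style, then synthesizing with a manually chosen point in that space. Function names such as `encode_utterance_latent` and `synthesize` stand in for the trained models and are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_utterance_latent_space(latents_by_style):
    """latents_by_style: dict mapping a style name to an (N, 2) array of
    utterance-level latent variables extracted by the encoder."""
    for style, z in latents_by_style.items():
        plt.scatter(z[:, 0], z[:, 1], s=4, label=style)
    plt.xlabel("z_u[0]")
    plt.ylabel("z_u[1]")
    plt.legend()
    plt.title("Utterance-level latent space of the multi-grained VAE")
    plt.show()

# Style control at synthesis time: pick a point in the latent space by hand
# (e.g., inside the region occupied by "happy" utterances) and let the latent
# converters expand it to word-level latents before decoding.
z_u_manual = np.array([1.5, -0.5], dtype=np.float32)  # illustrative value
# waveform = synthesize(text, z_u=z_u_manual)          # hypothetical pipeline call
```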
5. Conclusions
We proposed a novel framework for expressive speech synthesis with a hierarchical multi-grained generative model that models fine-grained latent variables while considering the hierarchical linguistic structure, the temporal coherency, and the input text. Experimental results indicate that the proposed model is effective for expressive and controllable speech synthesis. Future work includes utilizing a text-embedding model such as BERT to take the content of the text into account in the proposed model. Introducing the proposed model into end-to-end TTS with an attention mechanism, to control not only the acoustic features but also the phoneme durations, is also left as future work.
6. Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP19H04136.

7. References
[1] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proceedings of ICASSP, 2013, pp. 7962–7966.
[2] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proceedings of Interspeech, 2014, pp. 1964–1968.
[3] Z. Wu and S. King, “Investigating gated recurrent networks for speech synthesis,” in Proceedings of ICASSP, 2016, pp. 5140–5144.
[4] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” in Proceedings of Interspeech, 2017, pp. 4004–4010.
[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proceedings of ICASSP, 2018, pp. 4779–4783.
[6] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
[7] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with Transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6706–6713.
[8] K. Inoue, S. Hara, M. Abe, N. Hojo, and Y. Ijima, “An investigation to transplant emotional expressions in DNN-based TTS synthesis,” in Proceedings of APSIPA, 2017, pp. 1253–1258.
[9] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron,” arXiv preprint arXiv:1803.09047, 2018.
[10] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” arXiv preprint arXiv:1803.09017, 2018.
[11] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of ICLR, 2014.
[12] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Expressive speech synthesis via modeling expressions with variational autoencoder,” in Proceedings of Interspeech, 2018, pp. 3067–3071.
[13] G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, “Deep encoder-decoder models for unsupervised learning of controllable speech synthesis,” arXiv preprint arXiv:1807.11470, 2018.
[14] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, “Learning latent representations for style control and transfer in end-to-end speech synthesis,” in Proceedings of ICASSP, 2019, pp. 6945–6949.
[15] Y. Lee and T. Kim, “Robust and fine-grained prosody control of end-to-end speech synthesis,” in Proceedings of ICASSP, 2019, pp. 5911–5915.
[16] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, “Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis,” in Proceedings of ICASSP, 2020, pp. 6264–6268.
[17] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu, “Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior,” in Proceedings of ICASSP, 2020, pp. 6699–6703.
[18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
[20] “Speech signal processing toolkit (SPTK),” http://sp-tk.sourceforge.net/.
[21] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, 2016.
[22] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “A hidden semi-Markov model-based speech synthesis system,” IEICE Transactions on Information and Systems, vol. E90-D, no. 5, pp. 825–834, 2007.
[23] C. M. Bishop, “Mixture density networks,” Neural Computing Research Group, Aston University, Tech. Rep. NCRG/94/004, 1994.
[24] “Open JTalk,” http://open-jtalk.sourceforge.net/.
[25] K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Trajectory training considering global variance for speech synthesis based on neural networks,” in Proceedings of ICASSP, 2016, pp. 5600–5604.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019.
[27] T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu, “Pre-trained text embeddings for enhanced text-to-speech synthesis,” in Proceedings of Interspeech, 2019.