Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS
Rui Liu, Member, IEEE, Berrak Sisman, Member, IEEE, Feilong Bao, Guanglai Gao, Haizhou Li, Fellow, IEEE
Abstract—Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To our best knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
Index Terms—Tacotron, Multi-Task Learning, Prosody
I. INTRODUCTION
With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over conventional TTS techniques [1], [2]. Tacotron-based approaches [3]–[7] with an encoder-decoder architecture and attention mechanism have shown remarkable performance. The key idea is to integrate the conventional TTS pipeline into a unified network and learn the mapping directly from text-waveform pairs [8]–[10]. Recent progress in neural vocoders [4], [11]–[15] also contributes to the improvement of speech quality.

Speech prosody includes affective prosody and linguistic prosody. Affective prosody represents the emotion of a speaker, while linguistic prosody relates to the language content. Both are crucial in speech communication. A TTS system is expected to synthesize the right prosodic pattern at the right time. However, most current end-to-end systems [3], [4], [6], [7] do not explicitly model speech prosody. Therefore, they cannot control the melodic and rhythmic aspects of the generated speech well. This usually leads to monotonous speech, even when models are trained on very expressive speech datasets. In this paper, we would like to
study the way to enable Tacotron-based TTS for expressive prosody generation.

Multi-task learning (MTL) is a learning paradigm that leverages information from multiple related tasks to help improve the overall performance [16]. MTL is inspired by human learning activities, where people often apply the knowledge learned from many tasks when learning a new task; this is called inductive transfer. For example, if we learn to read and write together, the experience in reading can strengthen the writing, and vice versa. MTL has been widely used in speech enhancement [17] and speech recognition [18]. It has also been used in speech synthesis [19], such as statistical parametric speech synthesis with GANs [20] and DNN-based speech synthesis with stacked bottleneck features [21].

Rui Liu, Feilong Bao and Guanglai Gao are with the Department of Computer Science, Inner Mongolia University. Rui Liu is also with the National University of Singapore. Berrak Sisman is with the Information Systems Technology and Design (ISTD) Pillar at Singapore University of Technology and Design. Haizhou Li is with the Department of Electrical and Computer Engineering, National University of Singapore. This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (Award No: AISG-GC-2019-002 and Award No: AISG-100E-2018-006) and its National Robotics Programme (Grant No. 192 25 00054), by RIE2020 Advanced Manufacturing and Engineering Programmatic Grants A1687b0033 and A18A2b0046, by SUTD Start-up Grant Artificial Intelligence for Human Voice Conversion (SRG ISTD 2020 158) and SUTD AI Grant 'The Understanding and Synthesis of Expressive Speech by AI' (PIE-SGP-AI-2020-02), and by the China National Natural Science Foundation (No. 61773224).
In this paper, we apply multi-task learning to Tacotron-based TTS for prosody modeling.

The study of expressive speech synthesis is focused on prosody modeling [22]–[24], where speech prosody generally refers to intonation, stress, speaking rate, and phrase breaks. Prosodic phrasing [25]–[28] plays an important role in both affective and linguistic expressions. Inadequate phrase breaks may lead to misperception in speech communication. There have been recent studies on prosody modeling for end-to-end TTS systems [29], for example, to improve prosodic phrasing by using contextual information [30] and syntactic features [31]. These are incorporated at the stage of text preprocessing; therefore, they are not optimized as part of the synthesis process.

We propose a novel two-task learning scheme for the Tacotron-based TTS model to improve prosodic phrasing: 1) the main task learns the prediction of the speech spectrum parameters from a character-level embedding representation, and 2) the secondary task learns the prediction of a word-level prosody embedding. During training, the secondary task serves as an additional supervision for Tacotron to learn the exquisite prosody structure associated with the input text. At run-time, the prosody embedding serves as a local condition that controls the prosodic phrasing during voice generation.

The main contributions of this paper include: 1) a novel Tacotron-based TTS architecture that explicitly models prosodic phrasing; and 2) a multi-task learning scheme that optimizes the model for a high-quality speech spectrum and adequate prosodic phrasing at the same time. The proposed system achieves remarkable voice quality for both Chinese Mandarin and Mongolian. To our best knowledge, this is the first multi-task Tacotron implementation that includes an explicit prosodic model.
Fig. 1. Block diagrams of the proposed (a) MTL-Tacotron, and (b) prosody generator. MTL-Tacotron employs a prosody generator to explicitly model prosodic phrasing. The prosody generator produces a prosody embedding (PE) from the input text, which forms a joint embedding vector with the character embedding. The Griffin-Lim algorithm is not involved in training. SP denotes Mel-spectrum speech features. pe_t is a 5-dimension prosody embedding.

This paper is organized as follows. Section II recaps the Tacotron TTS framework. We propose the multi-task Tacotron in Section III and report the experiments in Section IV. Section V concludes the discussion.

II. TACOTRON-BASED TTS

Tacotron [3] is a sequence-to-sequence speech synthesizer that consists of an encoder, and a decoder with an attention mechanism. The encoder takes a sequence of text characters as input, where each character is encoded as a one-hot vector and embedded into a continuous vector, called the input character embedding. The encoder is trained as part of Tacotron to take the input character embedding and generate the output character embedding. The decoder is an autoregressive recurrent neural network that converts the character embeddings into a sequence of Mel spectrum feature vectors with an attention mechanism.

Like most other TTS systems, Tacotron [3] is trained to predict the Mel spectrum features from an input sequence of characters. Prosody, if taken into consideration, is modeled from the statistics of the training data [3], [4], [6], [8]. We note that character sequences themselves are not the most suitable for describing prosody. They do not generalize well because prosody is manifested over a speech segment beyond characters and phonemes. There have been attempts [8], [32] to use word embedding as input to improve the expressiveness of Tacotron-based TTS models, which shows that word embedding is prosody-informing.

Another idea [33]–[37] is to extract latent prosody embeddings to characterize prosody. Some [33]–[35] learn speech variations without explicit annotations for prosody or style. The learned prosody embeddings are usually not fully controllable and interpretable. Others [36], [37] just take the prosody embeddings as an auxiliary input to the TTS model. In this paper, we propose a novel prosody embedding that directly interprets the phrase breaks from the input text. We also propose a novel multi-task learning framework that optimizes the system to generate the Mel spectrum and, at the same time, accurately predict phrase breaks, which will be the focus of Section III.
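The one-hot-to-embedding step above amounts to a table look-up. The sketch below illustrates it with an assumed toy vocabulary and NumPy in place of a trainable embedding layer; it is an illustration, not the authors' implementation:

```python
import numpy as np

# Assumed character vocabulary; the paper encodes Chinese as Pinyin with
# tones and Mongolian in Latin transliteration.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz0123456789 ")}

rng = np.random.default_rng(0)
# One continuous 256-dim vector per character; indexing the table with a
# character id is equivalent to multiplying its one-hot vector by the table.
embedding_table = rng.standard_normal((len(vocab), 256))

text = "ni3 hao3"                      # toy Pinyin input
char_ids = [vocab[ch] for ch in text]
ce_i = embedding_table[char_ids]       # input character embedding, CE_i
print(ce_i.shape)                      # (8, 256): one row per character
```

In Tacotron itself the table is a trainable parameter, updated jointly with the encoder and decoder rather than fixed at random.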
III. TACOTRON WITH MULTI-TASK LEARNING
We propose multi-task learning [38] for Tacotron, as illustrated in Fig. 1(a), which is referred to as MTL-Tacotron. The idea is to dedicate a prosody modeling task to model the prosodic phrasing, which is trainable from data. In the multi-task learning, not only do we optimize the output speech quality, but we also ensure that Tacotron is optimized to produce adequate phrase breaks. We study a two-task learning strategy: 1) the main task generates the Mel spectra from the input character sequence; and 2) the secondary task predicts an appropriate prosodic phrasing.
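Before detailing each task, the joint objective can be sketched numerically. This is a NumPy toy example with assumed shapes and random values, not the authors' training code; the loss definitions follow the subsections below, and only the weight w = 0.5 is taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: T_frames Mel frames with 80 channels, T_words words with
# 5 phrase-break classes (break, non-break, blank, punctuation, stop token).
T_frames, n_mels, T_words, n_classes = 12, 80, 4, 5

# Main task: spectral loss between target and predicted Mel spectra
# (an L1 distance per frame, summed over the utterance).
y_true = rng.standard_normal((T_frames, n_mels))
y_pred = rng.standard_normal((T_frames, n_mels))
loss_wav = np.abs(y_true - y_pred).sum()

# Secondary task: softmax over break classes per word, scored against the
# ground-truth label with cross-entropy.
logits = rng.standard_normal((T_words, n_classes))
pe = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
k_true = np.array([1, 0, 1, 4])        # toy break labels per word
loss_pe = -np.log(pe[np.arange(T_words), k_true]).sum()

# Joint objective; the paper sets the weight w to 0.5.
w = 0.5
loss_total = loss_wav + w * loss_pe
```

In the actual system, both losses are backpropagated through shared components, so the spectral task and the phrase-break task regularize each other.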
A. Main Task: Spectral Modeling
The main task has a network architecture identical to the traditional Tacotron [3], as shown in Fig. 1(a). It contains a text encoder and a decoder with an attention mechanism. We first convert the word sequence in the raw input text to input character embeddings, denoted as CE_i, which are encoded into output character embeddings, denoted as CE_o, from which the decoder generates Mel spectral features.

The main task is optimized using a Mel spectral loss function, Loss_wav = Σ_{t=1}^{T'} L(y_t, y'_t), where y_t and y'_t are the target and the predicted Mel spectrum respectively, T' is the total number of Mel spectrum features in the utterance, and L denotes an L1 norm function.

B. Secondary Task: Prosody Modeling
The secondary task optimizes a prosody generator to predict the phrase break pattern for each word in the input text, as shown in Fig. 1(b). We define the prosody embedding (PE) as a vector of five elements, namely break, non-break, blank, punctuation and stop token, that represents five phrase break patterns. The stop token denotes the end of an utterance, while punctuation refers to any punctuation symbols other than the stop token.

The word sequence in the input text is first represented by a sequence of word embeddings. We devise a Bidirectional Long Short-Term Memory (BLSTM) network as the prosody generator, which takes the word embeddings WE = {we_1, ..., we_t, ..., we_T} as input and generates the prosody embedding PE = {pe_1, ..., pe_t, ..., pe_T} as output, where pe_t is a 5-dimension embedding vector. Specifically, the forward and backward LSTMs read the word embedding sequence WE from both directions. We add a hidden layer on top of the LSTM to detect higher-level feature combinations, and a softmax layer to produce the probability distribution of phrase break patterns pe_t for each of the T words. An element in the embedding vector pe_t = [p_t[1], ..., p_t[k], ..., p_t[5]], t ∈ [1, T], represents the probability of the phrase break label k.

The secondary task minimizes the difference between the predicted prosody embedding and the ground-truth one-hot vector using the cross-entropy loss Loss_pe = −Σ_{t=1}^{T} log p_t[k], where k represents the target phrase break pattern.

C. Multi-task Learning
We now have two parallel feature representations of the input text, as shown in Fig. 1(a): CE_o is the character representation in the main task, and the prosody embedding (PE) is produced in the secondary task. The prosody embedding serves as an auxiliary input to Tacotron that informs the phrase break information. We concatenate CE_o and PE to form a joint embedding vector as the input for the attention mechanism. In this way, we expect that Tacotron optimizes the voice quality while also making sure that the phrase breaks are correct. As CE_o and PE have different time resolutions, we upsample PE to align with CE_o, as shown in Fig. 1(a).

The total loss function is given as Loss_total = Loss_wav + w * Loss_pe, with w as a weight. With the total loss, we expect that the prosody generator learns from both the phrase break annotations and the actual speech utterances to associate acoustic-prosodic patterns with the input text, thus improving the Tacotron spectral generation at run-time inference. The two-task learning strategy is also referred to as the joint training strategy.

IV. EXPERIMENTS
A. Databases
Speech Data: We use the TsingHua-Corpus of Speech Synthesis (TH-CoSS) [39] for Chinese. We use a subset of TH-CoSS that contains approximately 9 hours of speech data and 5.6k utterances with 103k words. The speech signals are sampled at 16 kHz and encoded at 16-bit. The Mongolian speech data, as in [40], contains about 17 hours of speech in total. The speech signals are sampled at 22.05 kHz and encoded at 16-bit. For both Chinese and Mongolian, we divide the corpus into training and test sets in a ratio of 4 to 1 in all experiments.
Phrase Break Labels: We use the text transcripts of the speech data as the training data for the prosody generator. The prosodic phrases of the Chinese text, break and non-break, are manually labelled. The Mongolian phrase breaks are marked by examining the text and listening to the speech samples. The blank, punctuation and stop token labels are naturally present in the text.

Word Embedding: We generate the word embeddings WE via table look-up. For Chinese, we use the Tencent AI Lab embedding database for Chinese Words and Phrases [41]. For Mongolian, the pre-trained 200-dimension word embeddings reported in [42] are used.

B. Contrastive Systems
We build three contrastive systems to validate the two ideas in the proposed MTL-Tacotron, namely multi-task learning and prosody embedding, in a comparative study. In all systems, we use the Griffin-Lim algorithm [43] for waveform generation for rapid turn-around.

1) The traditional Tacotron TTS system as in [3], which doesn't explicitly model prosodic phrasing.

2) Tacotron augmented with word embedding as in [8], denoted as WE-Tacotron and illustrated in Fig. 2(a). The word embedding informs Tacotron of the word identity and its boundaries, which is shown to be effective [8].
Fig. 2. Block diagrams of the baseline frameworks (a) WE-Tacotron, and (b) PE-Tacotron. The Griffin-Lim algorithm is not involved in training. SP denotes Mel-spectrum speech features.
3) Tacotron augmented with prosody embedding without multi-task joint training, denoted as PE-Tacotron and illustrated in Fig. 2(b). The prosody embeddings are derived from word embeddings to encode the prosodic phrasing.

Besides the multi-task learning, MTL-Tacotron also differs from both WE-Tacotron and PE-Tacotron in that the text encoder is incorporated in order to facilitate the joint training. PE-Tacotron and WE-Tacotron share a similar architecture with the Tacotron baseline, except that PE-Tacotron is augmented with prosody embedding, while WE-Tacotron is augmented with word embedding. Unlike MTL-Tacotron, they incorporate embeddings that are trained independently of Tacotron. They are the contrastive models for MTL-Tacotron to show the effect of multi-task learning.

We note that the prosody embedding is derived from word embedding. MTL-Tacotron and PE-Tacotron are trained to predict the phrase breaks explicitly from word embeddings, while WE-Tacotron uses the word embeddings directly. Therefore, WE-Tacotron serves as the contrastive model for MTL-Tacotron and PE-Tacotron to show the advantage of the proposed prosody embedding.
C. Experimental Setup
The Chinese text is encoded as Pinyin strings with tones, and the Mongolian text in Latin transliteration. For both languages, we generate an 80-channel Mel spectrum as output. The CE_i (or CE_o) and PE sizes are set to 256 and 5 respectively. The number of output frames is controlled by a hyperparameter reduction factor (r), which is set to 5, and the weight w is set to 0.5. We use the Adam optimizer with β1 = 0.9 and β2 = 0.999, and a learning rate that decays exponentially starting from 50k steps. We also apply L2 regularization. All models are trained with a batch size of 32. The final models are trained for 200k steps for all systems.

The prosody generator in MTL-Tacotron is jointly trained with the other Tacotron modules. The size of the LSTM layer is set to 200 in both directions for all experiments, and the size of the hidden layer is set to 50. The prosody generator in PE-Tacotron has exactly the same configuration as that in MTL-Tacotron, except that it is pre-trained on the training data.
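A schedule of this shape can be sketched as follows. The initial rate, final rate, and decay horizon below are assumed placeholders for illustration, not values from the paper:

```python
# Exponential learning-rate decay that starts at 50k steps, as in the
# setup above. lr_init, lr_final and decay_steps are assumed placeholders.
def lr_at(step, lr_init=1e-3, lr_final=1e-5,
          decay_start=50_000, decay_steps=150_000):
    """Hold lr_init until decay_start, then decay exponentially to lr_final."""
    if step <= decay_start:
        return lr_init
    frac = min((step - decay_start) / decay_steps, 1.0)
    return lr_init * (lr_final / lr_init) ** frac

print(lr_at(0))         # 0.001 (flat phase)
print(lr_at(200_000))   # ≈ 1e-05 (fully decayed by the end of training)
```

The geometric interpolation lr_init * (lr_final/lr_init)^frac is one standard way to realize "exponentially decaying" between two rates; a per-step multiplicative decay factor is an equivalent formulation.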
D. Phrase Break Prediction
We report the phrase break prediction performance of the prosody generator in MTL-Tacotron and PE-Tacotron, where
TABLE I
Comparison of phrase break prediction in terms of Precision (P), Recall (R) and F-score (F) for the two systems that employ prosody embedding, and mean opinion score (MOS) in listening tests for all systems.

System | Language | P | R | F | MOS
Tacotron | Chinese | NA | NA | NA | 3.71
MTL-Tacotron | Chinese | 90.77 | 91.54 | – | –
[remaining entries not recoverable from the extracted text]
prosody embedding is used. As the text in the datasets has already been annotated with prosody labels, it serves as the ground truth for reporting the performance. At run-time inference, the phrase break pattern of a word we_t is predicted as k̂ = argmax_k p_t[k]. We report the performance in terms of Precision (P), Recall (R) and F-score (F), where the F-score is defined as the harmonic mean of P and R. F values range from 0 to 1, with a higher value indicating better performance.

As shown in Table I, MTL-Tacotron clearly outperforms
PE-Tacotron in phrase break prediction. By comparing MTL-Tacotron and PE-Tacotron, we confirm the advantage of joint training over pre-trained prosody embedding. We expect that MTL-Tacotron will reflect the improved phrase break prediction in actual prosodic rendering in speech.
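The run-time decision and the reported metrics can be sketched as follows, treating break as the positive class. This is a standard P/R/F formulation on toy data; the paper does not describe its exact evaluation script:

```python
import numpy as np

# Run-time inference: pick the most probable break pattern per word.
pe_t = np.array([0.05, 0.70, 0.10, 0.05, 0.10])   # toy softmax output
k_hat = int(np.argmax(pe_t))                       # predicted pattern index

# Precision, recall and F-score with "break" as the positive class.
def prf(y_true, y_pred, positive="break"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)           # harmonic mean of P and R
    return precision, recall, f

p, r, f = prf(["break", "non-break", "break", "break"],
              ["break", "break", "non-break", "break"])
print(round(p, 2), round(r, 2), round(f, 2))       # 0.67 0.67 0.67
```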
E. Subjective Listening Test
We conduct listening experiments for all systems. 20 Chinese and 15 Mongolian speakers participated in the listening tests. Each subject listens to 80 synthesized utterances in his/her native language.
1) Voice Quality: We first evaluate the voice quality with the mean opinion score (MOS) among the four systems. The listeners rate the quality on a 5-point scale: "5" for excellent, "4" for good, "3" for fair, "2" for poor, and "1" for bad. In Table I, we observe that PE-Tacotron and MTL-Tacotron consistently outperform the traditional Tacotron, which doesn't explicitly model prosodic phrasing. The results validate the idea of prosody embedding. Moreover, MTL-Tacotron outperforms WE-Tacotron and PE-Tacotron consistently, which confirms the advantage of the proposed joint training.
2) Prosodic Embedding vs. Word Embedding: To confirm the advantage of prosody embedding over word embedding [8], we further conduct ABX preference tests between pairs of systems. The subjects are asked to choose their preferred utterance, in terms of rhythm and prosodic break, between a pair of synthesized utterances. The results in Table II suggest that the MTL-Tacotron system with prosodic phrasing significantly outperforms the others in both Chinese and Mongolian experiments.

We also observe that both MTL-Tacotron and PE-Tacotron outperform the WE-Tacotron system. As MTL-Tacotron and PE-Tacotron model the phrase breaks explicitly, the results suggest that modeling phrase breaks explicitly is more effective than using word embedding as a proxy to inform the prosody [8].

Speech samples from the listening tests: https://ttslr.github.io/SPL2020
TABLE II
The preference percentage (%) with confidence intervals for six competing pairs on common test data.

Competing pair | Language | Former | Neutral | Latter | p-value
Tacotron vs. WE-Tacotron | Chinese | 30.56 | 28.56 | 41.19 | 0.00105
Tacotron vs. WE-Tacotron | Mongolian | 29.00 | 24.92 | 46.08 | 0.00014
Tacotron vs. PE-Tacotron | Chinese | 29.38 | 27.18 | 43.44 | 0.00248
Tacotron vs. PE-Tacotron | Mongolian | 27.33 | 24.42 | 48.25 | 0.00134
Tacotron vs. MTL-Tacotron | Chinese | 27.44 | 26.25 | 46.31 | 0.00176
Tacotron vs. MTL-Tacotron | Mongolian | 25.91 | 21.01 | 53.08 | 0.00054
WE-Tacotron vs. MTL-Tacotron | Chinese | 39.56 | 9.75 | 50.69 | 0.00392
WE-Tacotron vs. MTL-Tacotron | Mongolian | 37.42 | 11.41 | 51.17 | 0.00217
WE-Tacotron vs. PE-Tacotron | Chinese | 38.81 | 9.69 | 51.50 | 0.00282
WE-Tacotron vs. PE-Tacotron | Mongolian | 37.33 | 12.00 | 50.67 | 0.00153
PE-Tacotron vs. MTL-Tacotron | Chinese | 40.06 | 12.69 | 47.25 | 0.00047
PE-Tacotron vs. MTL-Tacotron | Mongolian | 41.50 | 9.92 | 48.58 | 0.00318
Fig. 3. The preference percentage (%) with 95% confidence interval between Tacotron and MTL-Tacotron for various text lengths.

As
MTL-Tacotron consistently offers superior performance, we are convinced that multi-task learning improves the accuracy of the prosody model over PE-Tacotron, thereby generating more accurate prosody embeddings.
3) Effect of Text Length: We further investigate how the systems perform with regard to the length of the input text. By grouping the test sentences by length, we create three subsets: 1) T50 with sentences of up to 50 characters; 2) T100 with sentences of 51 to 100 characters; and 3) T200 with sentences of 101 to 200 characters. We select 80 utterances from each group for the evaluation of expressiveness, and report the subjective listening test in Fig. 3. We observe that MTL-Tacotron consistently outperforms the Tacotron baseline. It is worth noting that MTL-Tacotron performs remarkably well for long sentences in T100 and T200, which is encouraging.

V. CONCLUSIONS
We have proposed a novel multi-task Tacotron model to model the prosodic phrasing in speech synthesis, where a word-level prosody generator is introduced as the secondary task. The experiments show that the proposed MTL-Tacotron consistently outperforms all contrastive systems. The modeling technique for prosodic phrasing can be easily extended to the modeling of other melodic and rhythmic aspects of speech, such as intonation and stress.

REFERENCES

[1] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[2] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP 2013. IEEE, 2013, pp. 7962–7966.
[3] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," in INTERSPEECH, 2017, pp. 4006–4010.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP 2018. IEEE, 2018, pp. 4779–4783.
[5] Y. Lee and T. Kim, "Robust and fine-grained prosody control of end-to-end speech synthesis," in Proc. ICASSP 2019. IEEE, 2019.
[6] R. Liu, B. Sisman, J. Li, F. Bao, G. Gao, and H. Li, "Teacher-student training for robust Tacotron-based TTS," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6274–6278.
[7] R. Liu, B. Sisman, F. Bao, G. Gao, and H. Li, "WaveTTS: Tacotron-based TTS with joint time-frequency domain loss," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 245–251.
[8] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, "Semi-supervised training for improving data efficiency in end-to-end speech synthesis," in Proc. ICASSP 2019, 2019, pp. 6940–6944.
[9] M. He, Y. Deng, and L. He, "Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS," in Proc. Interspeech 2019, 2019, pp. 1293–1297.
[10] H.-T. Luong, X. Wang, J. Yamagishi, and N. Nishizawa, "Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora," in Proc. Interspeech 2019, 2019, pp. 1303–1307.
[11] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder." IEEE, 2017, pp. 712–718.
[12] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders," in Proc. Interspeech 2019, 2019, pp. 1308–1312.
[13] B. Sisman, M. Zhang, and H. Li, "A voice conversion framework with tandem feature sparse representation and speaker-adapted WaveNet vocoder," in Proc. Interspeech 2018, 2018, pp. 1978–1982.
[14] ——, "Group sparse representation with WaveNet vocoder adaptation for spectrum and prosody conversion," IEEE/ACM Transactions on Audio, Speech and Language Processing, 2019.
[15] B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, "Adaptive WaveNet vocoder for residual compensation in GAN-based voice conversion." IEEE, 2018.
[16] Y. Zhang and Q. Yang, "A survey on multi-task learning," arXiv preprint arXiv:1707.08114, 2017.
[17] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[18] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning." IEEE, 2017, pp. 4835–4839.
[19] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, "Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[20] S. Yang, L. Xie, X. Chen, X. Lou, X. Zhu, D. Huang, and H. Li, "Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework." IEEE, 2017, pp. 685–691.
[21] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, "Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis." IEEE, 2015, pp. 4460–4464.
[22] I. Jauk, J. Lorenzo Trueba, J. Yamagishi, and A. Bonafonte Cávez, "Expressive speech synthesis using sentiment embeddings," in Proc. Interspeech 2018, 2018, pp. 3062–3066.
[23] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, "Expressive speech synthesis via modeling expressions with variational autoencoder," pp. 3067–3071, 2018.
[24] Y. Mass, S. Shechtman, M. Mordechay, R. Hoory, O. S. Shalom, G. Lev, and D. Konopnicki, "Word emphasis prediction for expressive text to speech," in Proc. Interspeech 2018, 2018, pp. 2868–2872.
[25] C. Wightman, S. Shattuck-Hufnagel, M. Ostendorf, and P. J. Price, "Segmental durations in the vicinity of prosodic phrase boundaries," Journal of the Acoustical Society of America, pp. 1707–1717, 1992.
[26] H. Kim, T. Yoon, J. Cole, and M. Hasegawa-Johnson, "Acoustic differentiation of L- and L-L% in Switchboard and radio news speech," in Speech Prosody 2006 – 3rd International Conference on Speech Prosody, May 2–5, Dresden, Germany, Proceedings, 2006.
[27] P. Taylor and A. W. Black, "Assigning phrase breaks from part-of-speech sequences," Computer Speech & Language, vol. 12, no. 2, pp. 99–117, 1998.
[28] T. Mishra, Y.-j. Kim, and S. Bangalore, "Intonational phrase break prediction for text-to-speech synthesis using dependency relations." IEEE, 2015, pp. 4919–4923.
[29] R. Sloan, S. S. Akhtar, B. Li, R. Shrivastava, A. Gravano, and J. Hirschberg, "Prosody prediction from syntactic, lexical, and word embedding features," in Proc. 10th ISCA Speech Synthesis Workshop, 2019, pp. 269–274.
[30] Y. Lu, M. Dong, and Y. Chen, "Implementing prosodic phrasing in Chinese end-to-end speech synthesis," in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7050–7054.
[31] H. Guo, F. K. Soong, L. He, and L. Xie, "Exploiting syntactic features in a parsed tree to improve end-to-end TTS," Proc. Interspeech 2019, 2019.
[32] H. Ming, L. He, H. Guo, and F. K. Soong, "Feature reinforcement with word embedding and parsing information in neural TTS," arXiv preprint arXiv:1901.00707, 2019.
[33] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," arXiv preprint arXiv:1803.09017, 2018.
[34] D. Stanton, Y. Wang, and R. Skerry-Ryan, "Predicting expressive speaking style from text in end-to-end speech synthesis." IEEE, 2018.
[35] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
[36] Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language," in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.
[37] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu, "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior," Proc. ICASSP 2020.
[38] R. Caruana, "Multitask learning: A knowledge-based source of inductive bias," Machine Learning Proceedings, pp. 41–48, 1993.
[39] L. Cai, D. Cui, and R. Cai, "TH-CoSS, a Mandarin speech corpus for TTS," Journal of Chinese Information Processing, vol. 21, no. 2, 2007.
[40] J. Li, H. Zhang, R. Liu, X. Zhang, and F. Bao, "End-to-end Mongolian text-to-speech system," in ISCSLP 2018 – 11th International Symposium on Chinese Spoken Language Processing, November 26–29, Taipei, Proceedings, 2018, pp. 3062–3066.
[41] Y. Song, S. Shi, J. Li, and H. Zhang, "Directional skip-gram: Explicitly distinguishing left and right context for word embeddings," in NAACL 2018 – 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 1–6, New Orleans, USA, Proceedings, 2018, pp. 175–180.
[42] R. Liu, F. Bao, G. Gao, and W. Wang, "Improving Mongolian phrase break prediction by using syllable and morphological embeddings with BiLSTM model," Proc. Interspeech 2018, pp. 57–61, 2018.
[43] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform,"