Controllable Neural Prosody Synthesis

Max Morrison*, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore
Northwestern University, Evanston, IL, USA
Adobe Research, San Francisco, CA, USA
[email protected], [email protected]

* This work was carried out during an internship at Adobe Research.
Abstract
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.
Index Terms: prosody generation, speech editing, speech synthesis, text to speech, voice modification, vocoder
1. Introduction
Text is a strong indicator of prosodic patterns [1], but not a determinant. For the same text, prosody varies with speaker intention [2], which imposes challenges for modern text-to-speech models [3, 4, 5] in the form of misplaced emphases and degraded naturalness. Manually attempting such corrections using an audio editor, as is done in podcast and video dialogue editing, requires expertise in speech manipulation and significant time and effort. In this paper, we aim to address these issues by proposing an intuitive and less error-prone process, consisting of three steps: (1) a user provides constraints on the prosody (e.g., by drawing part of an F0 contour), (2) a neural prosody generator predicts an F0 contour for the whole utterance while matching the user's constraints, and (3) a neural vocoder synthesizes a high-fidelity speech recording that exhibits the generated prosody.

Early approaches for F0 synthesis use techniques such as decision trees [6], unit selection [7], and hidden Markov models (HMMs) [8]. More recently, deep learning methods such as variational autoencoders (VAEs) [9], deep autoregressive (DAR) neural networks [10], and vector-quantized VAEs (VQ-VAEs) [11] were shown to be effective at generating F0 contours of speech from text features. Hodari et al. [9] show that VAEs produce F0 contours that cluster around placing emphasis on the same words despite repeated sampling. This indicates that the VAE is not capturing the multimodal nature of English prosody associated with contrastive emphases. For example, the sentence "the dog is black" communicates a different intention when one of "dog", "is", or "black" is emphasized. The DAR model proposed in [12] has previously shown promise in modeling the multimodality of English prosody [10] but does not allow user control over F0 generation. While our work focuses on user control of F0 generation, additional prosodic control can be achieved by first generating a speech waveform with the desired phoneme durations (e.g., with [5]) and then using our method to achieve the desired F0.

Once we generate an F0 contour, we synthesize speech using a compatible vocoder. Existing vocoders allow either perceptually high-quality synthesis [3, 13, 14] or a high degree of control over prosody [15]. Recently, significant effort has gone into disentangling the latent spaces of high-quality neural vocoders to recover explicit prosodic control [16, 17, 18, 19]. One such model, Quasi-Periodic WaveNet, allows frame-wise F0 control via an explicit F0 contour but produces lower naturalness than DSP-based vocoders, especially when pitch is shifted upward [20]. In contrast, we propose a pitch-shifting neural vocoder that achieves comparable or superior performance to DSP-based methods while factorizing prosody control parameters in the input space using a simple, jointly-trained bottleneck.

Our key contributions are: (1) a novel method for F0 generation that permits intuitive user controls, (2) a pitch-shifting neural vocoder with explicit F0 conditioning, and (3) a new subjective evaluation method for measuring the naturalness of prosody. Through our perceptual evaluation, we show that user control of prosody can be obtained without degrading prosody naturalness, and our pitch-shifting neural vocoder performs comparably with existing DSP-based methods while outperforming prior neural pitch-shifting methods.
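To make the data flow of the three-step process concrete, the following is a minimal sketch of how such a workflow could be exposed programmatically. All names here (prosody_edit_pipeline, prosody_generator, vocoder, etc.) are hypothetical illustrations, not the paper's actual interface.

```python
# Hypothetical sketch of the three-step workflow described above.
# None of these names come from the paper; they only illustrate the data flow.
import numpy as np

def prosody_edit_pipeline(audio, sample_rate, text, user_f0_constraints,
                          prosody_generator, vocoder, hop_ms=10):
    """Apply user-constrained prosody editing to a speech recording.

    user_f0_constraints: dict mapping frame index -> constrained F0 in Hz
                         (frames the user drew; all other frames are generated).
    prosody_generator:   a model that predicts a full F0 contour from text
                         and partial constraints (step 2).
    vocoder:             a pitch-shifting vocoder that resynthesizes the audio
                         so it follows the generated contour (step 3).
    """
    n_frames = int(len(audio) / sample_rate * 1000 / hop_ms)

    # Step 1: encode the user's partial constraints as a frame-wise contour,
    # with NaN marking unconstrained frames.
    constraint_contour = np.full(n_frames, np.nan)
    for frame, f0_hz in user_f0_constraints.items():
        constraint_contour[frame] = f0_hz

    # Step 2: generate the remaining frames conditioned on text and constraints.
    full_contour = prosody_generator(text, constraint_contour)

    # Step 3: resynthesize the recording with the new prosody.
    return vocoder(audio, full_contour)
```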
2. Controllable F0 generation
For its effectiveness and simplicity, we use DAR as our baseline model for F0 generation (model QF FT of [12]). DAR feeds input text features through two fully-connected layers with ReLU activation followed by two RNNs, one bidirectional and one unidirectional, followed by a fully-connected layer. The outputs of the last fully-connected layer are the logits of a categorical distribution over quantized F0 values. One F0 value is sampled per time frame and the resulting observation (a one-hot-encoded F0 value) is concatenated to the input of the unidirectional RNN at the next frame (i.e., the unidirectional RNN is autoregressive). To prevent the unidirectional RNN from ignoring the current input features and focusing on its hidden state (i.e., exposure bias), data dropout is used, whereby autoregressive inputs to the unidirectional RNN are set to zero with probability p (we use the same value as in [12]). The original DAR uses a hierarchical softmax loss to improve binary classification of voiced/unvoiced (V/UV) frames. However, the ground truth V/UV sequence can also be derived directly from phonemes when synthesizing speech, or from a preexisting F0 contour when editing speech. We use V/UV sequences from existing F0 contours and therefore use cross-entropy loss instead of hierarchical softmax.

DAR has been shown to be effective at modeling English prosody, but does not permit user control and lacks the context-awareness necessary for speech editing tasks. We address these limitations in our proposed F0 generation model, Controllable DAR (C-DAR), shown in Figure 1.

Figure 1: Our proposed C-DAR architecture for controllable F0 generation. x_t, y_t, o_t, and ô_t are the input features, ground truth F0, predicted F0, and predicted F0 before postnet, respectively, and t = 1, ..., T is the current frame. x_t^L and x_t^R are the input features from the preceding and following speech. The fully-connected layers between the text features and the bidirectional GRU as well as three layers of the postnet are omitted.
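For readers who prefer code, the following is a minimal PyTorch sketch of a DAR-style backbone as described above: two fully-connected layers with ReLU, a bidirectional RNN, an autoregressive unidirectional RNN fed the previous one-hot F0 observation, and a categorical output over quantized F0 bins. This is an illustrative sketch, not the authors' implementation; layer sizes and the use of GRUs are assumptions.

```python
# Minimal sketch of a DAR-style F0 generator (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DARSketch(nn.Module):
    def __init__(self, n_features, n_bins=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.birnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # The autoregressive RNN sees the bidirectional summary plus the
        # previous one-hot-encoded F0 observation.
        self.arnn = nn.GRUCell(2 * hidden + n_bins, hidden)
        self.out = nn.Linear(hidden, n_bins)
        self.n_bins = n_bins

    def forward(self, text_features):
        # text_features: (batch, frames, n_features)
        h, _ = self.birnn(self.mlp(text_features))
        batch, frames, _ = h.shape
        state = h.new_zeros(batch, self.arnn.hidden_size)
        prev = h.new_zeros(batch, self.n_bins)          # previous one-hot F0
        logits = []
        for t in range(frames):
            state = self.arnn(torch.cat([h[:, t], prev], dim=-1), state)
            step_logits = self.out(state)
            logits.append(step_logits)
            # Sample one quantized F0 value per frame and feed it back.
            sample = torch.multinomial(F.softmax(step_logits, dim=-1), 1)
            prev = F.one_hot(sample.squeeze(-1), self.n_bins).float()
        return torch.stack(logits, dim=1)               # (batch, frames, n_bins)
```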
A significant advantage of working with an F0 contour, as opposed to jointly predicting all prosodic features, is that users may explicitly create, modify, and constrain the F0 contour to realize a creative goal. We propose three techniques designed to facilitate control of a DAR-based model for F0 generation: (1) if available, the preceding and following speech content is summarized and used to condition the model, (2) random segments of the ground truth F0 are provided to the model during training, and (3) the model predicts F0 values in reverse order.

The preceding and following speech content provides useful indicators for placing emphases [2], capturing the speaker's current prosodic style [21], and determining F0 values near boundary points. This context-awareness is essential in speech editing tasks, where prosody edits must sound natural relative to the surrounding speech. We incorporate context-awareness by summarizing the preceding and following content each with an untied, two-layer bidirectional GRU with hidden size 128. We use the same input features for the preceding and following content (see Section 2.3) with the addition of one-hot-encoded F0 values. The result is concatenated with the text features at the input of the model at each time frame.

A potentially useful user interaction for controllable F0 generation is to explicitly specify some segments of the desired contour (e.g., by placing and moving anchor points or drawing) and have a generative model infer the remaining F0 values. This permits iterative refinement, in which a user generates an F0 contour using our model, selects regions they want to keep, regenerates only unselected regions, and repeats until satisfied. Explicitly specifying a higher pitch over a word also allows users to quickly create emphases. (We demonstrate this use case and provide audio examples of our experiments online.) This is useful as emphases in English are challenging to predict [22], as they can arise semantically [23] or simply due to speaker excitement [21]. We implement this technique by providing ground-truth F0 as an input feature during training for random subsequences between 10 and 1000 milliseconds, and explicitly conditioning the autoregressive RNN on this ground truth F0. Explicitly conditioning the unidirectional RNN allows it to predict the correct F0 with high accuracy while, we hypothesize, the input features encourage the model to learn to generate smooth, continuous F0 surrounding the specified contour rather than suddenly jumping to the specified contour. During inference, we use the user-specified F0 values instead of the ground-truth F0. Because a recurrent model uses a combination of its input features and history, and does not have a reliable history at the start of generation, discontinuities are reduced when the specified contour occurs towards the start of generation. Therefore, downstream tasks that require more editing in the second half of an utterance benefit from reversing the order of sequence generation.

Relative to DAR, C-DAR has three additional changes that do not significantly impact naturalness or controllability, but provide additional insights into F0 generation. First, a 5-layer postnet [3] follows the autoregressive RNN. We find that this postnet has the effect of reducing autoregressive sampling errors and tightening the posterior distribution around the argmax (Figure 2). Second, we use scheduled sampling [24] instead of data dropout. Scheduled sampling is known to not be a consistent estimator [25] and was shown to exhibit worse objective metrics for F0 generation (V/UV precision and pitch RMSE) [12]. Our findings indicate that neither consistency nor superior objective metrics are reliable indicators of improved subjective naturalness. Lastly, our bidirectional RNN has 16 hidden units instead of 256, indicating that prior F0 generation models may be using more model capacity than necessary.

Figure 2: Effects of postnet on F0 generation. (Top) Log posterior distribution at each frame before postnet. (Middle) Log posterior distribution after postnet. (Bottom) Argmax of each distribution converted to frequency. Posteriors are per-frame normalized to have a maximum of 0 and clipped below -40.

Table 1: Speakers used for evaluation and vocoder training.

  ID    Gender    Book(s)                     Hours
  94    Male      The Canterville Ghost
        Female    Little Fuzzy
        Male      The Cowardly Lion of Oz
        Female    The Warren Report

For each 10 ms frame, we concatenate the following input features: (1) the one-hot-encoded phoneme, (2) a BERT [26] word embedding, (3) the V/UV label, and (4) one-hot encodings of nearby punctuation (e.g., whether the word precedes a comma or is in quotations). We use P2FA [27] for phoneme alignment. Word embeddings are computed by averaging over subword tokens extracted via the bert-large-uncased pretrained model from the HuggingFace Transformers package [28]. These features are jointly referred to as "Text features" in Figure 1.
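As an illustration of the word-embedding feature, the sketch below computes one embedding per word by averaging its subword token vectors from bert-large-uncased via HuggingFace Transformers. The choice of the last hidden layer and the use of the fast tokenizer's word alignment are assumptions, not details stated in the paper.

```python
# Sketch: per-word BERT embeddings by averaging subword token vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").eval()

def word_embeddings(words):
    """Return one embedding per word by averaging its subword token vectors."""
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]   # (tokens, 1024)
    embeddings = []
    for word_idx in range(len(words)):
        # Indices of the subword tokens belonging to this word.
        token_ids = [i for i, w in enumerate(encoding.word_ids()) if w == word_idx]
        embeddings.append(hidden[token_ids].mean(dim=0))
    return torch.stack(embeddings)                        # (words, 1024)

# Example: one 1024-dimensional vector per word, repeated per 10 ms frame downstream.
vectors = word_embeddings("the dog is black".split())
```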
We use a modified version of CREPE [29] to extract ground truth F0 contours. Our modification is as follows: rather than performing a localized search around the argmax of the posterior distribution over F0 bins, we directly decode the F0 contour from the posterior distribution via Viterbi decoding. Our transition matrix places maximal probability on maintaining the same F0 value and zero probability on F0 discontinuities greater than 240 cents, with linearly decreasing probability in between. We determine V/UV labels via hysteresis thresholding applied to CREPE's harmonicity confidence value. During training, we quantize our F0 representation to one of 128 values. We reserve one value for predicting unvoiced tokens and evenly divide the other 127 bins to span a fixed number of standard deviations above and below the speaker's average F0 in base-2 log-space.
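The following is a minimal NumPy sketch of this Viterbi-style decoding over CREPE's per-frame posteriors. It assumes CREPE's 360 pitch bins spaced 20 cents apart (so a 240-cent jump corresponds to 12 bins); the normalization of the transition probabilities is illustrative, not the authors' code.

```python
# Minimal sketch of Viterbi decoding over CREPE's per-frame F0 posteriors.
import numpy as np

def build_transition(n_bins=360, max_jump_bins=12):
    """Triangular transition matrix: maximal probability for staying on the same
    bin, linearly decreasing with jump size, zero beyond max_jump_bins."""
    jump = np.abs(np.arange(n_bins)[:, None] - np.arange(n_bins)[None, :])
    transition = np.clip(1.0 - jump / max_jump_bins, 0.0, None)
    return transition / transition.sum(axis=1, keepdims=True)

def viterbi_decode(posteriors, transition):
    """posteriors: (frames, n_bins) per-frame probabilities from CREPE.
    Returns the most likely bin index per frame."""
    frames, n_bins = posteriors.shape
    log_post = np.log(posteriors + 1e-12)
    log_trans = np.log(transition + 1e-12)
    score = log_post[0].copy()
    backpointer = np.zeros((frames, n_bins), dtype=int)
    for t in range(1, frames):
        candidate = score[:, None] + log_trans          # (from_bin, to_bin)
        backpointer[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + log_post[t]
    path = np.zeros(frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(frames - 1, 0, -1):
        path[t - 1] = backpointer[t, path[t]]
    return path  # convert bins to Hz with CREPE's bin-to-frequency mapping
```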
3. Pitch-shifting WaveNet vocoder
In order to synthesize speech from an arbitrary F0 contour, we propose a pitch-shifting WaveNet vocoder (PS-WaveNet) that accepts an F0 contour as conditioning. We use a publicly available implementation [30] of a single-speaker WaveNet vocoder [31] and predict 10-bit µ-law-encoded waveform samples. We enable explicit F0 control by forcing the input acoustic features through a small, jointly learned bottleneck that encourages the network to take F0 features solely from the input F0 contour.

We use 21-channel mel-cepstral coefficients (MCeps) instead of the typical 80-channel log-mel-spectrograms, as MCeps are less individually representative of energies at specific frequencies and therefore more easily separable from F0. Our informal experiments using 80-channel log-mel-spectrograms produced samples that were relatively inharmonic, but also captured more high-frequency detail. Our MCep bottleneck consists of three 1D convolutional layers with a filter width of 5 and ReLU activations between layers. The output channels of each convolutional layer are 20, 20, and 12, respectively. We train our PS-WaveNet vocoder on the bottlenecked MCep features concatenated with the one-hot-encoded F0 contour.
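A compact PyTorch sketch of the bottleneck just described is given below: three width-5 Conv1d layers with 20, 20, and 12 output channels and ReLUs between layers, whose output is concatenated with the one-hot F0 contour before conditioning the vocoder. The padding choice and the absence of an activation after the last layer are assumptions.

```python
# Sketch of the MCep bottleneck described above (illustrative, PyTorch).
import torch
import torch.nn as nn

class MCepBottleneck(nn.Module):
    def __init__(self, in_channels=21):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, 20, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(20, 20, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(20, 12, kernel_size=5, padding=2))
        # No activation after the final layer (assumption).

    def forward(self, mceps, f0_one_hot):
        # mceps: (batch, 21, frames); f0_one_hot: (batch, n_f0_bins, frames)
        bottlenecked = self.layers(mceps)
        # The vocoder is conditioned on the bottlenecked MCeps concatenated
        # with the one-hot-encoded F0 contour.
        return torch.cat([bottlenecked, f0_one_hot], dim=1)
```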
4. Experimental Design
We use the 360-hour clean training data partition of the LibriTTS dataset [32] to train our F0 generation networks. We train our PS-WaveNet vocoders and evaluate all models using LJSpeech [33] as well as three single-speaker datasets similarly constructed from LibriVox recordings. The reader ID, gender, book, and amount of training data for each speaker are given in Table 1. For evaluation, we use 20 held-out utterances (10 questions and 10 statements) from each speaker with duration between 2 and 10 seconds. For all subjective listening tasks, we collect 25 responses from each of our 48 US participants on Amazon Mechanical Turk (AMT).
We train DAR and C-DAR with a batch size of 32 utterances and an ADAM optimizer [34]. We find that validation loss corresponds poorly with naturalness. Instead of early stopping, we train for 15 epochs (3.6k steps per epoch) and manually listen to results from the LibriTTS validation set from epochs 5-15. We find that DAR and C-DAR produce the most natural F0 contours after 9 epochs. We train one PS-WaveNet for each speaker in Table 1. We train for 475k steps with a batch size of 8 and an ADAM optimizer whose learning rate is halved every 200k steps. We use 30 dilated convolution layers with a dilation rate that cycles with the layer index. Gaussian noise injection is used to improve the stability of synthesis [35].
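The vocoder optimizer schedule can be sketched as below. Only the halving every 200k steps, the 475k total steps, and the batch size of 8 come from the paper; the initial learning rate shown is a placeholder, since its value is not specified here.

```python
# Sketch of the PS-WaveNet optimizer schedule described above (PyTorch).
# initial_lr=1e-4 is purely an illustrative placeholder.
import torch

def make_optimizer(model, initial_lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
    # Halve the learning rate every 200k optimizer steps.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
    return optimizer, scheduler

# Training loop skeleton: 475k steps with batch size 8.
# for step in range(475_000):
#     loss = ...  # WaveNet cross-entropy over 10-bit mu-law targets
#     loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```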
We evaluate F0 generation models using both objective and subjective metrics, but emphasize that subjective metrics align best with our goal. Our objective metrics are the pitch RMSE and the negative log-likelihood (NLL) of the model relative to ground truth pitch. We do not report V/UV metrics, as all models correctly preserve the V/UV sequence. We report two subjective metrics, including a novel subjective metric that addresses a problem with previous F0 generation evaluation methods.

Prior work on F0 generation uses pitch-shifting vocoders to generate evaluation samples, which participants listen to and provide a naturalness rating [9, 10, 11]. Here, we address the issue where artifacts induced by pitch-shifting are proportional to the size of the shift. This penalizes natural-sounding F0 contours that have high ℓ1 or ℓ2 distance from the original pitch, and rewards unnatural F0 contours close to the original pitch. To address this, we low-pass filter each vocoded sample at 10 Hz above the maximum F0. The resulting audio preserves the F0 contour and amplitude envelope while removing all artifacts above the cutoff frequency. During the user study, participants are told that they are listening to the intonation of speech spoken by either a real person ("real") or synthesized by a computer ("fake"), and are asked to identify each sample as real or fake.

We implement our proposed user study to evaluate the naturalness of DAR and C-DAR. We use the PSOLA vocoder, and include as baselines a monotone model as well as two random models: replace, which replaces the F0 contour of each word with a contour from a random word uttered by the same speaker, and swap, which randomly swaps F0 contours of words within the sentence. For completeness, we also conduct the more typical MOS naturalness test without low-pass filtering using our proposed vocoder (see Section 4.3).

Our second subjective study evaluates the controllability of the C-DAR model on the task of synthesizing F0 after changing a question mark to a statement, or vice versa. We call this task "repunctuation". Our weak baseline is the original audio with the original punctuation. As a strong but unnatural baseline, we replace only the last two words of a sentence with a manually-selected F0 contour that is representative of the target punctuation. For DAR and C-DAR, we change the punctuation of the text input. For C-DAR, we also provide the F0 of the last two words as a user-specified F0 segment. Samples are vocoded using PSOLA and low-pass filtered as described above. AMT participants are given a sample and asked to select whether the sample sounds more like a statement or question.

Table 2: Results for objective F0 generation experiments and the subjective low-pass experiment. Lower scores are better for NLL and RMSE; higher is better for % Considered Real.

  Model              NLL    RMSE   % Considered Real
  Original            –     0.00        0.72
  Monotone            –     0.37        0.19
  Random (swap)       –     0.37        0.37
  Random (replace)    –     0.43        0.38
  DAR               8.15    0.43        0.57
  C-DAR             9.97    0.45        0.55
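The low-pass evaluation filter described in this section can be sketched as follows: attenuate everything above (maximum F0 + 10 Hz) so that only the intonation and amplitude envelope remain audible. The filter order and the use of a zero-phase Butterworth filter are assumptions, not details from the paper.

```python
# Sketch of the low-pass evaluation filter described above.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_for_prosody_evaluation(audio, sample_rate, f0_hz):
    """audio: mono waveform; f0_hz: frame-wise F0 in Hz with 0 for unvoiced frames."""
    # Cutoff 10 Hz above the maximum voiced F0 of the utterance.
    cutoff = np.nanmax(np.where(f0_hz > 0, f0_hz, np.nan)) + 10.0
    sos = butter(8, cutoff, btype="lowpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)
```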
Table 3: Pitch-shifting vocoder experiment results. PS-WN is our proposed PS-WaveNet. Columns: F0 source, vocoder, V/UV metrics (formatted as precision/recall), RMSE, and MOS.

We evaluate the consistency and naturalness of PS-WaveNet via two tasks. For both tasks, our baselines are PSOLA [36] and WORLD [15], two DSP-based vocoders with frame-wise F0 control. For our first task, we measure how closely the synthesized speech follows the given F0 contour via the F0 RMSE and V/UV errors between the input and output F0. We obtain the F0 of the output using our method described in Section 2.4. For our second task, AMT participants rate the naturalness of each sample between 1 (low naturalness) and 5 (high naturalness). We evaluate all vocoders using both the original F0 contour and the F0 contour generated by C-DAR. We include the original audio and intentionally degraded audio (quantized to 3 bits) as references for high and low naturalness, respectively.
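The consistency metrics for the first task can be sketched as below: F0 RMSE and V/UV precision/recall between the contour given to the vocoder and the contour re-estimated from its output. Computing the RMSE in cents over mutually voiced frames is an assumption about the exact formulation.

```python
# Sketch of the F0 consistency metrics described above.
import numpy as np

def f0_consistency_metrics(f0_in, f0_out):
    """f0_in, f0_out: frame-wise F0 in Hz, with 0 marking unvoiced frames."""
    voiced_in, voiced_out = f0_in > 0, f0_out > 0

    # V/UV treated as binary detection with the input contour as reference.
    true_pos = np.sum(voiced_in & voiced_out)
    precision = true_pos / max(np.sum(voiced_out), 1)
    recall = true_pos / max(np.sum(voiced_in), 1)

    # Pitch error in cents over frames that are voiced in both contours.
    both = voiced_in & voiced_out
    cents = 1200.0 * np.log2(f0_out[both] / f0_in[both])
    rmse = float(np.sqrt(np.mean(cents ** 2))) if both.any() else float("nan")
    return precision, recall, rmse
```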
5. Results
We present the F0 generation results in Table 2. We see that C-DAR achieves a comparable naturalness to DAR while enabling user control and context-awareness. This is further corroborated by the mean opinion scores (MOS) in Table 3, which show that participants considered C-DAR to be slightly more natural than DAR. The results of Table 2 also corroborate that NLL and RMSE are unsuitable metrics for F0 generation: neither correlates with subjective perceptions of naturalness. Further, we found that NLL could be trivially lowered by training C-DAR for fewer epochs, but with clearly degraded naturalness. This reinforces the need for domain-specific subjective metrics such as our proposed low-pass evaluation method.

Table 4: Repunctuation experiment results. A pairwise comparison of five models (Original, Heuristic, Monotone, DAR, C-DAR). All results indicate percent preference for the model specified in the same row over the model in the same column. Results are formatted as Q/S, where Q and S are the percent preferences when the target punctuations are question marks and periods, respectively. Heuristic is our strong baseline described in Section 4.2.

We present our repunctuation experiment results in Table 4. Relative to DAR, using C-DAR with short, user-specified F0 contours improves the adherence of the generated F0 contour to high-level semantic concepts (e.g., questions and statements). We find this to be especially true when the target punctuation is a question mark. We believe this is because statements are heavily over-represented in the dataset, leading to class imbalance and mode collapse. Our results indicate that simple user inputs make for an effective mode selector for prosody generation.
In Table 3, we see that our PS-WaveNet significantly outperforms the naturalness of WORLD while achieving comparable performance to PSOLA. We find that PS-WaveNet has a higher variance of MOS across speakers, ranging from 3.04 for speaker 5717 to 3.80 for speaker 94 when using C-DAR. In comparison, PSOLA achieves 3.40 and 3.58 MOS on speakers 5717 and 94, respectively. An additional pairwise test between PS-WaveNet and PSOLA using F0 contours generated with C-DAR confirms that PSOLA is preferred only for speakers 5717 and 11049.

The objective metrics reported in Table 3 highlight additional tradeoffs when selecting a pitch-shifting method. For example, we see that WORLD achieves the best RMSE despite its low MOS, but also tends to make unvoiced regions sound voiced (i.e., low V/UV recall). This makes WORLD more useful for pitch-shifting singing, for example, where high pitch accuracy is important and unvoiced regions are less common than in speech. PS-WaveNet achieves higher V/UV recall than WORLD, but at a cost to V/UV precision and RMSE. We hypothesize that the increase in inharmonicity due to lower V/UV precision also induces more pitch-tracking errors, including pitch-doubling errors which produce extremely high RMSE.
6. Conclusion
In this work, we present a deep autoregressive model that supports controllable, context-aware F0 generation; a pitch-shifting neural vocoder that allows explicit F0 conditioning; and novel subjective evaluation methods for F0 generation. We show in user studies that our controllable F0 model exhibits comparable naturalness to non-controllable baselines, and that our pitch-shifting neural vocoder exhibits comparable naturalness to DSP-based vocoders. There are many directions for future work, including real-time pitch-shifting vocoding and interaction design for prosody editing.

7. References

[1] D. Stanton, Y. Wang, and R. Skerry-Ryan, "Predicting expressive speaking style from text in end-to-end speech synthesis." IEEE, 2018, pp. 595–602.
[2] J. C. Wells, English Intonation PB and Audio CD: An Introduction. Cambridge University Press, 2006.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." IEEE, 2018, pp. 4779–4783.
[4] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[5] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019, vol. 32, pp. 3171–3180.
[6] K. E. Dusterhoff, A. W. Black, and P. A. Taylor, "Using decision trees within the tilt intonation model to predict F0 contours," in EUROSPEECH, 1999.
[7] A. Raux and A. W. Black, "A unit selection approach to F0 modeling and its application to emphasis." IEEE, 2003, pp. 700–705.
[8] K. Yu and S. Young, "Continuous F0 modeling for HMM based statistical parametric speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1071–1079, 2010.
[9] Z. Hodari, O. Watts, and S. King, "Using generative modelling to produce varied intonation for speech synthesis," arXiv preprint arXiv:1906.04233, 2019.
[10] X. Wang, S. Takaki, and J. Yamagishi, "Autoregressive neural F0 model for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1406–1419, 2018.
[11] X. Wang, S. Takaki, J. Yamagishi, S. King, and K. Tokuda, "A vector quantized variational autoencoder (VQ-VAE) autoregressive neural F0 model for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 157–170, 2019.
[12] X. Wang, S. Takaki, and J. Yamagishi, "An RNN-based quantized F0 model with multi-tier feedback links for text-to-speech synthesis," in Interspeech, 2017, pp. 1059–1063.
[13] W. Ping, K. Peng, K. Zhao, and Z. Song, "WaveFlow: A compact flow-based model for raw audio," arXiv preprint arXiv:1912.01219, 2019.
[14] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5916–5920.
[15] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[16] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen et al., "Hierarchical generative modeling for controllable speech synthesis," arXiv preprint arXiv:1810.07217, 2018.
[17] Y. Lee and T. Kim, "Robust and fine-grained prosody control of end-to-end speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5911–5915.
[18] R. Habib, S. Mariooryad, M. Shannon, E. Battenberg, R. Skerry-Ryan, D. Stanton, D. Kao, and T. Bagby, "Semi-supervised generative modeling for controllable speech synthesis," in International Conference on Learning Representations, 2020.
[19] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, "Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis," arXiv preprint arXiv:2002.03785, 2020.
[20] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, "Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation," in Interspeech, 2019, pp. 196–200.
[21] S. Im, J. Cole, and S. Baumann, "The probabilistic relationship between pitch accents and information status in public speech," in Proceedings of Speech Prosody, vol. 9, 2018.
[22] Y. Mass, S. Shechtman, M. Mordechay, R. Hoory, O. Shalom, G. Lev, and D. Konopnicki, "Word emphasis prediction for expressive text to speech," in Interspeech, 2018, pp. 2868–2872.
[23] C. Gussenhoven, "Types of focus in English," in Topic and Focus. Springer, 2008, pp. 83–100.
[24] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
[25] F. Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?" arXiv preprint arXiv:1511.05101, 2015.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[27] J. Yuan and M. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, vol. 123, p. 3878, 2008.
[28] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[29] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation." IEEE, 2018, pp. 161–165.
[30] R. Yamamoto, "r9y9/wavenet_vocoder," https://github.com/r9y9/wavenet_vocoder, 2019.
[31] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Interspeech, 2017, pp. 1118–1122.
[32] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv preprint arXiv:1904.02882, 2019.
[33] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[34] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.
[36] F. Charpentier and M. Stella, "Diphone synthesis using an overlap-add technique for speech waveforms concatenation."