Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, School of Computer Science, Northwestern Polytechnical University, China
Xiaomi, Beijing, China
{fyyang, syang, lxie}@nwpu-aslp.org, {wuqinghua, wangyujun}@xiaomi.com

Abstract
Attention-based seq2seq text-to-speech systems, especially those using self-attention networks (SAN), have achieved state-of-the-art performance. But an expressive corpus with rich prosody is still challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to sufficiently exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contribution of each SAN layer. Experiments on two expressive corpora show that our approach produces more natural speech with much richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.
Index Terms: speech synthesis, self-attention network, prosody
1. Introduction
Recently, the naturalness of corpus-based text-to-speech (TTS) has been significantly improved with the use of the attention-based sequence-to-sequence (seq2seq) mapping framework [1, 2]. Such so-called end-to-end (E2E) systems directly employ a text encoder network to learn linguistic, syntactic and semantic information from simple character or phoneme sequences. The sequence of aggregated textual representations is further attended to by an acoustic decoder network through some attention mechanism, producing predicted speech representations (e.g., mel-spectrograms) that are subsequently transformed to waveforms via a neural vocoder.
Sentential context [3] mainly involves the latent syntactic and semantic information embedded in the text, and it has recently been proved important in natural language processing (NLP) tasks [3, 4]. It might be essential to the naturalness of speech synthesis as well, especially for a system built upon an expressive corpus with rich prosodic variations. The seq2seq framework extracts prosodic information solely from the text encoder in an unsupervised way, which easily collapses to an averaged expression for expressive content. To better exploit the sentential context in an E2E framework, one way is feature engineering, as the previous generation of TTS systems did. For example, a recent study has shown that exploiting syntactic features in a parsed tree is beneficial to the richness of the prosodic outcomes, leading to more natural synthesized speech [5].

However, modeling expressiveness in text-to-speech is still challenging, as it involves different levels of syntactic and semantic information reflected in intensity, rhythm, intonation and other prosody-related factors, and it is difficult to explicitly define the relations between the syntactic/semantic factors and the prosodic factors. To model expressivity, the global style tokens (GST) family [6, 7] learns style embeddings from a reference audio in an unsupervised way, which lets the synthesized speech imitate the style of the reference audio. Although the style embedding from a reference audio is helpful to control the style of synthesized speech, it is hard to choose an appropriate reference audio for each input sentence. Likewise, the variational autoencoder (VAE) models styles or expressivity in a similar way [8].

Recent studies have revealed that self-attention networks (SAN) [9, 10, 11, 12] have a strong ability to capture global prosodic information, leading to more natural synthesized speech. Moreover, as unveiled by recent NLP studies, different SAN encoder layers can capture latent syntactic and semantic properties of the input sentence at different levels [3, 4]. But current SAN-based TTS systems only leverage the highly aggregated latent text representation, usually the output of the text encoder, to guide the speech generation process. Although this highly aggregated representation can be treated as a global description of the sentential context, according to our experiments it is not enough to generate expressive content, as it may disperse the contribution of the sentential context embedded in the intermediate SAN layers [13].

In this paper, to excavate the sentential context for expressive speech synthesis, we propose a context extractor to sufficiently exploit sentential context over an expressive corpus for seq2seq-based TTS. Specifically, we utilize different levels of representations from the SAN-based text encoder to build a context extractor, which helps to extract different levels of syntactic and semantic information [14]. In detail, our context extractor first collects prosody-related sentential context information from different SAN encoder layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contribution of each SAN layer. Experiments on two expressive corpora show that our approach produces more natural speech with richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.

First author performed part of this work at Xiaomi. Lei Xie is the corresponding author. This work was partially supported by the National Key Research and Development Program of China (No. 2017YFB1002102).
Figure 1: Proposed architecture with context aggregation, based on Tacotron2 and a SAN encoder.
2. Proposed Model
Figure 1 illustrates our proposed approach for exploiting deep sentential contexts for expressive speech synthesis. It contains a modified self-attention based text encoder, an auto-regressive decoder, and GMM-based attention [15] to bridge the encoder and the decoder. WaveGlow [16] is adopted to reconstruct waveforms from the mel spectrogram. We augment the encoder with a context aggregation module, which is described in detail below.
2.1. Self-attention based encoder

The self-attention based sequence-to-sequence framework has been successfully applied to speech synthesis [9, 10, 17]. In the basic SAN-based text encoder, there is a stack of L blocks, each of which has two sub-networks: a multi-head attention and a feed-forward network. Residual connections and layer normalization are applied to both sub-networks. Formally, from the previous encoder block output $H^{l-1}$, the first sub-network output $C^l$ and the second sub-network output $H^l$ are calculated as:

$C^l = \mathrm{LN}(\mathrm{MultiHead}(\mathrm{head}_1^l, \ldots, \mathrm{head}_H^l) + H^{l-1})$,   (1)
$H^l = \mathrm{LN}(\mathrm{FFN}(C^l) + C^l)$,   (2)

where $\mathrm{MultiHead}(\cdot)$, $\mathrm{FFN}(\cdot)$ and $\mathrm{LN}(\cdot)$ are the multi-head attention, feed-forward network and layer normalization, respectively. Each head in the multi-head attention, split from the previous encoder block output, is computed by:

$\mathrm{head}_h = \alpha \cdot V = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$,   (3)

where $\{Q, K, V\}$ represent the queries, keys and values, $d$ is the dimension of the hidden state, and $\alpha$ represents the attention weight matrix for each head.

2.2. Direct aggregation

Although SANs have the ability to directly capture global dependencies over the whole input sequence [18], they may not appropriately exploit the sentential context, because they calculate the relevance between characters without considering the contextual information [3, 14]. Besides, the weighted sum over the lower layers in SANs only aggregates the global contextual information, which may weaken the contribution of the sentential context extracted in each block.

To fully make use of the contexts extracted from each block, we propose a context extractor to aggregate the different levels of contexts into a comprehensive sentence representation. For the $l$-th self-attention block, we extract the intermediate context from the output $H^l$ through:

$g^l = g(H^l) = \mathrm{MeanPool}(\mathrm{Conv1d}(H^l))$,   (4)

where Conv1d denotes a 1-d convolution, MeanPool represents mean pooling [19], $g(\cdot)$ denotes the function used to summarize the outputs of the self-attention layers, and $g^l$ represents the sentential context from the $l$-th block. A straightforward and intuitive choice to aggregate the different levels of contexts is a concatenation operation, with residual connection and layer normalization [9]:

$C^g = \mathrm{LN}(\mathrm{Concat}(g^0, \ldots, g^L) + g^L)$,   (5)

where $g^0$ represents the input of the first self-attention layer summarized through Eq. (4). To further integrate the information concatenated from all sentential contexts, we use another feed-forward network and layer normalization as the final aggregation function [14][20]:

$g = \mathrm{LN}(\mathrm{FFN}(C^g) + C^g)$.   (6)

Here, $g$ is the final sentential context.

2.3. Weighted aggregation

With direct aggregation, the sentential contexts of all blocks are simply concatenated to guide the auto-regressive generation, which does not consider the varying importance of each $g^l$. Assuming the sentential contexts of different blocks may contribute differently to the expressiveness of the synthesized speech, we utilize a self-learned weighted aggregation module across layers to capture these different levels of contribution.

In detail, we employ a multi-head attention to learn the contribution of each block. The individual sentential contexts $\{g^0, g^1, \ldots, g^L\}$ are treated as the attention memory for the attention-based weighted aggregation. Specifically, we transpose the dimension of sequence length with the number of heads in the multi-head attention to combine the contextual information across layers. Therefore, we modify Eq. (5) to obtain the weighted contexts:

$C^g = \mathrm{LN}(\mathrm{MultiHead}(g^0, \ldots, g^L) + g^L)$,   (7)

where the modified $C^g$ offers a more precise control of aggregation for each $g^l$.
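To make the aggregation concrete, the following is a minimal PyTorch sketch of the context extractor under stated assumptions: the module and parameter names are illustrative, the linear projection after concatenation (needed to make the residual in Eq. (5) dimensionally consistent) is our assumption, and the weighted aggregation is implemented as attention from $g^L$ over the stack of per-layer contexts, which is a simplified reading of the head/length transpose described for Eq. (7) rather than the authors' exact implementation.

```python
# A minimal sketch of the context extractor (Eqs. 4-7), assuming PyTorch.
# Names (ContextExtractor, d_model, etc.) are illustrative; the paper does not
# give implementation details, so the projection and the attention layout here
# are assumptions rather than the authors' exact code.
import torch
import torch.nn as nn


class ContextExtractor(nn.Module):
    def __init__(self, d_model=512, num_layers=6, kernel_size=3, n_heads=8):
        super().__init__()
        # One Conv1d + mean-pooling summarizer per encoder block output, Eq. (4).
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
            for _ in range(num_layers + 1)  # g^0 ... g^L (block input + L blocks)
        )
        # Assumed projection so the concatenated contexts match d_model before
        # the residual connection in Eq. (5).
        self.concat_proj = nn.Linear((num_layers + 1) * d_model, d_model)
        self.norm_cat = nn.LayerNorm(d_model)
        # FFN + LayerNorm used as the final aggregation function, Eq. (6).
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model)
        )
        self.norm_ffn = nn.LayerNorm(d_model)
        # Multi-head attention over the per-layer contexts for weighted
        # aggregation, a simplified reading of Eq. (7).
        self.layer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def summarize(self, hidden_states):
        # hidden_states: list of (batch, time, d_model) tensors, H^0 ... H^L.
        contexts = []
        for conv, h in zip(self.convs, hidden_states):
            c = conv(h.transpose(1, 2))        # (batch, d_model, time)
            contexts.append(c.mean(dim=-1))    # mean pool over time -> g^l
        return torch.stack(contexts, dim=1)    # (batch, L+1, d_model)

    def forward(self, hidden_states, weighted=True):
        g = self.summarize(hidden_states)      # g^0 ... g^L
        g_last = g[:, -1]                      # g^L
        if weighted:
            # Weighted aggregation: attend over the per-layer contexts.
            attended, _ = self.layer_attn(g_last.unsqueeze(1), g, g)
            c_g = self.norm_cat(attended.squeeze(1) + g_last)
        else:
            # Direct aggregation: concatenate all contexts, Eq. (5).
            concat = g.flatten(start_dim=1)    # (batch, (L+1)*d_model)
            c_g = self.norm_cat(self.concat_proj(concat) + g_last)
        # Final aggregation with FFN + LayerNorm, Eq. (6).
        return self.norm_ffn(self.ffn(c_g) + c_g)
```

The hidden_states argument would hold the block-0 input together with the outputs of the L encoder blocks from Eqs. (1)-(2); how the resulting sentence-level vector is injected into the decoder is not fully specified in the text, so that wiring is left out of the sketch.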
3. Experiments
To investigate the effectiveness of modeling expressiveness, we carried out experiments on two expressive Mandarin corpora: the publicly-available Blizzard Challenge 2019 corpus [21] from a male talk-show speaker, and an internal voice-assistant corpus from a female speaker. The talk-show (TS) corpus contains about 8 hours of speech, and the voice assistant (VA) corpus contains about 40 hours of speech. Both corpora are expressive in prosodic delivery and are separated into non-overlapping training and testing sets (with a 9:1 data ratio). Besides, we also utilize a publicly-available standard reading-style corpus [22] with less expressivity to see how our approach performs. This corpus, named DB1, contains 10 hours of female speech with a consistent reading style. Again, we separate the corpus into training and testing sets with a 9:1 ratio. Linguistic inputs include phones, tones, character segments and three levels of prosodic segments: prosodic word (PW), phonological phrase (PPH) and intonation phrase (IPH). We use 80-band mel-spectrograms extracted from 22.05 kHz waveforms (for TS and VA) and 16 kHz waveforms (for DB1) as acoustic targets. We calculate mel cepstral distortion (MCD) on the test set for objective evaluation. For subjective evaluation, we conduct mean opinion score (MOS) and A/B preference tests on 30 randomly selected test-set samples; 20 native Chinese listeners participated in the tests.

Table 1: MCD scores over the two expressive corpora.

Corpus   BASE   SA     SA-DA   SA-WA
TS       8.01   7.48   7.42    -
VA       7.60   7.37   7.32    -
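As a reference for the objective metric, a minimal sketch of mel cepstral distortion between two time-aligned mel-cepstral sequences follows; the cepstral order, the exclusion of the 0th coefficient and the alignment procedure behind the numbers in Table 1 are not specified in the paper, so they are assumptions here.

```python
# A minimal MCD sketch, assuming two already time-aligned mel-cepstral
# sequences of shape (frames, order); dropping c0 and the cepstral order are
# assumptions, since the paper does not specify its MCD configuration.
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    # Drop the 0th coefficient (energy), as is common practice.
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    # Per-frame MCD in dB, then averaged over frames.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```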
We use the standard encoder-decoder structure of Tacotron2 [2] as the baseline, but GMM attention is adopted instead because it brings superior naturalness and stability [15]. For the networks using the SAN-based encoder, a 3-layer CNN is first applied to the input text embeddings with positional information. Each self-attention block includes an 8-head self-attention and a feed-forward sub-network consisting of two linear transformations with 2048 and 512 hidden units. Residual connections and layer normalization are applied to these two sub-networks. There are 6 self-attention blocks in total. In the aggregation module, we feed $g^L$ into the aggregation attention function twice for convenience of implementation, where the number of heads in the multi-head attention equals the sequence length and the shape of the weight matrix is [batch, length, ·, ·]. For the remaining part, we adopt the auto-regressive decoder described in [2]. We use WaveGlow as the vocoder, following the structure in [16] and trained on the same training set. These settings are also collected in the sketch after the following system list. We built the following systems for comparison:

• BASE: baseline system following Tacotron2 [2] with a slightly modified GMMv2 attention [15].
• SA: another baseline system with the SAN-based encoder described in Section 2.1.
• SA-DA: SAN-based encoder with the direct aggregation module fusing all sentential contexts, described in Section 2.2.
• SA-WA: SAN-based encoder with the weighted aggregation module fusing all sentential contexts, described in Section 2.3.
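For convenience, the stated hyper-parameters of the SAN-based systems are collected below; this is only a summary of values given in the text, and the field names are illustrative rather than from any released configuration.

```python
# Hyper-parameters of the SAN-based systems as stated above; field names are
# illustrative, and values not given in the paper are simply omitted.
SAN_TTS_CONFIG = {
    "text_prenet_conv_layers": 3,        # 3-layer CNN on text embeddings with positional info
    "encoder_self_attention_blocks": 6,
    "self_attention_heads": 8,
    "ffn_hidden_units": (2048, 512),     # two linear transformations per block
    "attention": "GMMv2 (location-relative) [15]",
    "decoder": "Tacotron2 autoregressive decoder [2]",
    "vocoder": "WaveGlow [16]",
    "mel_bands": 80,
    "sampling_rate_hz": {"TS": 22050, "VA": 22050, "DB1": 16000},
}
```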
Table 1 shows the MCD results of the different systems. It demonstrates that the SAN-based encoder has a lower MCD than the RNN-based encoder on both expressive corpora. It also shows that modeling the sentential context can further improve the performance of the SAN-based encoder. Besides, weighted aggregation is a better way than direct aggregation to extract the deep sentential context. With the help of deep sentential context, the SA-WA system achieves the lowest MCD on both expressive corpora, which indicates that its synthesized speech samples are the most similar to the real speech samples.
Table 2: MOS over the two expressive corpora, with 95% confidence intervals.
Corpus   BASE   SA   SA-DA   SA-WA
TS       3.84   -    -       -
VA       -      -    -       -
Figure 2: AB preference results on TS, with 95% confidence intervals and p-values below the significance threshold.

We conduct AB preference tests and MOS tests on the two expressive test sets, which include a large number of modal particles, interrogatives and exclamations. The listeners are asked to select the preferred audio according to their overall impression of the expressiveness of the testing samples (samples can be found at https://fyyang1996.github.io/context/). The AB preference results are shown in Figures 2 and 3 for TS and VA, respectively. MOS results are reported in Table 2.

For the baseline systems, we find that the SA system with the SAN-based encoder performs better on expressiveness than the conventional BASE system. This indicates that using self-attention layers as the text encoder may capture features that better represent expressiveness, in accordance with our previous findings [12]. For the proposed context extractor, we find that, by introducing direct aggregation across all the self-attention layers, system SA-DA achieves significantly better performance than the solely self-attention based encoder system SA. This is confirmed by both the AB preference and MOS tests on the two tested corpora. By further replacing the simple concatenation operation with multi-head attention aggregation (i.e., weighted aggregation), system SA-WA brings an extra performance gain over system SA-DA. Listeners particularly prefer the SA-WA system according to the AB preference test. In summary, the results unveil that the deep sentential context encoder achieves significantly better performance than the baseline systems, showing that modeling different levels of latent syntactic and semantic information through a deep encoder is effective for generating expressive speech. This conclusion is consistently confirmed on the two expressive corpora.

We also quickly examine the performance of our approach on a less-expressive reading-style corpus, DB1 [22], to see how our sentential context extractor performs there. Here, we only compare the best-performing SA-WA system with the BASE system. The MCD scores for BASE and SA-WA are 5.78 and 5.72, respectively. The AB preference is illustrated in Figure 4.
Figure 3: AB preference results on VA, with 95% confidence intervals and p-values below the significance threshold.
Figure 4: AB preference results on DB1, with 95% confidence intervals and p-values below the significance threshold.

Interestingly, the effectiveness of our sentential context extractor is not salient on this less-expressive corpus, as shown by the close MCD and AB preference scores of the two systems. In other words, our sentential context extractor works better on expressive datasets, which is further confirmed in the following analysis.

To further evaluate the expressiveness with statistical significance, we extract the acoustic features commonly associated with prosody: relative energy within each phoneme (E), duration in ms (Dur.) and fundamental frequency in Hertz (F0), which represent the phoneme-level intensity, rhythm and intonation of the audio, respectively. Following [23], we measure the three prosody attributes for each phoneme through additional alignments. The ratio of the average signal magnitude within a phoneme to the average magnitude of the entire utterance is used as the relative energy of the phoneme. We count the number of frames within a phoneme as the duration of the phoneme. And the mean value of F0 within a phoneme is regarded as the third prosody attribute. To estimate these statistics, we synthesized 100 random samples from the test set and calculated the Pearson correlation coefficient between each system and the ground truth. The higher the Pearson correlation coefficient, the more accurately the model predicts the prosody attribute.

Table 3 shows that our proposed SA-WA system obtains the highest correlation scores in all three prosody attributes, which demonstrates that our approach has the best reconstruction performance in phoneme-level intensity, rhythm and intonation. Additionally, [24] reveals that the order in which prosody attributes are captured is always energy, duration and then F0. Energy is the amplitude of the signal, directly related to the reconstruction loss, and is easier to capture, while F0 is the most difficult to capture as it is modeled implicitly. However, our SA-WA system achieves approximately 20% gains over the BASE system in F0, which is far more than the improvement of approximately 6% in energy and duration. Based on this, we believe that the proposed approach has a strong ability to model F0, the most difficult of the three prosody attributes to capture.

Figure 5 shows the F0 trajectories for a synthesized test utterance. The sentence begins with a modal particle (heng1), which conveys a sense of disgust in Mandarin. In this case, the high rise of F0 produced by the SA-WA system delivers more of the disgusted mood to listeners. This example shows that the proposed sentential context extractor can model expressive patterns better than the baseline systems.
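A rough sketch of this per-phoneme measurement and the correlation statistic is given below, assuming phoneme boundaries (in frames) and frame-level F0 are already available from an external aligner and pitch tracker; all function and variable names are illustrative rather than taken from [23].

```python
# A sketch of the per-phoneme prosody attributes and the correlation analysis
# described above, assuming phoneme boundaries (in frames) and frame-level F0
# are already available; all names here are illustrative.
import numpy as np
from scipy.stats import pearsonr

def phoneme_prosody_attributes(magnitude, f0, boundaries):
    """magnitude, f0: per-frame arrays; boundaries: list of (start, end) frames."""
    utt_mag = np.mean(magnitude)
    energy, duration, pitch = [], [], []
    for start, end in boundaries:
        seg_mag = magnitude[start:end]
        seg_f0 = f0[start:end]
        energy.append(np.mean(seg_mag) / utt_mag)   # relative energy (E)
        duration.append(end - start)                # duration in frames (Dur.)
        voiced = seg_f0[seg_f0 > 0]
        pitch.append(np.mean(voiced) if len(voiced) else 0.0)  # mean F0
    return np.array(energy), np.array(duration), np.array(pitch)

def attribute_correlation(gt_values, syn_values):
    # Pearson correlation between ground-truth and synthesized attributes,
    # pooled over the phonemes of the evaluated utterances.
    r, _ = pearsonr(gt_values, syn_values)
    return r
```

The same per-phoneme attribute vectors also feed the diversity measure reported later, where per-utterance standard deviations are averaged instead of correlated.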
Table 3: Correlation in relative energy, duration and F0 within a phoneme, computed from different models on TS.

        BASE    SA      SA-DA   SA-WA
E       0.755   0.776   0.781   -
Dur.    0.617   0.638   0.641   -
F0      0.42    0.426   0.437   -
Figure 5: F0 values of a test utterance (哼，想你有什么用，你又不来陪我玩。 "Hmph, what's the use of missing you? You won't even come to play with me.") generated by different systems. Audio samples can be found in Section 1.1 of the demo page.
Prosody Diversity
An expressive TTS system should be able to generate speech with a large prosody diversity. Consequently, we also measure the standard deviation of the three prosody attributes at the phoneme level, following [23], and report the average standard deviation across all 100 utterances for statistical significance. Table 4 demonstrates that the SA-WA system has the highest diversity in phoneme-level intensity, rhythm and intonation among all systems, and its values are the closest to the ground truth (GT). We believe that the SA-WA system has a better ability to model prosody variations on expressive datasets.
Table 4: Diversity values using the average standard deviation computed across 100 samples on TS.
        BASE    SA      SA-DA   SA-WA   GT
E       0.238   0.277   0.285   -       -
Dur.    -       -       -       -       -
F0      -       -       -       -       -
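A minimal sketch of this diversity measure, reusing the per-phoneme attributes from the earlier sketch, follows; averaging the per-utterance standard deviation over the 100 evaluated utterances follows the description above, while the function name is illustrative.

```python
# Average per-utterance standard deviation of a prosody attribute, as a rough
# sketch of the diversity measure described above; names are illustrative.
import numpy as np

def prosody_diversity(per_utterance_attributes):
    """per_utterance_attributes: list of 1-D arrays, one per synthesized utterance,
    each holding a phoneme-level attribute (e.g. relative energy)."""
    stds = [np.std(values) for values in per_utterance_attributes if len(values) > 1]
    return float(np.mean(stds))  # averaged across the evaluated utterances
```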
4. Conclusion
Seq2seq-based TTS directly maps the character/phoneme sequence to the acoustic feature sequence using an encoder-decoder paradigm. The encoder functions as a sentential context extractor that aggregates latent semantic and syntactic information, which highly correlates with the expressiveness of the speech synthesized by the decoder. In this paper, we proposed a context extractor, built upon the SAN-based text encoder, to sufficiently exploit the text-side sentential context and produce more expressive speech. With the belief that different self-attention layers may capture different levels of latent syntactic and semantic information, as discovered by recent NLP research, we proposed two context aggregation strategies: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contribution of each SAN layer. Experiments on two expressive corpora show that the two strategies can produce more natural and expressive speech, and that weighted aggregation is superior. Comprehensive analysis of the synthesized speech demonstrates that our sentential context extractor has a better ability to reconstruct prosody-related acoustic features and to model prosody diversity.

5. References

[1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," ISCA, 2017, pp. 4006–4010.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," IEEE, 2018, pp. 4779–4783.
[3] X. Wang, Z. Tu, L. Wang, and S. Shi, "Exploiting sentential context for neural machine translation," ACL, 2019, pp. 6197–6203.
[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," vol. 1, pp. 2227–2237, 2018.
[5] H. Guo, F. K. Soong, L. He, and L. Xie, "Exploiting syntactic features in a parsed tree to improve end-to-end TTS," ISCA, 2019, pp. 4460–4464.
[6] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," vol. 80, PMLR, 2018, pp. 5167–5176.
[7] X. An, Y. Wang, S. Yang, Z. Ma, and L. Xie, "Learning hierarchical representations for expressive speaking style in end-to-end speech synthesis," IEEE, 2019, pp. 184–191.
[8] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, "Learning latent representations for style control and transfer in end-to-end speech synthesis," IEEE, 2019, pp. 6945–6949.
[9] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," vol. 33, AAAI, 2019, pp. 6706–6713.
[10] Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language," IEEE, 2019, pp. 6905–6909.
[11] S. Yang, H. Lu, S. Kang, L. Xie, and D. Yu, "Enhancing hybrid self-attention structure with relative-position-aware bias for speech synthesis," IEEE, 2019, pp. 6910–6914.
[12] F. Yang, S. Yang, P. Zhu, P. Yan, and L. Xie, "Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias," IEEE, 2019, pp. 208–213.
[13] X. Shi, I. Padhi, and K. Knight, "Does string-based neural MT learn source syntax?" ACL, 2016, pp. 1526–1534.
[14] Z.-Y. Dou, Z. Tu, X. Wang, S. Shi, and T. Zhang, "Exploiting deep representations for neural machine translation," ACL, 2018, pp. 4253–4262.
[15] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, "Location-relative attention mechanisms for robust long-form speech synthesis," arXiv preprint arXiv:1910.10288, 2019.
[16] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," IEEE, 2019, pp. 3617–3621.
[17] S. Yang, H. Lu, S. Kang, L. Xue, J. Xiao, D. Su, L. Xie, and D. Yu, "On the localness modeling for the self-attention based end-to-end speech synthesis," Neural Networks, vol. 125, pp. 121–130, 2020.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," 2017, pp. 5998–6008.
[19] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III, "Deep unordered composition rivals syntactic methods for text classification," vol. 1, ACL, 2015, pp. 1681–1691.
[20] Q. Wang, F. Li, T. Xiao, Y. Li, Y. Li, and J. Zhu, "Multi-layer representation fusion for neural machine translation," arXiv preprint arXiv:2002.06714, 2020.
[24] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, "Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis," arXiv preprint arXiv:2002.03785, 2020.