Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder

Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, and Jae-Min Kim
NAVER Corp., Seongnam, Korea; Search Solutions Inc., Seongnam, Korea; LINE Corp., Tokyo, Japan
Abstract
This paper proposes a modeling-by-generation (MbG) excitation vocoder for a neural text-to-speech (TTS) system. Recently proposed neural excitation vocoders can realize qualified waveform generation by combining a vocal tract filter with a WaveNet-based glottal excitation generator. However, when these vocoders are used in a TTS system, the quality of synthesized speech is often degraded owing to a mismatch between the training and synthesis steps. Specifically, the vocoder is trained separately from the acoustic model front-end; therefore, estimation errors of the acoustic model are inevitably boosted throughout the synthesis process of the vocoder back-end. To address this problem, we propose to incorporate an MbG structure into the vocoder's training process. In the proposed method, the excitation signal is extracted using the spectral parameters generated by the acoustic model, and the neural vocoder is then optimized not only to learn the target excitation's distribution but also to compensate for the estimation errors occurring from the acoustic model. Furthermore, as the generated spectral parameters are shared between the training and synthesis steps, their mismatch conditions can be reduced effectively. The experimental results verify that the proposed system provides high-quality synthetic speech, achieving a mean opinion score of . within the TTS framework.

Index Terms: neural text-to-speech, WaveNet, ExcitNet, modeling-by-generation vocoder
1. Introduction

Generative models for raw speech waveforms have significantly improved the quality of neural text-to-speech (TTS) systems [1, 2]. Specifically, by conditioning acoustic features to the network input, neural vocoding models such as WaveNet, WaveRNN, and WaveGlow successfully generate a time sequence of speech signals [2-5]. More recently, neural excitation vocoders such as GlotNet, ExcitNet, LP-WaveNet, and LPCNet [6-10] have exploited the advantages of linear prediction (LP)-based parametric vocoders. In this type of vocoder, an adaptive predictor is used to decouple the formant-related spectral structure from the input speech signal, and the probability distribution of its residual signal (i.e., the excitation signal) is then modeled by the vocoding network. As variation in the excitation signal is constrained only by vocal cord movement, the training and generation processes become much more efficient.

However, because the vocoding and acoustic models are trained separately, it is not known whether combining them within the TTS framework would benefit synthesis quality. Furthermore, as parameters estimated from the acoustic model are used as a direct input to the vocoding model in the synthesis step, estimation errors of the acoustic features can be propagated throughout the synthesis process. It is therefore crucial to model the interactions between the acoustic and vocoding elements during the training process in order to achieve the best overall performance of the TTS system.

In this paper, we propose a neural excitation model based on modeling-by-generation (MbG), in which the spectral parameters generated from the acoustic model are utilized in the neural vocoder's training process. Specifically, the target excitation is defined as a combination of the prediction errors from the LP analysis and those from the acoustic model. The vocoding model is then optimized to learn the distribution of the target excitation while compensating for the errors from the acoustic model. It has been reported elsewhere that training the neural vocoder with generated acoustic parameters improves synthetic quality [11]. Although the MbG method is similar to this approach, there is a clear difference in that MbG aligns even the target excitation signal with the acoustic model's generated spectral parameters.

We investigated the effectiveness of the proposed method by conducting subjective evaluation tasks. The MbG structure can be extended to any neural excitation vocoder that uses LP coefficients, but the focus here is on the WaveNet-based ExcitNet vocoder [7]. The experimental results show that a TTS system with the proposed MbG-ExcitNet vocoder provides significantly better perceptual quality than a similarly configured system with a conventional vocoder. In particular, our TTS framework achieves a . mean opinion score (MOS).
2. Related work
The idea of using an MbG structure is not new. In a study of parametric glottal vocoders, Juvela et al. [12] first proposed the closed-loop extraction of glottal excitation from the generated spectral parameters, and our own previous work proposed the MbG structure to compensate for missing noise components in generated glottal signals [13]. However, it was not possible to fully utilize the effectiveness of the MbG training strategy because our experiments were only performed with simple deep learning models, including stacked feed-forward and/or long short-term memory (LSTM) networks.

Our aim here was to extend the usage of the MbG structure to recently proposed neural excitation models (e.g., ExcitNet) with autoregressive acoustic models (e.g., Tacotron) [11, 14, 15]. As the accuracy of acoustic models has been significantly improved, it is now possible to extract stable excitation signals from the generated spectral parameters. Furthermore, the ExcitNet vocoder directly models the time-domain excitation sequence, which enables straightforward application of the MbG structure to the training process. As a result, the entire model can be trained stably and easily while the perceptual quality of the synthesized speech is significantly improved.

Figure 1: An ExcitNet vocoder for a TTS system: (a) conventional training; (b) proposed MbG training; and (c) synthesis methods.
Figure 2: Negative log-likelihood (NLL) obtained during the training process with respect to the plain ExcitNet and MbG-based ExcitNet (MbG-ExcitNet) training methods.
3. ExcitNet TTS systems
The basic WaveNet framework is an autoregressive network that generates a probability distribution of discrete speech symbols from a fixed number of past samples [16]. The ExcitNet vocoder is an advanced version of this network that takes advantage of both the LP vocoder and the WaveNet structure. In an ExcitNet framework, an LP-based adaptive predictor is used to decouple the spectral formant structure from the input speech signal (Fig. 1a). The WaveNet model is then used to model the distribution of the prediction residuals (i.e., the excitation) as follows:

p(\mathbf{e} \mid \mathbf{h}) = \prod_{n=1}^{N} p(e_n \mid e_1, \ldots, e_{n-1}, \mathbf{h}), \quad (1)

e_n = x_n - \sum_{k=1}^{p} \alpha_k x_{n-k}, \quad (2)

where x_n and e_n denote the n-th samples of speech and excitation, respectively; \alpha_k denotes the k-th LP coefficient of order p; and \mathbf{h} denotes the conditional inputs composed of acoustic parameters.

In the speech synthesis step (Fig. 1c), the acoustic parameters of the given input text are generated by a pre-trained acoustic model. These parameters are then used as conditional inputs for the WaveNet model to generate the corresponding time sequence of the excitation signal. Finally, the speech signal is reconstructed by passing the generated excitation signal through the LP synthesis filter.

To further improve the quality of the synthesized speech, we propose the incorporation of an MbG structure into the training process of the ExcitNet vocoder. As illustrated in Fig. 1a, conventional vocoding models are trained separately from the acoustic model, even though the generated acoustic parameters, which contain estimation errors, are used as direct conditional inputs (Fig. 1c). This inevitably causes quality degradation of the synthesized speech, as the estimation errors from the acoustic model are boosted non-linearly throughout the synthesis process in the vocoder back-end.

Fig. 1b shows the proposed MbG training method, which uses closed-loop extraction of the excitation signal. To minimize the mismatch between the training and generation processes, the LP coefficients in the training step are replaced with those generated by the pre-trained acoustic model as follows:

\hat{e}_n = x_n - \sum_{k=1}^{p} \hat{\alpha}_k x_{n-k}, \quad (3)

where \{\hat{\alpha}_1, \ldots, \hat{\alpha}_p\} denotes the generated LP coefficients. By combining equations (2) and (3), the excitation sequence can be represented as follows:

\hat{e}_n = e_n + e^{\mathrm{am}}_n, \quad (4)

where e^{\mathrm{am}}_n denotes an intermediate prediction error defined as follows:

e^{\mathrm{am}}_n = \sum_{k=1}^{p} (\alpha_k - \hat{\alpha}_k) x_{n-k}. \quad (5)

Using the excitation signal \hat{e}_n as the training target means that it becomes possible to guide the model to learn the distribution of the true excitation signal e_n as well as to compensate for the acoustic model's estimation errors e^{\mathrm{am}}_n. Furthermore, because the training and synthesis processes share the same LP coefficients, it is also possible to minimize any mismatch. This extraction method has been adopted in analysis-by-synthesis speech coding frameworks [17, 18], where the encoder and decoder share the same quantized filter parameters to minimize their mismatch conditions.
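To make the closed-loop target concrete, the following NumPy sketch computes the conventional LP residual of Eq. (2), the MbG target of Eq. (3), and numerically verifies the decomposition of Eqs. (4)-(5). The signal and coefficient values are hypothetical, and a real implementation would update the LP coefficients frame by frame rather than keeping them fixed:

import numpy as np

def lp_residual(x, lpc):
    # e[n] = x[n] - sum_k lpc[k] * x[n-k], with zero-padded history (Eqs. 2/3).
    e = np.copy(x)
    for k in range(1, len(lpc) + 1):
        e[k:] -= lpc[k - 1] * x[:-k]
    return e

x = np.random.randn(16000)                # toy stand-in for a speech signal
alpha = np.array([1.2, -0.5, 0.1])        # "true" LP coefficients (illustrative)
alpha_hat = np.array([1.1, -0.45, 0.12])  # coefficients generated by the acoustic model

e = lp_residual(x, alpha)                 # conventional ExcitNet target (Eq. 2)
e_mbg = lp_residual(x, alpha_hat)         # closed-loop MbG target (Eq. 3)

# Eq. (5): intermediate prediction error caused by the coefficient mismatch.
e_am = np.zeros_like(x)
for k in range(1, len(alpha) + 1):
    e_am[k:] += (alpha[k - 1] - alpha_hat[k - 1]) * x[:-k]

assert np.allclose(e_mbg, e + e_am)       # Eq. (4): e_hat = e + e_am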
Figure 3: Acoustic model consisting of three sub-modules: context analysis, context embedding, and Tacotron decoding.
The merits of the proposed method are presented in Fig. 2, which shows the negative log-likelihood obtained from the training and validation sets. The proposed MbG-ExcitNet model enables a reduction in both training and validation errors compared with a plain ExcitNet approach. It is therefore expected that the proposed method will provide more accurate training and generation results, as discussed further in the following section.
4. Experiments
The experiments used a phonetically and prosodically balanced speech corpus recorded by a professional female Korean speaker. The speech signals were sampled at 24 kHz with 16-bit quantization. In total, 4,408 utterances (7.9 hours) were used for training, 230 utterances (0.4 hours) for validation, and a further 120 utterances (0.2 hours) for testing. The acoustic features were extracted using the improved time-frequency trajectory excitation vocoder at analysis intervals of 5 ms [19]. These features included 40-dimensional line spectral frequencies (LSFs), the fundamental frequency (F0), energy, a voicing flag (v/uv), a 32-dimensional slowly evolving waveform (SEW), and a 4-dimensional rapidly evolving waveform (REW), all of which constituted a 79-dimensional feature vector.
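As a quick sanity check, the per-frame feature dimensionality can be tallied as follows (F0, energy, and the voicing flag are each one-dimensional, which is left implicit in the text):

feature_dims = {"LSF": 40, "F0": 1, "energy": 1, "v/uv": 1, "SEW": 32, "REW": 4}
assert sum(feature_dims.values()) == 79   # 79-dimensional conditioning vector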
Although there are many state-of-the-art acoustic models available, including Tacotron and Transformer [11, 14, 20], we opted to pursue a Tacotron model with a phoneme alignment approach [15] because of its fast and stable generation and competitive synthetic quality. Fig. 3 is a block diagram of the acoustic model, which consists of three sub-modules, namely context analysis, context embedding, and Tacotron decoding.

In the context analysis module, phoneme-level linguistic feature vectors were extracted from the input text. These were composed of 330 binary features for categorical linguistic contexts and 24 features for numerical linguistic contexts. Having input these features, the corresponding phoneme duration was estimated through three fully connected (FC) layers with 1,024, 512, and 256 units, followed by a unidirectional LSTM network with 128 memory blocks. Based on this estimated duration, the phoneme-level linguistic features were then upsampled to the frame level, adding two numerical vectors of phoneme duration and its relative position.

In context embedding, the linguistic features were transformed into high-level context vectors. The module here consisted of three convolution layers with a 10 × 1 kernel and 512 channels per layer, a bi-directional LSTM network with 512 memory blocks, and an FC layer with 512 units (a minimal sketch of this module appears at the end of this description).

We used a Tacotron 2 decoder network to generate the output acoustic features [11]. First, the previously generated acoustic features were fed into two FC layers with 256 units (i.e., the PreNet), and those features and the vectors from the context embedding module were then passed through two uni-directional LSTM layers with 1,024 memory blocks, followed by two projection layers. Finally, to improve generation accuracy, five convolution layers with a 5 × 1 kernel were used as a post-processing network.
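As referenced above, here is a minimal PyTorch sketch of the context-embedding module. The layer sizes follow the text; the 356-dimensional input is an assumption (330 binary + 24 numerical contexts plus the two duration/position values), and the class and variable names are our own:

import torch
import torch.nn as nn

class ContextEmbedding(nn.Module):
    # Three 1-D convolutions -> bi-directional LSTM -> FC projection.
    def __init__(self, in_dim=356, channels=512, kernel_size=10):
        super().__init__()
        layers, ch = [], in_dim
        for _ in range(3):
            layers += [nn.Conv1d(ch, channels, kernel_size, padding="same"), nn.ReLU()]
            ch = channels
        self.convs = nn.Sequential(*layers)
        self.blstm = nn.LSTM(channels, 512, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 512, 512)

    def forward(self, x):                     # x: (batch, frames, in_dim)
        h = self.convs(x.transpose(1, 2))     # convolve along the frame axis
        h, _ = self.blstm(h.transpose(1, 2))  # (batch, frames, 1024)
        return self.fc(h)                     # (batch, frames, 512)

context = ContextEmbedding()
out = context(torch.randn(2, 100, 356))       # -> torch.Size([2, 100, 512])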
Xavier initialization and Adam optimization were used [21, 22]. The learning rate was scheduled to decay from 0.001 to 0.0001 at a rate of 0.33 per 100,000 steps.
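One plausible reading of this schedule is a staircase decay floored at the final value; a small helper expressing that reading:

def learning_rate(step, lr0=1e-3, lr_min=1e-4, decay=0.33, interval=100_000):
    # Multiply by `decay` every `interval` steps, never dropping below `lr_min`.
    return max(lr0 * decay ** (step // interval), lr_min)

# learning_rate(0) -> 1e-3, learning_rate(100_000) -> 3.3e-4,
# learning_rate(300_000) -> 1e-4 (floor reached).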
The architecture of the proposed MbG-ExcitNet comprised three convolutional blocks, each with 10 dilated convolution layers with dilations of 1, 2, 4, and so on, up to 512. The numbers of dilated causal convolution channels and 1 × 1 …
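Assuming the standard WaveNet kernel size of two (the source does not state it), the dilation pattern above implies the following receptive field:

dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, over three blocks
receptive_field = sum(dilations) + 1          # = 3,070 samples (~128 ms at 24 kHz)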
Table 1: TTS naturalness MOS results with 95% confidence intervals with respect to the different vocoding models; the best MOS scores are in bold.

Index    System          MOS
Test 1   WaveNet         3.23 ±
Test 2   ExcitNet        ±
Test 3   G-WaveNet       ±
Test 4   G-ExcitNet      ±
Test 5   MbG-ExcitNet    ±
Test 6   Raw             4.66 ±

In the synthesis step, all of the acoustic feature vectors were predicted by the acoustic model from the given input text. By inputting these features, the MbG-ExcitNet vocoder generated a discrete symbol of the quantized excitation signal, and its dynamic range was recovered via µ-law expansion. Finally, the speech signal was reconstructed by applying the LP synthesis filter to the generated excitation signal (a minimal sketch of this decoding step follows the vocoder list below).

To evaluate the perceptual quality of the proposed system, naturalness MOS tests were performed by asking 13 native Korean speakers to make quality judgments about the synthesized speech samples using the following five responses: 1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; and 5 = Excellent. In total, 20 utterances were randomly selected from the test set and synthesized using the different generation models. In particular, speech samples synthesized by the conventional vocoding methods below were evaluated together to confirm performance differences:

• WaveNet: Plain WaveNet vocoder [3]
• ExcitNet: Plain ExcitNet vocoder [7]
• G-WaveNet: WaveNet vocoder trained with generated acoustic parameters [11]
• G-ExcitNet: ExcitNet vocoder trained with generated acoustic parameters

The G-ExcitNet vocoder was configured similarly to the proposed MbG-ExcitNet, but its target excitation was extracted from the ground-truth spectral parameters.

Table 1 presents the MOS test results for the TTS systems with respect to the different vocoding models, and the analysis can be summarized as follows. First, when training vocoding models using ground-truth acoustic parameters, ExcitNet performed better than WaveNet (Tests 1 and 2). This implies that ExcitNet's adaptive spectral filter is beneficial for reconstructing a more accurate speech signal [7]. Second, training the model with generated parameters provided better perceptual quality than using the ground-truth approach in WaveNet (Tests 1 and 3), but vice versa in ExcitNet (Tests 2 and 4). This result confirms that, in excitation-based methods, the target excitation should be replaced in a way that accounts for the acoustic model's estimation errors. Lastly, the proposed MbG-ExcitNet performed best across the different vocoders (Test 5 versus the others).

Generated audio samples are available at the following URL: https://sewplay.github.io/demos/mbg_excitnet
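As referenced above, here is a minimal NumPy sketch of the decoding step: the generated 8-bit symbols are expanded by the inverse µ-law, and the LP synthesis filter (the inverse of Eq. (2)) reconstructs the waveform. The function names and coefficient values are our own illustrations:

import numpy as np

def mu_law_expand(symbols, mu=255.0, bins=256):
    # Map discrete symbols in [0, bins-1] back to amplitudes in [-1, 1].
    y = 2.0 * symbols / (bins - 1.0) - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def lp_synthesis(e, lpc):
    # x[n] = e[n] + sum_k lpc[k] * x[n-k]; inverse of the LP analysis filter.
    x = np.zeros_like(e)
    for n in range(len(e)):
        x[n] = e[n] + sum(lpc[k] * x[n - k - 1] for k in range(min(len(lpc), n)))
    return x

symbols = np.random.randint(0, 256, size=24000)   # stand-in for vocoder output
excitation = mu_law_expand(symbols.astype(np.float64))
speech = lp_synthesis(excitation, np.array([1.2, -0.5, 0.1]))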
Figure 4: A/B/X preference comparison of MbG-ExcitNet and its initialization-refined version, MbG-ExcitNet*.

Because the MbG training strategy guided the vocoding model to compensate for errors from the acoustic model, it was possible to significantly improve synthesis accuracy. Consequently, the TTS system with the proposed MbG-ExcitNet vocoder achieved a . MOS.

To further verify the effectiveness of the proposed method, we designed additional experiments to refine the initialization of MbG-ExcitNet's model weights. Since the MbG training process utilizes generated spectral parameters and the corresponding excitation signals as the input conditions and target outputs, respectively, it may be difficult to capture the speech signals' original characteristics. We therefore adopted a transfer learning method [23] through which the MbG-ExcitNet was initialized by the plain ExcitNet model, whose own weights were optimized on ground-truth speech spectra and excitations. All weights were then fine-tuned by the MbG framework (a minimal sketch of this scheme follows below). As a result, it was possible to guide the entire training process to learn the characteristics of both the original and the generated speech segments.

Fig. 4 depicts the results of an A/B/X preference test between the proposed MbG-ExcitNet and this initialization-refined version (MbG-ExcitNet*). The setup for this test was the same as for the MOS assessment, except that listeners were asked to rate the quality preference of the synthesized speech samples. The results confirm that the initialization-refined system provided better perceptual quality than the originally proposed MbG-ExcitNet. This confirms that adopting a transfer learning method is advantageous for generating a more natural speech signal in an MbG-structured TTS system.
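As referenced above, a minimal sketch of the two-stage scheme, using a stand-in network and hypothetical file names (the real ExcitNet architecture is identical in both stages, which is what makes the weight transfer possible):

import torch
import torch.nn as nn

# Stand-in vocoder: 79-dim conditioning features in, 256-way µ-law logits out.
vocoder = nn.Sequential(nn.Conv1d(79, 64, kernel_size=3, padding=1),
                        nn.ReLU(),
                        nn.Conv1d(64, 256, kernel_size=1))

# Stage 1: train on ground-truth spectra/excitations (training loop omitted), then save.
torch.save(vocoder.state_dict(), "plain_excitnet.pt")

# Stage 2 (MbG-ExcitNet*): initialize from the plain model and fine-tune all
# weights on generated conditions and closed-loop excitation targets (Eq. 3).
vocoder.load_state_dict(torch.load("plain_excitnet.pt"))
optimizer = torch.optim.Adam(vocoder.parameters(), lr=1e-4)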
5. Conclusions
This paper has proposed a high-quality neural TTS system that incorporates an MbG structure into the ExcitNet vocoder. The MbG-ExcitNet back-end was optimized to learn excitation output distributions while simultaneously compensating for estimation errors from the acoustic model front-end. As such, the proposed method was effective in minimizing the mismatch between the acoustic model and the vocoder. The experimental results verified that a TTS system with the proposed MbG-ExcitNet vocoder performed significantly better than conventional systems with similarly configured WaveNet vocoders. Future research should include extending the framework to speech synthesis systems based on WaveRNN and/or WaveGlow vocoders.
6. Acknowledgements
The authors would like to thank Hyungseob Lim, Kyungguen Byun, Seyun Um, and Suhyeon Oh at DSP&AI Lab., Yonsei University, Seoul, Korea, for their support.

7. References

[1] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013, pp. 7962-7966.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[3] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. INTERSPEECH, 2017, pp. 1118-1122.
[4] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in Proc. ICML, 2018, pp. 2410-2419.
[5] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. ICASSP, 2019, pp. 3617-3621.
[6] L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku, "Speaker-independent raw waveform model for glottal excitation," in Proc. INTERSPEECH, 2018, pp. 2012-2016.
[7] E. Song, K. Byun, and H.-G. Kang, "ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems," in Proc. EUSIPCO, 2019, pp. 1179-1183.
[8] M.-J. Hwang, F. Soong, E. Song, X. Wang, H. Kang, and H.-G. Kang, "LP-WaveNet: Linear prediction-based WaveNet speech synthesis," arXiv preprint arXiv:1811.11913, 2018.
[9] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, 2019, pp. 5891-5895.
[10] M.-J. Hwang, E. Song, R. Yamamoto, F. Soong, and H.-G. Kang, "Improving LPCNet-based text-to-speech with linear prediction-structured mixture density network," in Proc. ICASSP, 2020, pp. 7219-7223.
[11] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779-4783.
[12] L. Juvela, B. Bollepalli, J. Yamagishi, P. Alku et al., "Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system," in Proc. INTERSPEECH, 2017, pp. 1368-1372.
[13] M.-J. Hwang, E. Song, K. Byun, and H.-G. Kang, "Modeling-by-generation-structured noise compensation algorithm for glottal vocoding speech synthesis system," in Proc. ICASSP, 2018, pp. 5669-5673.
[14] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. INTERSPEECH, 2017, pp. 4006-4010.
[15] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," in Proc. ASRU, 2019, pp. 214-221.
[16] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR abs/1609.03499, 2016.
[17] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 3, pp. 247-254, 1979.
[18] T. Drugman, P. Alku, A. Alwan, and B. Yegnanarayana, "Glottal source processing: From analysis to applications," Comput. Speech Lang., vol. 28, no. 5, pp. 1117-1138, 2014.
[19] E. Song, F. K. Soong, and H.-G. Kang, "Effective spectral and excitation modeling techniques for LSTM-RNN-based speech synthesis systems," IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 25, no. 11, pp. 2152-2161, 2017.
[20] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Neural speech synthesis with Transformer network," in Proc. AAAI, 2019, pp. 6706-6713.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249-256.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.