IMPROVED PARALLEL WAVEGAN VOCODER WITH PERCEPTUALLY WEIGHTED SPECTROGRAM LOSS

Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

NAVER Corp., Seongnam, Korea; LINE Corp., Tokyo, Japan; Search Solutions Inc., Seongnam, Korea
ABSTRACT
This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the lightweight convolutional networks can be effectively trained without any distillation process. To further improve the vocoding performance, we propose the application of frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually sensitive errors in the frequency domain; thus, the model is optimized toward reducing auditory noise in the synthesized speech. Subjective listening test results demonstrate that our proposed method achieves 4.21 and 4.26 TTS mean opinion scores for female and male Korean speakers, respectively.
Index Terms — Text-to-speech, speech synthesis, neural vocoder, Parallel WaveGAN
1. INTRODUCTION
Generative models for raw speech waveforms have significantly improved the quality of neural text-to-speech (TTS) systems [1, 2]. Specifically, autoregressive generative models such as WaveNet have successfully replaced the role of traditional parametric vocoders [2–5]. Non-autoregressive versions, including Parallel WaveNet, provide a fast waveform generation method based on a teacher-student framework [6, 7]. In this method, the model is trained using a probability density distillation method in which the knowledge of an autoregressive teacher WaveNet is transferred to an inverse autoregressive flow student model [8].

In our previous work, we introduced generative adversarial network (GAN) training methods to the Parallel WaveNet framework [9], and proposed Parallel WaveGAN by combining the adversarial training with multi-resolution short-time Fourier transform (MR-STFT) criteria [10, 11]. Although it is possible to train GAN-based non-autoregressive models using only an adversarial loss function [12], employing the MR-STFT loss function has been proven to be advantageous for increasing the training efficiency [10, 13, 14]. Furthermore, because the Parallel WaveGAN only trains a WaveNet model without any density distillation, the entire training process becomes much easier than in the conventional methods, and the model can produce natural-sounding speech waveforms with just a small number of parameters.

To further enhance the performance of the Parallel WaveGAN, this paper proposes a spectral-domain perceptual weighting method for optimizing the MR-STFT criteria. A frequency-dependent masking filter is designed to penalize errors near the spectral valleys, which are perceptually sensitive to the human ear [15]. By applying this filter to the STFT loss function calculations in the training step, the network is guided to reduce the noise components in those regions. Consequently, the proposed model generates a more natural voice in comparison to the original Parallel WaveGAN. Our contributions can be summarized as follows:

• We propose a perceptually weighted MR-STFT loss function alongside a conventional adversarial training method. This approach improves the quality of the synthesized speech in the Parallel WaveGAN-based neural TTS system.
• Because the proposed method does not change the network architecture, it maintains the small number of parameters found in the original Parallel WaveGAN and retains its fast inference speed. In particular, the system can generate a 24 kHz speech waveform 50.57 times faster than real-time in a single GPU environment with 1.83 M parameters.
• Consequently, our method achieved mean opinion score (MOS) results of 4.21 and 4.26 for female and male Korean speakers, respectively, in the neural TTS systems.

2. RELATED WORK

The idea of using STFT-based loss functions is not new. In their study of spectrogram inversion, Arık et al. [16] first proposed the spectral convergence and log-scale STFT magnitude losses, and our previous work proposed combining these in a multi-resolution form [9].

Moreover, perceptual noise-shaping filters have significantly improved the quality of synthesized speech in autoregressive WaveNet frameworks [17]. Based on the characteristics of the human auditory system, an external noise-shaping filter is designed to reduce perceptually sensitive noise in the spectral valley regions. This filter acts as a pre-processor in the training step; thus, the WaveNet learns the distribution of the noise-shaped residual signal. In the synthesis step, by applying its inverse filter to the WaveNet's output, the enhanced speech can be reconstructed.

However, it has been shown that the filter is not effective for non-autoregressive generation models, including WaveGlow [18] and the Parallel WaveGAN. One possible reason for this might be that the characteristics of the noise-shaped residual signal are difficult for the non-autoregressive model to capture without previous time-step information. To address this problem, the proposed system applies a frequency-dependent mask to the process of calculating the STFT loss functions. As this method does not change the target speech's distribution, the non-autoregressive WaveNet can be stably optimized, while significantly reducing the auditory noise components.
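As a point of reference, the conventional noise-shaping procedure summarized above can be viewed as simple linear-prediction-based pre- and post-filtering around the vocoder. The following is a minimal sketch of that idea, assuming a time-invariant filter built from averaged LP coefficients; the function names and the `alpha` variable are illustrative, not taken from [17].

```python
import numpy as np
from scipy.signal import lfilter

def make_shaping_filter(alpha):
    """FIR coefficients of W(z) = 1 - sum_k alpha_k z^{-k}.

    alpha: averaged LP coefficients [alpha_1, ..., alpha_p] estimated from the
    training data (assumed to be given).
    """
    return np.concatenate(([1.0], -np.asarray(alpha)))

def to_residual(speech, alpha):
    # Training-time pre-processing: filter the target speech with W(z) so that
    # the vocoder models the noise-shaped residual signal.
    return lfilter(make_shaping_filter(alpha), [1.0], speech)

def to_speech(residual, alpha):
    # Synthesis-time post-processing: apply the inverse filter 1/W(z)
    # to reconstruct the speech waveform from the generated residual.
    return lfilter([1.0], make_shaping_filter(alpha), residual)
```

The proposed method, in contrast, leaves the training target untouched and instead moves the frequency-dependent weighting into the STFT loss, as described in the next section.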
3. PARALLEL WAVEGAN
The Parallel WaveGAN jointly trains a non-causal WaveNet generator, G, and a convolutional neural network (CNN) discriminator, D, to generate a time-domain speech waveform from the corresponding input acoustic parameters. Specifically, the generator learns a distribution of realistic waveforms by trying to deceive the discriminator into recognizing the generated samples as real. The process is performed by minimizing the generator loss as follows:

L_G(G, D) = L_{\mathrm{mr\_stft}}(G) + \lambda_{\mathrm{adv}} L_{\mathrm{adv}}(G, D),  (1)

where L_{\mathrm{mr\_stft}}(G) denotes an MR-STFT loss, which will be discussed below; L_{\mathrm{adv}}(G, D) denotes an adversarial loss; and \lambda_{\mathrm{adv}} denotes the hyperparameter balancing the two loss terms. The adversarial loss is designed based on least-squares GANs [19–22], as follows:

L_{\mathrm{adv}}(G, D) = \mathbb{E}_{z \sim p_z}\left[(1 - D(G(z, h)))^2\right],  (2)

where z, p_z, and h denote the input noise, a Gaussian distribution N(0, I), and the conditional acoustic parameters, respectively. The discriminator is trained to correctly classify the generated sample as fake while classifying the ground truth as real using the following optimization criterion:

L_D(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[(1 - D(x))^2\right] + \mathbb{E}_{z \sim p_z}\left[D(G(z, h))^2\right],  (3)

where x and p_{\mathrm{data}} denote the target speech waveform and its distribution, respectively.

To guarantee the stability of the adversarial training method described above, it is crucial to incorporate an MR-STFT loss function into the generator's optimization process [10]. The MR-STFT loss function in equation (1) is defined in terms of the number of STFT losses, M, as follows:

L_{\mathrm{mr\_stft}}(G) = \frac{1}{M} \sum_{m=1}^{M} L_{\mathrm{stft}}^{(m)}(G),  (4)

where L_{\mathrm{stft}}^{(m)}(G) denotes the mth STFT loss, defined as follows:

L_{\mathrm{stft}}(G) = \mathbb{E}_{x \sim p_{\mathrm{data}},\, \hat{x} \sim p_G}\left[L_{\mathrm{sc}}(x, \hat{x}) + L_{\mathrm{mag}}(x, \hat{x})\right],  (5)

where \hat{x} denotes the generated sample drawn from the generator's distribution, p_G; L_{\mathrm{sc}} and L_{\mathrm{mag}} denote the spectral convergence and log STFT magnitude losses, respectively, which are defined as follows [16]:

L_{\mathrm{sc}}(x, \hat{x}) = \frac{\sqrt{\sum_{t,f} (|X_{t,f}| - |\hat{X}_{t,f}|)^2}}{\sqrt{\sum_{t,f} |X_{t,f}|^2}},  (6)

L_{\mathrm{mag}}(x, \hat{x}) = \frac{\sum_{t,f} \left| \log|X_{t,f}| - \log|\hat{X}_{t,f}| \right|}{T \cdot N},  (7)

where |X_{t,f}| and |\hat{X}_{t,f}| denote the fth STFT magnitude of x and \hat{x} at time frame t, respectively; T and N denote the number of frames and the number of frequency bins, respectively.

To further enhance the performance of the Parallel WaveGAN, this paper proposes to apply a spectral-domain perceptual masking filter to the MR-STFT loss criteria as follows:

L_{\mathrm{sc}}^{w}(x, \hat{x}) = \frac{\sqrt{\sum_{t,f} \left( W_{t,f} (|X_{t,f}| - |\hat{X}_{t,f}|) \right)^2}}{\sqrt{\sum_{t,f} |X_{t,f}|^2}},  (8)

L_{\mathrm{mag}}^{w}(x, \hat{x}) = \frac{\sum_{t,f} \left| W_{t,f} (\log|X_{t,f}| - \log|\hat{X}_{t,f}|) \right|}{T \cdot N},  (9)

where W_{t,f} denotes a weight coefficient of the spectral mask; the construction of W is described below.
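For illustration, the single-resolution STFT losses of equations (6)–(9) can be written as the following PyTorch-style sketch. This is a minimal sketch, not the authors' implementation; the tensor shapes, the small clamping constant, and the broadcasting of the weight matrix W over the batch are assumptions.

```python
import torch

def stft_magnitude(x, fft_size, hop_size, win_size):
    """|STFT| of a batch of waveforms; returns (batch, frames, fft_size // 2 + 1)."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size, window, return_complex=True)
    return spec.abs().transpose(1, 2).clamp(min=1e-7)

def stft_losses(x, x_hat, fft_size=512, hop_size=50, win_size=240, weight=None):
    """Spectral convergence (eq. 6/8) and log STFT magnitude (eq. 7/9) losses.

    `weight` is an optional (frames, bins) perceptual mask W; when given, the
    magnitude and log-magnitude differences are multiplied by W before the norms.
    Its frame and bin counts are assumed to match the STFT configuration.
    """
    mag = stft_magnitude(x, fft_size, hop_size, win_size)
    mag_hat = stft_magnitude(x_hat, fft_size, hop_size, win_size)

    diff = mag - mag_hat
    log_diff = torch.log(mag) - torch.log(mag_hat)
    if weight is not None:
        diff = weight * diff
        log_diff = weight * log_diff

    sc_loss = torch.norm(diff, p="fro") / torch.norm(mag, p="fro")
    mag_loss = log_diff.abs().mean()  # L1 averaged over T * N (and the batch)
    return sc_loss, mag_loss
```

For the multi-resolution loss in equation (4), the same computation would simply be repeated for each of the M STFT configurations and averaged.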
Fig. 1: Magnitude distance (MD) obtained when calculating the spectral convergence: (a) the weight matrix of the spectral mask, (b) the MD before applying the mask (conventional method), and (c) the MD after applying the mask (proposed method).

The weight matrix W is constructed by repeating a time-invariant frequency masking filter along the time axis, whose transfer function is defined as follows:

W(z) = 1 - \sum_{k=1}^{p} \tilde{\alpha}_k z^{-k},  (10)

where \tilde{\alpha}_k denotes the kth linear prediction (LP) coefficient with order p, obtained by averaging all spectra extracted from the training data. As shown in Fig. 1a, the weight matrix of the spectral mask is designed to represent the global characteristics of the spectral formant structure. This enables an emphasis on losses at the frequency regions of the spectral valleys, which are more sensitive to the human ear. When calculating the STFT loss (Fig. 1b), this filter is used to penalize losses in those regions (Fig. 1c). As a result, the training process can guide the model to further reduce the perceptual noise in the synthesized speech. Although the log-scale STFT magnitude loss in equation (7) was designed to fit small amplitude components [16], our preliminary experiments verified that applying the masking filter to this loss was also beneficial to synthetic quality.

The merits of the proposed method are presented in Fig. 2, which shows the log-spectral distance between the original and generated speech signals. The proposed perceptual weighting of the MR-STFT losses enables an accurate estimation of speech spectra, and it is therefore expected to provide more accurate training and generation results, as discussed further in the following section.
Fig. 2: Log-spectral distance (LSD; dB) between the original and generated speech signals.
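The time-invariant masking filter of equation (10) can be sketched as follows, assuming the averaged LP coefficients are already available; the normalization of the magnitude response to the range 0.5–1.0 anticipates the experimental setting described in Section 4, and the function and argument names are hypothetical.

```python
import numpy as np
from scipy.signal import freqz

def build_weight_matrix(alpha, n_fft, n_frames, w_min=0.5, w_max=1.0):
    """Frequency-dependent weight matrix W (eq. 10), tiled along the time axis.

    alpha: averaged LP coefficients [alpha_1, ..., alpha_p] from the training data.
    """
    # Magnitude response of W(z) = 1 - sum_k alpha_k z^{-k} on n_fft // 2 + 1 bins.
    b = np.concatenate(([1.0], -np.asarray(alpha)))
    _, h = freqz(b, [1.0], worN=n_fft // 2 + 1)
    w = np.abs(h)

    # Normalize to [w_min, w_max] for stable convergence, as done in the experiments.
    w = (w - w.min()) / (w.max() - w.min()) * (w_max - w_min) + w_min

    # Repeat the time-invariant response along the time axis: shape (n_frames, bins).
    return np.tile(w, (n_frames, 1))
```

Because W(z) is the inverse filter of the average spectral envelope, its magnitude response is small near the formant peaks and large in the spectral valleys, which yields the valley-emphasizing mask illustrated in Fig. 1a.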
Table 1: Utterances in the speech sets by the Korean male (KRM) and Korean female (KRF) speakers (SPK).

SPK   Training         Validation     Testing
KRF   5,085 (5.5 h)    360 (0.4 h)    180 (0.2 h)
KRM   5,382 (7.4 h)    290 (0.4 h)    140 (0.2 h)
4. EXPERIMENTS

4.1. Experimental setup
The experiments used two phonetically and prosodically rich speech corpora recorded by Korean male and female professional speakers. The speech signals were sampled at 24 kHz, and each sample was quantized with 16 bits. Table 1 shows the number of utterances in each set. The acoustic features were extracted using an improved time-frequency trajectory excitation vocoder at analysis intervals of 5 ms [23]. These features included 40-dimensional line spectral frequencies (LSFs), the fundamental frequency, energy, a voicing flag, a 32-dimensional slowly evolving waveform, and a 4-dimensional rapidly evolving waveform, all of which constituted a 79-dimensional feature vector.
Although there are many state-of-the-art acoustic architectures available [24–26], we used a Tacotron model with phoneme alignment [27, 28] for its fast and stable generation and competitive synthesis quality. The left section of Fig. 3 presents the acoustic model, which consists of three sub-modules, namely, context analysis, context embedding, and Tacotron decoding.

Fig. 3: Block diagram of the TTS framework.

In the context analysis module, a grapheme-to-phoneme converter was applied to the input text following the Korean standard pronunciation grammar, and then phoneme-level feature vectors were extracted by the internal context information-labeling program. These were composed of 330 binary features for categorical linguistic contexts and 24 features for numerical linguistic contexts. By inputting those linguistic features, the corresponding phoneme duration was estimated through three fully connected (FC) layers with 1,024, 512, and 256 units followed by a unidirectional long short-term memory (LSTM) network with 128 memory blocks. Based on this estimated duration, the phoneme-level linguistic features were then up-sampled to the frame level by adding the two numerical vectors of phoneme duration and relative position, as illustrated in the sketch below.

In the context embedding module, the linguistic features are transformed into high-level context vectors. The module in this experiment consisted of three convolution layers. Xavier initialization [29] and Adam optimization [30] were used. The learning rate was scheduled to decay from 0.001 to 0.0001 with a decaying rate of 0.33 per 100 K steps.
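The duration-based up-sampling described above can be sketched as follows. This is an illustrative sketch only; the exact definitions of the appended duration and relative-position features are assumptions, and the function name is hypothetical.

```python
import numpy as np

def upsample_to_frame_level(phoneme_feats, durations):
    """Upsample phoneme-level linguistic features to the frame level.

    phoneme_feats: (num_phonemes, feat_dim) linguistic feature vectors.
    durations: per-phoneme durations in frames (assumed to be >= 1).
    Two numerical values are appended to each frame: the phoneme duration
    and the relative position of the frame within the phoneme.
    """
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        for i in range(int(dur)):
            rel_pos = (i + 1) / float(dur)  # relative position within the phoneme
            frames.append(np.concatenate([feat, [dur, rel_pos]]))
    return np.stack(frames)  # (num_frames, feat_dim + 2)
```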
Table 2: Vocoding model details, including size and inference speed. The inference speed k indicates that a system was able to generate waveforms k times faster than real-time. This evaluation was conducted on a server with a single NVIDIA Tesla V100 GPU.

System       Model                    MR-STFT loss   Perceptual weighting   Noise shaping   Number of layers   Model size   Inference speed
Baseline 1   WaveNet                  -              -                      -               24                 3.71 M       0.34 × 10⁻²
Baseline 2   WaveNet + NS             -              -                      Yes             24                 3.81 M       0.34 × 10⁻²
Baseline 3   Parallel WaveGAN         Yes            -                      -               30                 1.83 M       50.57
Baseline 4   Parallel WaveGAN + NS    Yes            -                      Yes             30                 1.83 M       47.70
Proposal     Parallel WaveGAN + PW    Yes            Yes                    -               30                 1.83 M       50.57

Table 2 presents details of the vocoding models, including their size and inference speed. As baseline systems, we used two autoregressive WaveNet vocoders, namely, a plain WaveNet (Baseline 1) [3] and a WaveNet with the noise-shaping (NS) method (Baseline 2) [17]. We adopted continuous Gaussian output distributions for both baseline systems [7] instead of categorical distributions. These two approaches used the same network architecture but differed in the target output: the plain WaveNet system was designed to predict speech signals, whereas the latter method was designed to predict the noise-shaped residual signals. Note that a time-invariant noise-shaping filter was obtained by averaging all spectra extracted from the training data. This external filter was used to extract the residual signal before the training process, and its inverse filter was applied to reconstruct the speech signal in the synthesis step.

The WaveNet systems consisted of 24 layers of dilated residual convolution blocks with four dilation cycles. There were 128 residual and skip channels, and the filter size was set to three. The model was trained for 1 M steps with a RAdam optimizer. The learning rate was set to 0.001, and it was reduced by half every 200 K steps. The minibatch size was set to eight, and each audio clip was set to 12 K time samples (0.5 seconds).

The experiment involved three Parallel WaveGAN systems, namely, the plain Parallel WaveGAN (Baseline 3) [10], a Parallel WaveGAN with the same noise-shaping method as before (Baseline 4), and the proposed method with the perceptually weighted (PW) criteria (Proposal). All had the same network architecture, consisting of 30 layers of dilated residual convolution blocks with three exponentially increasing dilation cycles. The number of residual and skip channels was set to 64, and the convolution filter size was three. The discriminator consisted of 10 layers of non-causal dilated 1-D convolutions with leaky ReLU activation functions. The strides were set to 1, and linearly increasing dilations were applied to the 1-D convolutions, except for the first and last layers, from 1 to 8. The number of channels and the filter size were the same as in the generator. We applied weight normalization to all convolutional layers of both the generator and the discriminator [31].

The MR-STFT loss was calculated by summing the three STFT losses shown in Table 3, which follow the definitions in the original version [10]. In the proposed method, to obtain the time-invariant masking filter in equation (10), all the LSFs (p = 40) collected from the training data were averaged and converted to the corresponding LP coefficients [32]. For stable convergence, the masking filter's magnitude response was normalized to a range from 0.5 to 1.0 before applying it to the MR-STFT loss. The discriminator loss was computed as the average of the discriminator's per-time-step scalar predictions. The value of the hyperparameter λ_adv in equation (1) was chosen to be 4.0. The models were trained for 400 K steps with RAdam optimization to stabilize training [33]. The discriminator was fixed for the first 100 K steps, and both the generator and discriminator were jointly trained afterwards. The minibatch size was set to 8, and the length of each audio clip was set to 24 K time samples (1.0 second). The initial learning rate was set to 0.0001 for the generator and 0.00005 for the discriminator. The learning rate was reduced by half every 200 K steps.

Table 3: Details of the MR-STFT loss calculations. A Hanning window was applied before the FFT process.

STFT loss    FFT size   Window size    Frame shift
L(1)_stft    512        240 (10 ms)    50 (≈ 2 ms)
L(2)_stft
L(3)_stft

Across all vocoding models, the input auxiliary features were up-sampled by nearest-neighbor up-sampling followed by 2-D convolutions so that the time resolution of the auxiliary features matched the sampling rate of the speech waveforms [9, 34]. In the synthesis step, the acoustic feature vectors were predicted by the acoustic model from the given input text. To enhance spectral clarity, an LSF-sharpening filter was applied to the spectral parameters [23]. Using these features as the conditional inputs, the vocoding models such as WaveNet and Parallel WaveGAN generated the corresponding time sequences of the waveforms.
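Putting the pieces together, a sketch of the generator objective with the settings described in this section (λ_adv = 4.0, the discriminator fixed for the first 100 K steps, and the perceptually weighted MR-STFT loss) might look as follows. It reuses the hypothetical stft_losses and build_weight_matrix helpers sketched earlier; only the first STFT resolution of Table 3 is filled in, and the weight matrices are assumed to be built per resolution and converted to tensors.

```python
import torch

LAMBDA_ADV = 4.0
DISC_START_STEP = 100_000  # generator-only training for the first 100 K steps

# Only the first resolution is recoverable from Table 3; the remaining two
# follow the original Parallel WaveGAN configuration [10].
STFT_CONFIGS = [dict(fft_size=512, hop_size=50, win_size=240)]  # + two more resolutions

def generator_loss(x, x_hat, discriminator, weights, step):
    """L_G = L_mr_stft + lambda_adv * L_adv (eq. 1), with perceptual weighting.

    `weights` maps each STFT configuration index to a (frames, bins) tensor W
    built for that resolution (e.g., with build_weight_matrix).
    """
    # Perceptually weighted MR-STFT loss (eqs. 4, 8, 9), averaged over resolutions.
    mr_stft = 0.0
    for i, cfg in enumerate(STFT_CONFIGS):
        sc, mag = stft_losses(x, x_hat, weight=weights[i], **cfg)
        mr_stft = mr_stft + sc + mag
    mr_stft = mr_stft / len(STFT_CONFIGS)

    loss = mr_stft
    if step >= DISC_START_STEP:
        # Least-squares adversarial loss (eq. 2); the discriminator produces
        # per-time-step scalar predictions, which are averaged.
        adv_loss = torch.mean((1.0 - discriminator(x_hat)) ** 2)
        loss = loss + LAMBDA_ADV * adv_loss
    return loss
```

The discriminator loss of equation (3) would be formed analogously once the adversarial phase begins.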
Table 4: Naturalness MOS test results with 95% confidence intervals for the TTS systems with respect to the different vocoding models. The MOS results for the proposed system are in bold font. KRF and KRM denote the Korean female and male speakers, respectively.
Index    Model                    KRF     KRM
Test 1   WaveNet                  3.64
Test 2   WaveNet + NS
Test 3   Parallel WaveGAN
Test 4   Parallel WaveGAN + NS
Test 5   Parallel WaveGAN + PW    4.21    4.26
Test 6   Raw                      4.64

Naturalness MOS tests were conducted to evaluate the perceptual quality of the proposed system. Twenty native Korean speakers were asked to make quality judgments about the synthesized speech samples using the five following possible responses: 1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; and 5 = Excellent. In total, 30 utterances were randomly selected from the test set and synthesized using the different generation models. Generated audio samples are available at the following URL: https://sewplay.github.io/demos/wavegan-pwsl

Table 4 presents the MOS test results for the TTS systems with respect to the different vocoding models, and the analysis can be summarized as follows. First, in the systems with autoregressive WaveNet vocoders, applying the noise-shaping filter performed significantly better than the plain system (Tests 1 and 2). This confirms that reducing auditory noise in the spectral valley regions was beneficial to perceptual quality. However, the effectiveness of the noise-shaping filter was not evident for the Parallel WaveGAN systems (Tests 3 and 4). Since the training and generation processes are both non-autoregressive, it might be that the characteristics of a noise-shaped target signal were difficult for the model to capture without previous time-step information. Second, the system combining Parallel WaveGAN with the proposed perceptually weighted MR-STFT loss function demonstrated improved quality of synthesized speech (Tests 3 and 5). Because the weighting helped the model reduce generation errors in the spectral valleys, and because the adversarial training method helped capture the characteristics of realistic speech waveforms, the system was able to generate a natural voice within a non-autoregressive framework while providing inference approximately 14.87 K times faster than the best autoregressive model (Test 2). Consequently, the TTS system with the proposed Parallel WaveGAN vocoder achieved MOS results of 4.21 and 4.26 for the female and male speakers, respectively.

5. CONCLUSIONS

This paper proposed a spectral-domain perceptual weighting technique for Parallel WaveGAN-based TTS systems. A frequency-dependent masking filter was applied to the MR-STFT loss function, enabling the system to penalize errors near the spectral valleys. As a result, the generation errors in those frequency regions were reduced, which improved the quality of the synthesized speech. The experimental results verified that a TTS system with the proposed Parallel WaveGAN vocoder performs better than systems with conventional methods. Future research includes further improving the Parallel WaveGAN's perceptual quality by replacing the time-invariant spectral masking filter with a signal-dependent adaptive predictor.
6. REFERENCES

[1] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013, pp. 7962–7966.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[3] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. INTERSPEECH, 2017, pp. 1118–1122.
[4] M.-J. Hwang, F. Soong, E. Song, X. Wang, H. Kang, and H.-G. Kang, "LP-WaveNet: Linear prediction-based WaveNet speech synthesis," arXiv preprint arXiv:1811.11913, 2018.
[5] E. Song, K. Byun, and H.-G. Kang, "ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems," in Proc. EUSIPCO, 2019, pp. 1–5.
[6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018, pp. 3915–3923.
[7] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
[8] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Proc. NIPS, 2016, pp. 4743–4751.
[9] R. Yamamoto, E. Song, and J.-M. Kim, "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation," in Proc. INTERSPEECH, 2019, pp. 699–703.
[10] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.
[11] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, "Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators," arXiv preprint arXiv:2010.14151, 2020.
[12] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881–14892.
[13] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," arXiv preprint arXiv:2005.05106, 2020.
[14] J. Yang, J. Lee, Y. Kim, H.-Y. Cho, and I. Kim, "VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network," in Proc. INTERSPEECH, 2020, pp. 200–204.
[15] M. R. Schroeder, B. S. Atal, and J. Hall, "Optimizing digital speech coders by exploiting masking properties of the human ear," Journal of the Acoustical Society of America, vol. 66, no. 6, pp. 1647–1652, 1979.
[16] S. Ö. Arık, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2019.
[17] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, "An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation," in Proc. ICASSP, 2018, pp. 5664–5668.
[18] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders," in Proc. INTERSPEECH, 2019, pp. 1308–1312.
[19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794–2802.
[20] Q. Tian, X. Wan, and S. Liu, "Generative adversarial network based speaker adaptation for high fidelity WaveNet vocoder," in Proc. SSW, 2019, pp. 19–23.
[21] B. Bollepalli, L. Juvela, and P. Alku, "Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis," in Proc. INTERSPEECH, 2017, pp. 3394–3398.
[22] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. INTERSPEECH, 2017, pp. 3642–3646.
[23] E. Song, F. K. Soong, and H.-G. Kang, "Effective spectral and excitation modeling techniques for LSTM-RNN-based speech synthesis systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2152–2161, 2017.
[24] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. INTERSPEECH, 2017, pp. 4006–4010.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[26] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. T. Zhou, "Neural speech synthesis with Transformer network," in Proc. AAAI, 2019, pp. 6706–6713.
[27] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," in Proc. ASRU, 2019, pp. 214–221.
[28] E. Song, M.-J. Hwang, R. Yamamoto, J.-S. Kim, O. Kwon, and J.-M. Kim, "Neural text-to-speech with a modeling-by-generation excitation vocoder," in Proc. INTERSPEECH, 2020, pp. 3570–3574.
[29] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249–256.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[31] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. NIPS, 2016, pp. 901–909.
[32] F. Soong and B. Juang, "Line spectrum pair (LSP) and speech data compression," in Proc. ICASSP, 1984, pp. 37–40.
[33] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," arXiv preprint arXiv:1908.03265, 2019.
[34] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016.