IMPROVED PARALLEL WAVEGAN VOCODER WITH PERCEPTUALLY WEIGHTED SPECTROGRAM LOSS

Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

NAVER Corp., Seongnam, Korea; LINE Corp., Tokyo, Japan; Search Solutions Inc., Seongnam, Korea
ABSTRACT
This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the lightweight convolutional networks can be effectively trained without any distillation process. To further improve the vocoding performance, we propose the application of frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually sensitive errors in the frequency domain; thus, the model is optimized toward reducing auditory noise in the synthesized speech. Subjective listening test results demonstrate that our proposed method achieves 4.21 and 4.26 TTS mean opinion scores for female and male Korean speakers, respectively.
Index Terms — Text-to-speech, speech synthesis, neural vocoder, Parallel WaveGAN
1. INTRODUCTION
Generative models for raw speech waveforms have significantly improved the quality of neural text-to-speech (TTS) systems [1, 2]. Specifically, autoregressive generative models such as WaveNet have successfully replaced the role of traditional parametric vocoders [2–5]. Non-autoregressive versions, including Parallel WaveNet, provide a fast waveform generation method based on a teacher-student framework [6, 7]. In this method, the model is trained using a probability density distillation method in which the knowledge of an autoregressive teacher WaveNet is transferred to an inverse autoregressive flow student model [8].

In our previous work, we introduced generative adversarial network (GAN) training methods to the Parallel WaveNet framework [9], and proposed Parallel WaveGAN by combining the adversarial training with multi-resolution short-time Fourier transform (MR-STFT) criteria [10, 11]. Although it is possible to train GAN-based non-autoregressive models using only an adversarial loss function [12], employing the MR-STFT loss function has been proven to be advantageous for increasing the training efficiency [10, 13, 14]. Furthermore, because the Parallel WaveGAN only trains a WaveNet model without any density distillation, the entire training process becomes much easier than in the conventional methods, and the model can produce natural-sounding speech waveforms with just a small number of parameters.

To further enhance the performance of the Parallel WaveGAN, this paper proposes a spectral-domain perceptual weighting method for optimizing the MR-STFT criteria. A frequency-dependent masking filter is designed to penalize errors near the spectral valleys, which are perceptually sensitive to the human ear [15]. By applying this filter to the STFT loss function calculations in the training step, the network is guided to reduce the noise components in those regions. Consequently, the proposed model generates a more natural voice in comparison to the original Parallel WaveGAN. Our contributions can be summarized as follows:

• We propose a perceptually weighted MR-STFT loss function alongside a conventional adversarial training method. This approach improves the quality of the synthesized speech in the Parallel WaveGAN-based neural TTS system.
• Because the proposed method does not change the network architecture, it maintains the small number of parameters found in the original Parallel WaveGAN and retains its fast inference speed. In particular, the system can generate a 24 kHz speech waveform 50.57 times faster than real-time in a single GPU environment with 1.83 M parameters.
• Consequently, our method achieved mean opinion score (MOS) results of 4.21 and 4.26 for female and male Korean speakers, respectively, in the neural TTS systems.

2. RELATED WORK

The idea of using STFT-based loss functions is not new. In their study of spectrogram inversion, Arık et al. [16] first proposed the spectral convergence and log-scale STFT magnitude losses, and our previous work proposed combining these in a multi-resolution form [9].

Moreover, perceptual noise-shaping filters have significantly improved the quality of synthesized speech in autoregressive WaveNet frameworks [17]. Based on the characteristics of the human auditory system, an external noise-shaping filter is designed to reduce perceptually sensitive noise in the spectral valley regions. This filter acts as a pre-processor in the training step; thus, the WaveNet learns the distribution of the noise-shaped residual signal. In the synthesis step, by applying its inverse filter to the WaveNet's output, the enhanced speech can be reconstructed.

However, it has been shown that the filter is not effective for non-autoregressive generation models, including WaveGlow [18] and the Parallel WaveGAN. One possible reason for this might be that the characteristics of the noise-shaped residual signal are difficult for the non-autoregressive model to capture without previous time-step information. To address this problem, the proposed system applies a frequency-dependent mask to the process of calculating the STFT loss functions. As this method does not change the target speech's distribution, the non-autoregressive WaveNet can be stably optimized, while significantly reducing the auditory noise components.
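As a point of reference, the conventional noise-shaping procedure summarized above can be viewed as simple linear-prediction-based pre- and post-filtering around the vocoder. The following is a minimal sketch of that idea, assuming a time-invariant filter built from averaged LP coefficients; the function names and the `alpha` variable are illustrative, not taken from [17].

```python
import numpy as np
from scipy.signal import lfilter

def make_shaping_filter(alpha):
    """FIR coefficients of W(z) = 1 - sum_k alpha_k z^{-k}.

    alpha: averaged LP coefficients [alpha_1, ..., alpha_p] estimated from the
    training data (assumed to be given).
    """
    return np.concatenate(([1.0], -np.asarray(alpha)))

def to_residual(speech, alpha):
    # Training-time pre-processing: filter the target speech with W(z) so that
    # the vocoder models the noise-shaped residual signal.
    return lfilter(make_shaping_filter(alpha), [1.0], speech)

def to_speech(residual, alpha):
    # Synthesis-time post-processing: apply the inverse filter 1/W(z)
    # to reconstruct the speech waveform from the generated residual.
    return lfilter([1.0], make_shaping_filter(alpha), residual)
```

The proposed method, in contrast, leaves the training target untouched and instead moves the frequency-dependent weighting into the STFT loss, as described in the next section.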
3. PARALLEL WAVEGAN
The Parallel WaveGAN jointly trains a non-causal WaveNet generator, G, and a convolutional neural network (CNN) discriminator, D, to generate a time-domain speech waveform from the corresponding input acoustic parameters. Specifically, the generator learns a distribution of realistic waveforms by trying to deceive the discriminator into recognizing the generated samples as real. The process is performed by minimizing the generator loss as follows:

L_G(G, D) = L_{\mathrm{mr\_stft}}(G) + \lambda_{\mathrm{adv}} L_{\mathrm{adv}}(G, D),  (1)

where L_{\mathrm{mr\_stft}}(G) denotes an MR-STFT loss, which will be discussed below; L_{\mathrm{adv}}(G, D) denotes an adversarial loss; and \lambda_{\mathrm{adv}} denotes the hyperparameter balancing the two loss terms. The adversarial loss is designed based on least-squares GANs [19–22], as follows:

L_{\mathrm{adv}}(G, D) = \mathbb{E}_{z \sim p_z}\left[(1 - D(G(z, h)))^2\right],  (2)

where z, p_z, and h denote the input noise, a Gaussian distribution N(0, I), and the conditional acoustic parameters, respectively. The discriminator is trained to correctly classify the generated sample as fake while classifying the ground truth as real using the following optimization criterion:

L_D(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[(1 - D(x))^2\right] + \mathbb{E}_{z \sim p_z}\left[D(G(z, h))^2\right],  (3)

where x and p_{\mathrm{data}} denote the target speech waveform and its distribution, respectively.

To guarantee the stability of the adversarial training method described above, it is crucial to incorporate an MR-STFT loss function into the generator's optimization process [10]. The MR-STFT loss function in equation (1) is defined in terms of the number of STFT losses, M, as follows:

L_{\mathrm{mr\_stft}}(G) = \frac{1}{M} \sum_{m=1}^{M} L_{\mathrm{stft}}^{(m)}(G),  (4)

where L_{\mathrm{stft}}^{(m)}(G) denotes the mth STFT loss, defined as follows:

L_{\mathrm{stft}}(G) = \mathbb{E}_{x \sim p_{\mathrm{data}},\, \hat{x} \sim p_G}\left[L_{\mathrm{sc}}(x, \hat{x}) + L_{\mathrm{mag}}(x, \hat{x})\right],  (5)

where \hat{x} denotes the generated sample drawn from the generator's distribution, p_G; L_{\mathrm{sc}} and L_{\mathrm{mag}} denote the spectral convergence and log STFT magnitude losses, respectively, which are defined as follows [16]:

L_{\mathrm{sc}}(x, \hat{x}) = \frac{\sqrt{\sum_{t,f} (|X_{t,f}| - |\hat{X}_{t,f}|)^2}}{\sqrt{\sum_{t,f} |X_{t,f}|^2}},  (6)

L_{\mathrm{mag}}(x, \hat{x}) = \frac{\sum_{t,f} \left| \log|X_{t,f}| - \log|\hat{X}_{t,f}| \right|}{T \cdot N},  (7)

where |X_{t,f}| and |\hat{X}_{t,f}| denote the fth STFT magnitude of x and \hat{x} at time frame t, respectively; T and N denote the number of frames and the number of frequency bins, respectively.

To further enhance the performance of the Parallel WaveGAN, this paper proposes to apply a spectral-domain perceptual masking filter to the MR-STFT loss criteria as follows:

L_{\mathrm{sc}}^{w}(x, \hat{x}) = \frac{\sqrt{\sum_{t,f} \left( W_{t,f} (|X_{t,f}| - |\hat{X}_{t,f}|) \right)^2}}{\sqrt{\sum_{t,f} |X_{t,f}|^2}},  (8)

L_{\mathrm{mag}}^{w}(x, \hat{x}) = \frac{\sum_{t,f} \left| W_{t,f} (\log|X_{t,f}| - \log|\hat{X}_{t,f}|) \right|}{T \cdot N},  (9)

where W_{t,f} denotes a weight coefficient of the spectral mask; the construction of W is described below.
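For illustration, the single-resolution STFT losses of equations (6)–(9) can be written as the following PyTorch-style sketch. This is a minimal sketch, not the authors' implementation; the tensor shapes, the small clamping constant, and the broadcasting of the weight matrix W over the batch are assumptions.

```python
import torch

def stft_magnitude(x, fft_size, hop_size, win_size):
    """|STFT| of a batch of waveforms; returns (batch, frames, fft_size // 2 + 1)."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size, window, return_complex=True)
    return spec.abs().transpose(1, 2).clamp(min=1e-7)

def stft_losses(x, x_hat, fft_size=512, hop_size=50, win_size=240, weight=None):
    """Spectral convergence (eq. 6/8) and log STFT magnitude (eq. 7/9) losses.

    `weight` is an optional (frames, bins) perceptual mask W; when given, the
    magnitude and log-magnitude differences are multiplied by W before the norms.
    Its frame and bin counts are assumed to match the STFT configuration.
    """
    mag = stft_magnitude(x, fft_size, hop_size, win_size)
    mag_hat = stft_magnitude(x_hat, fft_size, hop_size, win_size)

    diff = mag - mag_hat
    log_diff = torch.log(mag) - torch.log(mag_hat)
    if weight is not None:
        diff = weight * diff
        log_diff = weight * log_diff

    sc_loss = torch.norm(diff, p="fro") / torch.norm(mag, p="fro")
    mag_loss = log_diff.abs().mean()  # L1 averaged over T * N (and the batch)
    return sc_loss, mag_loss
```

For the multi-resolution loss in equation (4), the same computation would simply be repeated for each of the M STFT configurations and averaged.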
Fig. 1: Magnitude distance (MD) obtained when calculating the spectral convergence: (a) the weight matrix of the spectral mask, (b) the MD before applying the mask (conventional method), and (c) the MD after applying the mask (proposed method).

The weight matrix W is constructed by repeating a time-invariant frequency masking filter along the time axis, whose transfer function is defined as follows:

W(z) = 1 - \sum_{k=1}^{p} \tilde{\alpha}_k z^{-k},  (10)

where \tilde{\alpha}_k denotes the kth linear prediction (LP) coefficient with order p, obtained by averaging all spectra extracted from the training data. As shown in Fig. 1a, the weight matrix of the spectral mask is designed to represent the global characteristics of the spectral formant structure. This enables an emphasis on losses at the frequency regions of the spectral valleys, which are more sensitive to the human ear. When calculating the STFT loss (Fig. 1b), this filter is used to penalize losses in those regions (Fig. 1c). As a result, the training process can guide the model to further reduce the perceptual noise in the synthesized speech. Although the log-scale STFT magnitude loss in equation (7) was designed to fit small amplitude components [16], our preliminary experiments verified that applying the masking filter to this loss was also beneficial to synthetic quality.

The merits of the proposed method are presented in Fig. 2, which shows the log-spectral distance between the original and generated speech signals. The proposed perceptual weighting of the MR-STFT losses enables an accurate estimation of speech spectra, and it is therefore expected to provide more accurate training and generation results, as discussed further in the following section.
Fig. 2: Log-spectral distance (LSD; dB) between the original and generated speech signals.
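The time-invariant masking filter of equation (10) can be sketched as follows, assuming the averaged LP coefficients are already available; the normalization of the magnitude response to the range 0.5–1.0 anticipates the experimental setting described in Section 4, and the function and argument names are hypothetical.

```python
import numpy as np
from scipy.signal import freqz

def build_weight_matrix(alpha, n_fft, n_frames, w_min=0.5, w_max=1.0):
    """Frequency-dependent weight matrix W (eq. 10), tiled along the time axis.

    alpha: averaged LP coefficients [alpha_1, ..., alpha_p] from the training data.
    """
    # Magnitude response of W(z) = 1 - sum_k alpha_k z^{-k} on n_fft // 2 + 1 bins.
    b = np.concatenate(([1.0], -np.asarray(alpha)))
    _, h = freqz(b, [1.0], worN=n_fft // 2 + 1)
    w = np.abs(h)

    # Normalize to [w_min, w_max] for stable convergence, as done in the experiments.
    w = (w - w.min()) / (w.max() - w.min()) * (w_max - w_min) + w_min

    # Repeat the time-invariant response along the time axis: shape (n_frames, bins).
    return np.tile(w, (n_frames, 1))
```

Because W(z) is the inverse filter of the average spectral envelope, its magnitude response is small near the formant peaks and large in the spectral valleys, which yields the valley-emphasizing mask illustrated in Fig. 1a.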
Table 1: Utterances in the speech sets by the Korean male (KRM) and Korean female (KRF) speakers (SPK).

SPK   Training         Validation     Testing
KRF   5,085 (5.5 h)    360 (0.4 h)    180 (0.2 h)
KRM   5,382 (7.4 h)    290 (0.4 h)    140 (0.2 h)
4. EXPERIMENTS

4.1. Experimental setup
The experiments used two phonetically and prosodically rich speech corpora recorded by Korean male and female professional speakers. The speech signals were sampled at 24 kHz, and each sample was quantized with 16 bits. Table 1 shows the number of utterances in each set. The acoustic features were extracted using an improved time-frequency trajectory excitation vocoder at analysis intervals of 5 ms [23]. These features included 40-dimensional line spectral frequencies (LSFs), the fundamental frequency, energy, a voicing flag, a 32-dimensional slowly evolving waveform, and a 4-dimensional rapidly evolving waveform, all of which constituted a 79-dimensional feature vector.
Although there are many state-of-the-art acoustic architectures available [24–26], we used a Tacotron model with phoneme alignment [27, 28] for its fast and stable generation and competitive synthesis quality. The left section of Fig. 3 presents the acoustic model, which consists of three sub-modules, namely, context analysis, context embedding, and Tacotron decoding.

Fig. 3: Block diagram of the TTS framework.

In the context analysis module, a grapheme-to-phoneme converter was applied to the input text following the Korean standard pronunciation grammar, and then phoneme-level feature vectors were extracted by the internal context information-labeling program. These were composed of 330 binary features for categorical linguistic contexts and 24 features for numerical linguistic contexts. By inputting those linguistic features, the corresponding phoneme duration was estimated through three fully connected (FC) layers with 1,024, 512, and 256 units followed by a unidirectional long short-term memory (LSTM) network with 128 memory blocks. Based on this estimated duration, the phoneme-level linguistic features were then up-sampled to the frame level by adding the two numerical vectors of phoneme duration and relative position, as illustrated in the sketch below.

In the context embedding module, the linguistic features are transformed into high-level context vectors. The module in this experiment consisted of three convolution layers. Xavier initialization [29] and Adam optimization [30] were used. The learning rate was scheduled to decay from 0.001 to 0.0001 with a decaying rate of 0.33 per 100 K steps.
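The duration-based up-sampling described above can be sketched as follows. This is an illustrative sketch only; the exact definitions of the appended duration and relative-position features are assumptions, and the function name is hypothetical.

```python
import numpy as np

def upsample_to_frame_level(phoneme_feats, durations):
    """Upsample phoneme-level linguistic features to the frame level.

    phoneme_feats: (num_phonemes, feat_dim) linguistic feature vectors.
    durations: per-phoneme durations in frames (assumed to be >= 1).
    Two numerical values are appended to each frame: the phoneme duration
    and the relative position of the frame within the phoneme.
    """
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        for i in range(int(dur)):
            rel_pos = (i + 1) / float(dur)  # relative position within the phoneme
            frames.append(np.concatenate([feat, [dur, rel_pos]]))
    return np.stack(frames)  # (num_frames, feat_dim + 2)
```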
Table 2: Vocoding model details, including size and inference speed. The inference speed k indicates that a system was able to generate waveforms k times faster than real-time. This evaluation was conducted on a server with a single NVIDIA Tesla V100 GPU.

System       Model                    MR-STFT loss   Perceptual weighting   Noise shaping   Number of layers   Model size   Inference speed
Baseline 1   WaveNet                  -              -                      -               24                 3.71 M       0.34 × 10⁻²
Baseline 2   WaveNet + NS             -              -                      Yes             24                 3.81 M       0.34 × 10⁻²
Baseline 3   Parallel WaveGAN         Yes            -                      -               30                 1.83 M       50.57
Baseline 4   Parallel WaveGAN + NS    Yes            -                      Yes             30                 1.83 M       47.70
Proposal     Parallel WaveGAN + PW    Yes            Yes                    -               30                 1.83 M       50.57

Table 2 presents details of the vocoding models, including their size and inference speed. As baseline systems, we used two autoregressive WaveNet vocoders, namely, a plain WaveNet (Baseline 1) [3] and a WaveNet with the noise-shaping (NS) method (Baseline 2) [17]. We adopted continuous Gaussian output distributions for both baseline systems [7] instead of categorical distributions. These two approaches used the same network architecture but differed in the target output: the plain WaveNet system was designed to predict speech signals, whereas the latter method was designed to predict the noise-shaped residual signals. Note that a time-invariant noise-shaping filter was obtained by averaging all spectra extracted from the training data. This external filter was used to extract the residual signal before the training process, and its inverse filter was applied to reconstruct the speech signal in the synthesis step.

The WaveNet systems consisted of 24 layers of dilated residual convolution blocks with four dilation cycles. There were 128 residual and skip channels, and the filter size was set to three. The model was trained for 1 M steps with a RAdam optimizer. The learning rate was set to 0.001, and it was reduced by half every 200 K steps. The minibatch size was set to eight, and each audio clip was set to 12 K time samples (0.5 seconds).

The experiment involved three Parallel WaveGAN systems, namely, the plain Parallel WaveGAN (Baseline 3) [10], a Parallel WaveGAN with the same noise-shaping method as before (Baseline 4), and the proposed method with the perceptually weighted (PW) criteria (Proposal). All had the same network architecture, consisting of 30 layers of dilated residual convolution blocks with three exponentially increasing dilation cycles. The number of residual and skip channels was set to 64, and the convolution filter size was three. The discriminator consisted of 10 layers of non-causal dilated 1-D convolutions with leaky ReLU activation functions. The strides were set to 1, and linearly increasing dilations were applied to the 1-D convolutions, except for the first and last layers, from 1 to 8. The number of channels and the filter size were the same as in the generator. We applied weight normalization to all convolutional layers of both the generator and the discriminator [31].

The MR-STFT loss was calculated by summing the three STFT losses shown in Table 3, which follow the definitions in the original version [10]. In the proposed method, to obtain the time-invariant masking filter in equation (10), all the LSFs (p = 40) collected from the training data were averaged and converted to the corresponding LP coefficients [32]. For stable convergence, the masking filter's magnitude response was normalized to a range from 0.5 to 1.0 before applying it to the MR-STFT loss. The discriminator loss was computed as the average of the discriminator's per-time-step scalar predictions. The value of the hyperparameter λ_adv in equation (1) was chosen to be 4.0. The models were trained for 400 K steps with RAdam optimization to stabilize training [33]. The discriminator was fixed for the first 100 K steps, and both the generator and discriminator were jointly trained afterwards. The minibatch size was set to 8, and the length of each audio clip was set to 24 K time samples (1.0 second). The initial learning rate was set to 0.0001 for the generator and 0.00005 for the discriminator. The learning rate was reduced by half every 200 K steps.

Table 3: Details of the MR-STFT loss calculations. A Hanning window was applied before the FFT process.

STFT loss    FFT size   Window size    Frame shift
L(1)_stft    512        240 (10 ms)    50 (≈ 2 ms)
L(2)_stft
L(3)_stft

Across all vocoding models, the input auxiliary features were up-sampled by nearest-neighbor up-sampling followed by 2-D convolutions so that the time resolution of the auxiliary features matched the sampling rate of the speech waveforms [9, 34]. In the synthesis step, the acoustic feature vectors were predicted by the acoustic model from the given input text. To enhance spectral clarity, an LSF-sharpening filter was applied to the spectral parameters [23]. Using these features as the conditional inputs, the vocoding models such as WaveNet and Parallel WaveGAN generated the corresponding time sequences of the waveforms.
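Putting the pieces together, a sketch of the generator objective with the settings described in this section (λ_adv = 4.0, the discriminator fixed for the first 100 K steps, and the perceptually weighted MR-STFT loss) might look as follows. It reuses the hypothetical stft_losses and build_weight_matrix helpers sketched earlier; only the first STFT resolution of Table 3 is filled in, and the weight matrices are assumed to be built per resolution and converted to tensors.

```python
import torch

LAMBDA_ADV = 4.0
DISC_START_STEP = 100_000  # generator-only training for the first 100 K steps

# Only the first resolution is recoverable from Table 3; the remaining two
# follow the original Parallel WaveGAN configuration [10].
STFT_CONFIGS = [dict(fft_size=512, hop_size=50, win_size=240)]  # + two more resolutions

def generator_loss(x, x_hat, discriminator, weights, step):
    """L_G = L_mr_stft + lambda_adv * L_adv (eq. 1), with perceptual weighting.

    `weights` maps each STFT configuration index to a (frames, bins) tensor W
    built for that resolution (e.g., with build_weight_matrix).
    """
    # Perceptually weighted MR-STFT loss (eqs. 4, 8, 9), averaged over resolutions.
    mr_stft = 0.0
    for i, cfg in enumerate(STFT_CONFIGS):
        sc, mag = stft_losses(x, x_hat, weight=weights[i], **cfg)
        mr_stft = mr_stft + sc + mag
    mr_stft = mr_stft / len(STFT_CONFIGS)

    loss = mr_stft
    if step >= DISC_START_STEP:
        # Least-squares adversarial loss (eq. 2); the discriminator produces
        # per-time-step scalar predictions, which are averaged.
        adv_loss = torch.mean((1.0 - discriminator(x_hat)) ** 2)
        loss = loss + LAMBDA_ADV * adv_loss
    return loss
```

The discriminator loss of equation (3) would be formed analogously once the adversarial phase begins.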
Table 4: Naturalness MOS test results with 95% confidence intervals for the TTS systems with respect to the different vocoding models. The MOS results for the proposed system are in bold font. KRF and KRM denote the Korean female and male speakers, respectively.
Index    Model                    KRF     KRM
Test 1   WaveNet                  3.64
Test 2   WaveNet + NS
Test 3   Parallel WaveGAN
Test 4   Parallel WaveGAN + NS
Test 5   Parallel WaveGAN + PW    4.21    4.26
Test 6   Raw                      4.64

Naturalness MOS tests were conducted to evaluate the perceptual quality of the proposed system. Twenty native Korean speakers were asked to make quality judgments about the synthesized speech samples using the five following possible responses: 1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; and 5 = Excellent. In total, 30 utterances were randomly selected from the test set and synthesized using the different generation models. Generated audio samples are available at the following URL: https://sewplay.github.io/demos/wavegan-pwsl

Table 4 presents the MOS test results for the TTS systems with respect to the different vocoding models, and the analysis can be summarized as follows. First, in the systems with autoregressive WaveNet vocoders, applying the noise-shaping filter performed significantly better than the plain system (Tests 1 and 2). This confirms that reducing auditory noise in the spectral valley regions was beneficial to perceptual quality. However, the effectiveness of the noise-shaping filter was not evident for the Parallel WaveGAN systems (Tests 3 and 4). Since the training and generation processes are both non-autoregressive, it might be that the characteristics of a noise-shaped target signal were difficult for the model to capture without previous time-step information. Second, the system combining Parallel WaveGAN with the proposed perceptually weighted MR-STFT loss function demonstrated improved quality of synthesized speech (Tests 3 and 5). Because the weighting helped the model reduce generation errors in the spectral valleys, and because the adversarial training method helped capture the characteristics of realistic speech waveforms, the system was able to generate a natural voice within a non-autoregressive framework while providing inference approximately 14.87 K times faster than the best autoregressive model (Test 2). Consequently, the TTS system with the proposed Parallel WaveGAN vocoder achieved MOS results of 4.21 and 4.26 for the female and male speakers, respectively.

5. CONCLUSIONS

This paper proposed a spectral-domain perceptual weighting technique for Parallel WaveGAN-based TTS systems. A frequency-dependent masking filter was applied to the MR-STFT loss function, enabling the system to penalize errors near the spectral valleys. As a result, the generation errors in those frequency regions were reduced, which improved the quality of the synthesized speech. The experimental results verified that a TTS system with the proposed Parallel WaveGAN vocoder performs better than systems with conventional methods. Future research includes further improving the Parallel WaveGAN's perceptual quality by replacing the time-invariant spectral masking filter with a signal-dependent adaptive predictor.
6. REFERENCES

[1] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013, pp. 7962–7966.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[3] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. INTERSPEECH, 2017, pp. 1118–1122.
[4] M.-J. Hwang, F. Soong, E. Song, X. Wang, H. Kang, and H.-G. Kang, "LP-WaveNet: Linear prediction-based WaveNet speech synthesis," arXiv preprint arXiv:1811.11913, 2018.
[5] E. Song, K. Byun, and H.-G. Kang, "ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems," in Proc. EUSIPCO, 2019, pp. 1–5.
[6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018, pp. 3915–3923.
[7] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
[8] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Proc. NIPS, 2016, pp. 4743–4751.
[9] R. Yamamoto, E. Song, and J.-M. Kim, "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation," in Proc. INTERSPEECH, 2019, pp. 699–703.
[10] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.
[11] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, "Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators," arXiv preprint arXiv:2010.14151, 2020.
[12] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881–14892.
[13] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," arXiv preprint arXiv:2005.05106, 2020.
[14] J. Yang, J. Lee, Y. Kim, H.-Y. Cho, and I. Kim, "VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network," in Proc. INTERSPEECH, 2020, pp. 200–204.
[15] M. R. Schroeder, B. S. Atal, and J. Hall, "Optimizing digital speech coders by exploiting masking properties of the human ear," Journal of the Acoustical Society of America, vol. 66, no. 6, pp. 1647–1652, 1979.
[16] S. Ö. Arık, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2019.
[17] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, "An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation," in Proc. ICASSP, 2018, pp. 5664–5668.
[18] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders," in Proc. INTERSPEECH, 2019, pp. 1308–1312.
[19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794–2802.
[20] Q. Tian, X. Wan, and S. Liu, "Generative adversarial network based speaker adaptation for high fidelity WaveNet vocoder," in Proc. SSW, 2019, pp. 19–23.
[21] B. Bollepalli, L. Juvela, and P. Alku, "Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis," in Proc. INTERSPEECH, 2017, pp. 3394–3398.
[22] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. INTERSPEECH, 2017, pp. 3642–3646.
[23] E. Song, F. K. Soong, and H.-G. Kang, "Effective spectral and excitation modeling techniques for LSTM-RNN-based speech synthesis systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2152–2161, 2017.
[24] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. INTERSPEECH, 2017, pp. 4006–4010.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[26] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. T. Zhou, "Neural speech synthesis with Transformer network," in Proc. AAAI, 2019, pp. 6706–6713.
[27] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," in Proc. ASRU, 2019, pp. 214–221.
[28] E. Song, M.-J. Hwang, R. Yamamoto, J.-S. Kim, O. Kwon, and J.-M. Kim, "Neural text-to-speech with a modeling-by-generation excitation vocoder," in Proc. INTERSPEECH, 2020, pp. 3570–3574.
[29] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249–256.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[31] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. NIPS, 2016, pp. 901–909.
[32] F. Soong and B. Juang, "Line spectrum pair (LSP) and speech data compression," in Proc. ICASSP, 1984, pp. 37–40.
[33] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," arXiv preprint arXiv:1908.03265, 2019.
[34] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016.