PERIODNET: A NON-AUTOREGRESSIVE WAVEFORM GENERATION MODEL WITH A STRUCTURE SEPARATING PERIODIC AND APERIODIC COMPONENTS
Yukiya Hono, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda
Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan
ABSTRACT
We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR waveform generation models can generate speech waveforms in parallel and can be used as a speech vocoder by conditioning on acoustic features. Since a speech waveform contains periodic and aperiodic components, both components should be appropriately modeled to generate a high-quality speech waveform. However, it is difficult to decompose these components from a natural speech waveform in advance. To address this issue, we propose a parallel model and a series model structure separating periodic and aperiodic components. The features of our proposed models are that explicit periodic and aperiodic signals are taken as input, and that external periodic/aperiodic decomposition is not needed in training. Experiments using a singing voice corpus show that our proposed structure improves the naturalness of the generated waveform. We also show that speech waveforms with a pitch outside the range of the training data can be generated with more naturalness.
Index Terms — Neural vocoder, generative adversarial network, singing voice synthesis, text-to-speech synthesis, signal processing
1. INTRODUCTION
In recent years, speech synthesis technology has rapidly improved with the introduction of deep neural networks. In particular, WaveNet [1], which has an autoregressive (AR) structure, directly models the distribution of waveform samples and has demonstrated remarkable performance. WaveNet can be used as a speech vocoder by conditioning on auxiliary features such as mel-spectrograms or acoustic features extracted by a conventional signal processing-based vocoder [2]. It is also used in state-of-the-art speech synthesis systems, where it greatly contributes to improving the quality of synthesized speech [3, 4]. However, WaveNet suffers from slow inference because of its AR mechanism and huge network architecture. Although compact AR models [5, 6] have been proposed to accelerate inference, the speedup is limited because audio samples must be generated sequentially. Thus, such models are not suited for real-time TTS applications.

Recently, significant efforts have been devoted to building non-AR models to resolve this problem. Parallel WaveNet [7] and ClariNet [8] introduce teacher-student knowledge distillation. This framework transfers the knowledge of an AR teacher WaveNet to an inverse autoregressive flow (IAF)-based non-AR student model [9]. The IAF student model is highly parallelizable and can synthesize high-quality waveforms. However, the training procedure is complicated because it requires a well-trained teacher model as well as a mix of distillation and other perceptual training criteria. WaveGlow [10] and FloWaveNet [11], based on flow-based generative models, have been proposed as well. Although these models can be trained directly by minimizing the negative log-likelihood of the training data, they need a huge number of parameters and require many GPU resources to obtain optimal results even for a single-speaker model.

Another approach for parallel waveform generation is to use generative adversarial networks (GANs) [12]. A GAN is a powerful generative model that has been successfully used in various research fields such as image generation [13], speech synthesis [14], and singing voice synthesis [15]. GAN-based models have also been proposed for waveform generation [16, 17]. Since these training frameworks enable models to effectively capture the time-frequency distribution of the speech waveform and improve training stability, GAN-based models are much easier to train than the conventional non-AR methods described above.

A neural vocoder can generate high-fidelity waveforms since it can restore information missing from the acoustic features in a data-driven fashion and is less limited by the knowledge and assumptions behind conventional vocoders [18, 19]. However, this also results in a lack of acoustic controllability and robustness. In fact, it is difficult for a neural vocoder to generate a speech waveform with accurate pitches outside the range of the training data. Methods with explicit periodic input signals [20, 21] and a method with a pitch-dependent convolution mechanism [22] address this problem.

It is known that periodic and aperiodic components are mixed in speech waveforms. Although neural vocoders often model speech waveforms as single signals without considering these mixed components, it is important to take them into account to model speech waveforms more effectively. In particular, when a neural vocoder is used in a singing voice synthesis system [23, 24], the accuracy of pitch and breath sound reproduction has a significant effect on quality and naturalness.
Several methods for decomposing the periodic and aperiodic components contained in a speech waveform have been proposed [25, 26]. However, it is still difficult to decompose them, and it is not optimal to use decomposed waveforms, including decomposition errors, as the training data for neural vocoders.

In this paper, we consider speech waveform modeling in terms of model structures and propose PeriodNet, a non-autoregressive neural vocoder for better speech waveform modeling. We introduce two versions with different model structures, a parallel model and a series model, assuming that the periodic and aperiodic waveforms can be generated from explicit periodic and aperiodic signals, such as a sine wave and a noise sequence, respectively. Our proposed methods can also generate a waveform that includes a pitch outside the range of the training data. Moreover, our models are robust to the input pitch since they generate the periodic and aperiodic waveforms with two separate neural networks.

2. WAVEFORM MODELING

2.1. Autoregressive neural vocoder

In neural vocoders with an AR structure [1, 5, 6], the speech waveform sample at each timestep is modeled as a probability distribution conditioned on past speech samples and auxiliary features such as mel-spectrograms and acoustic features. An overview of the AR neural vocoder is shown in Fig. 1(a). In this paper, we use WaveNet [1] as the neural network that generates waveforms (referred to as a generator in this paper). WaveNet has a stack of dilated causal convolutions with a gated activation function and is capable of modeling speech waveforms with complex periodicity. However, it cannot perform parallel inference and takes a long time to generate waveforms because of the AR structure.
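For illustration, the following is a minimal PyTorch sketch of the building block described above: one residual block with a gated dilated causal convolution. The channel sizes, the kernel size, and the 1x1 projection of the auxiliary features are illustrative assumptions, not the exact configuration used in this paper.

import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """One WaveNet-style residual block with a gated dilated causal convolution."""

    def __init__(self, channels=256, aux_channels=80, kernel_size=2, dilation=1):
        super().__init__()
        # Left-padding by (kernel_size - 1) * dilation keeps the convolution causal.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.cond = nn.Conv1d(aux_channels, 2 * channels, 1)  # projection of auxiliary features
        self.res = nn.Conv1d(channels, channels, 1)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x, c):
        # x: (B, channels, T) hidden state; c: (B, aux_channels, T) upsampled auxiliary features
        h = self.conv(nn.functional.pad(x, (self.pad, 0))) + self.cond(c)
        a, b = h.chunk(2, dim=1)
        z = torch.tanh(a) * torch.sigmoid(b)  # gated activation unit
        return x + self.res(z), self.skip(z)  # residual output and skip contribution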
2.2. Non-autoregressive neural vocoder

In non-AR neural vocoders [7, 8, 10, 11, 16, 17], the neural network represents a mapping function from a pre-generated input signal, such as Gaussian noise, to the speech waveform. Hence, all waveform samples can be generated in parallel without the expense of making predictions autoregressively. However, it is difficult to properly predict a speech waveform with autocorrelation from a noise sequence without autocorrelation. Prior studies [20, 21] have proposed methods that use explicit periodic signals such as sine waves. These methods provide high pitch accuracy and can synthesize waveforms with a pitch not included in the training data. In this paper, following these attempts, we use a sine wave, noise, and a voiced/unvoiced (V/UV) sequence as input signals, as shown in Fig. 1(b). Note that this V/UV sequence is smoothed in advance. Various architectures can be used for the generator in Fig. 1(b); we use a Parallel WaveGAN [16]-based architecture. Details of the model architectures are described in Sec. 3.2.
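To make the input construction concrete, here is a minimal numpy sketch of how such sample-level input signals could be derived from frame-level F0 and V/UV sequences. The 5-ms hop at 48 kHz matches the experimental conditions in Sec. 4.1; the noise scale and the Hanning-window smoothing of the V/UV sequence are assumptions, as the paper only states that the V/UV sequence is smoothed in advance.

import numpy as np

def make_input_signals(f0, vuv, sr=48000, hop=240, noise_std=1.0):
    """Build sample-level sine, noise, and smoothed V/UV inputs.

    f0:  (N,) frame-level fundamental frequency in Hz (0 in unvoiced frames)
    vuv: (N,) frame-level voiced/unvoiced binary flags
    hop: samples per frame (5-ms shift at 48 kHz -> 240)
    """
    f0_up = np.repeat(f0, hop)                       # upsample to the sample level
    vuv_up = np.repeat(vuv.astype(float), hop)
    # Sine wave: integrate the instantaneous frequency to obtain the phase.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    sine = np.sin(phase) * (f0_up > 0)               # silence the sine where F0 is undefined
    noise = np.random.randn(len(f0_up)) * noise_std
    # Smooth the binary V/UV sequence with a short moving-average window (assumed filter).
    win = np.hanning(hop * 2)
    vuv_smooth = np.convolve(vuv_up, win / win.sum(), mode="same")
    return sine, noise, vuv_smooth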
3. PROPOSED MODEL STRUCTURES SEPARATING PERIODIC AND APERIODIC COMPONENTS

3.1. Model structures
A speech waveform contains periodic and aperiodic waveforms. In the structure shown in Fig. 1(b), the generation process of the periodic and aperiodic waveforms is represented by a single model. However, this structure is not always optimal for waveform modeling, especially when the accuracy of pitch and breath sound reproduction significantly affects quality and naturalness, as in singing voice synthesis. We assume that the speech waveform is the sum of periodic and aperiodic components, and that these components can easily be created from a periodic and an aperiodic signal (such as a sine wave and a noise sequence), respectively. Thus, in this paper, we propose a parallel model structure and a series model structure based on these assumptions.

Fig. 1: Structures for speech waveform modeling: (a) AR model; (b) non-AR baseline model; (c) non-AR parallel model; (d) non-AR series model.

The parallel model structure is shown in Fig. 1(c). This structure assumes that the periodic and aperiodic waveforms are independent of each other. An explicit periodic signal consisting of a sine wave and a V/UV sequence is used to predict the periodic waveform, and an explicit aperiodic signal consisting of noise and a V/UV sequence is used to predict the aperiodic waveform.

The series model structure is shown in Fig. 1(d). In this structure, we assume that the aperiodic waveform depends on the periodic waveform, considering the possibility that there is an aperiodic waveform corresponding to the phase of the periodic waveform. Specifically, we introduce a residual connection between the two generators so that the latter generator can predict the aperiodic component while taking the dependence on the periodic component into account.

In both the parallel model and the series model, different acoustic features can be selected as the auxiliary features of the periodic and aperiodic generators, making it possible to obtain more robust neural vocoders with proper conditioning.
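The two proposed structures differ only in what the aperiodic generator receives as input. The following sketch, written in PyTorch style with hypothetical generator modules g_p and g_a, summarizes the data flow of Fig. 1(c) and Fig. 1(d); the channel-wise concatenation of the input signals is an assumption about how the inputs are combined.

import torch

def parallel_forward(g_p, g_a, sine, noise, vuv, c_p, c_a):
    """Fig. 1(c): the two generators run independently and their outputs are summed.

    sine, noise, vuv: (B, 1, T) sample-level input signals
    c_p, c_a: auxiliary features conditioning the periodic/aperiodic generators
    """
    periodic = g_p(torch.cat([sine, vuv], dim=1), c_p)
    aperiodic = g_a(torch.cat([noise, vuv], dim=1), c_a)
    return periodic + aperiodic

def series_forward(g_p, g_a, sine, noise, vuv, c_p, c_a):
    """Fig. 1(d): the aperiodic generator additionally receives the predicted
    periodic waveform, realizing the residual connection between generators."""
    periodic = g_p(torch.cat([sine, vuv], dim=1), c_p)
    aperiodic = g_a(torch.cat([noise, vuv, periodic], dim=1), c_a)
    return periodic + aperiodic

Passing separate conditioning tensors c_p and c_a reflects the remark above that different acoustic features can be selected for the two generators (as in system PM2 in Sec. 4.1).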
3.2. Model architectures

In this paper, we incorporate a Parallel WaveGAN [16]-based framework into our non-AR baseline and proposed models, as shown in Figs. 1(b), 1(c), and 1(d). Each generator has the same architecture as the generator of [16], which is a modified WaveNet-based model with non-causal convolutions. For the discriminators, we utilize a multi-scale architecture with three discriminators that have identical network structures but operate on different audio scales, following [17]. Each discriminator has the same architecture as the discriminator of [16]. These models are trained by optimizing the combination of a multi-resolution short-time Fourier transform (STFT) loss and an adversarial loss in the same fashion as [16].

When training vocoders with the parallel and series model structures, only the final output sequence, i.e., the sum of the two generators' output sequences, is evaluated. This is the same as for the baseline model with the single-model structure. Under the assumptions presented in Sec. 3.1, by inputting the sine wave and the noise sequence separately, each generator should be trained to predict the periodic and aperiodic waveforms, respectively.
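For reference, below is a minimal sketch of a multi-resolution STFT loss in the style of Parallel WaveGAN [16]. The three FFT/hop/window settings follow commonly used Parallel WaveGAN configurations and are assumptions rather than the settings of this paper; the adversarial term is omitted.

import torch

def stft_mag(x, fft_size, hop, win_length):
    # x: (B, T) waveform; returns the clamped STFT magnitude.
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, fft_size, hop, win_length, window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(x_hat, x,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average of spectral-convergence and log-magnitude losses over several STFT settings."""
    loss = 0.0
    for fft_size, hop, win in resolutions:
        m_hat = stft_mag(x_hat, fft_size, hop, win)
        m = stft_mag(x, fft_size, hop, win)
        sc = torch.norm(m - m_hat, p="fro") / torch.norm(m, p="fro")        # spectral convergence
        mag = torch.nn.functional.l1_loss(torch.log(m), torch.log(m_hat))   # log STFT magnitude
        loss = loss + sc + mag
    return loss / len(resolutions)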
4. EXPERIMENTS

4.1. Experimental conditions
Seventy Japanese children's songs (70 min in total) performed by one female singer were used for the experiments. Sixty songs were used for training, and the rest were used for testing. Singing voice signals were sampled at 48 kHz, and each sample was quantized with 16 bits. The auxiliary features consisted of 50-dimensional WORLD mel-cepstral coefficients [19], 25-dimensional mel-cepstral analysis aperiodicity measures, a one-dimensional continuous log fundamental frequency (F0) value, and a one-dimensional voiced/unvoiced binary code. Feature vectors were extracted with a 5-ms shift, and the features were normalized to have zero mean and unit variance before training.

In the training stage, the sine waves input to the non-AR neural vocoders were generated based on the glottal closure points extracted from natural speech using REAPER [27]. The purpose of this is to input a sine wave that is close in phase to the target natural speech during training. Meanwhile, in the synthesis stage, the sine waves were generated based on the F0 values.

The following seven systems were compared:

• WN: The AR WaveNet [1].
• BM1: The non-AR baseline model shown in Fig. 1(b), which uses noise and a V/UV signal as the generator input and is conditioned on all auxiliary features.
• BM2: The non-AR baseline model shown in Fig. 1(b), which uses a sine wave and a V/UV signal as the generator input and is conditioned on all auxiliary features.
• BM3: The non-AR baseline model shown in Fig. 1(b), which uses noise, a sine wave, and a V/UV signal as the generator input and is conditioned on all auxiliary features.
• PM1: The non-AR parallel model shown in Fig. 1(c). The periodic generator takes a sine wave and a V/UV signal as input, and the aperiodic generator takes noise and a V/UV signal as input. Both generators are conditioned on all auxiliary features.
• PM2: The non-AR parallel model shown in Fig. 1(c). Unlike PM1, the aperiodic generator is conditioned on the auxiliary features other than F0.
• SM: The non-AR series model shown in Fig. 1(d). The periodic generator takes a sine wave and a V/UV signal as input, and the aperiodic generator takes noise, a V/UV signal, and the output signal of the periodic generator as input. Both generators are conditioned on all auxiliary features.

WN consisted of 30 layers of dilated residual convolution blocks with causal convolutions. The dilations of WN were set to 1, 2, 4, ..., 512, and the 10 dilation layers were stacked three times. The channel size for the dilation, residual block, and skip connection in WN was set to 256, and the filter size was set to two. The singing voice waveforms used to train WN were quantized from 16 bits to 8 bits using the µ-law algorithm [28].

The generators of BM1, BM2, and BM3 and the periodic generators of PM1, PM2, and SM consisted of 30 layers of dilated residual convolution blocks with three dilation cycles, the same as WN. The aperiodic generators of PM1, PM2, and SM consisted of 10 layers of dilated residual convolution blocks without dilation cycles. The channel size for the dilation, residual block, and skip connection was set to 64, and the filter size was set to three. The discriminators of BM1, BM2, BM3, PM1, PM2, and SM had the multi-scale architecture with three discriminators. The discriminators took the 48 kHz full-resolution waveform and 24 kHz and 16 kHz downsampled waveforms as input, where the downsampling was performed using average pooling. Each discriminator consisted of 10 non-causal dilated convolutions with a leaky ReLU activation function. We applied weight normalization [29] to all convolutional layers.

All models were trained using the RAdam optimizer [30] for 1000K iterations. In BM1, BM2, BM3, PM1, PM2, and SM, the discriminators were fixed for the first 100K iterations, after which the generator and discriminators were jointly trained.
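As a side note on the WN training data, the µ-law companding used to quantize the 16-bit waveforms to 8 bits can be sketched as follows. This is a standard G.711-style implementation [28], not the authors' code.

import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] to 8-bit mu-law codes (G.711-style companding)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)  # quantize to {0, ..., 255}

def mu_law_decode(codes, mu=255):
    """Invert mu_law_encode back to a waveform in [-1, 1]."""
    y = 2.0 * (codes.astype(np.float64) / mu) - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu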
4.2. Experimental results

Fig. 2 and Fig. 3 show the spectrograms for PM1 and SM, respectively. Each figure has three spectrograms: the waveform of the periodic generator's output, that of the aperiodic generator's output, and the sum of the two predicted signals.

Fig. 2: Spectrograms of the waveform generated by the non-AR parallel model: (a) output of the periodic generator; (b) output of the aperiodic generator; (c) sum of the two signals.

Fig. 3: Spectrograms of the waveform generated by the non-AR series model: (a) output of the periodic generator; (b) output of the aperiodic generator; (c) sum of the two signals.

Fig. 2(a) and Fig. 2(b) show that the waveform of the periodic generator contains many harmonic components, while that of the aperiodic generator contains the other frequency components. As seen in the highlighted boxes on the left and in the center, which correspond to a breath and the unvoiced plosive /t/, respectively, the spectra of these unvoiced sounds appear only in the output of the aperiodic generator. The same tendencies can be seen in Fig. 3(a) and Fig. 3(b). These results indicate that the two generators in the parallel model and the series model learn the transformation from the sine wave and the noise sequence to the periodic and aperiodic waveforms, respectively. Comparing the highlighted boxes in the lower right of Fig. 2(b)
and Fig. 3(b), the output waveform of the aperiodic generator in SM contains more harmonic components than that in PM1. This suggests that the periodic waveform input to the aperiodic generator in SM may have leaked into the output of the aperiodic generator, because the output waveform of the periodic generator is fed into the aperiodic generator. It should be noted that some harmonic components are also included in Fig. 2(b), since the periodic and aperiodic waveforms were not explicitly decomposed in the training stage.

We conducted a listening test using WN, BM1, BM2, BM3, and NAT to compare neural vocoders with and without the AR structure, as well as the input signals for the non-AR neural vocoders. Note that NAT denotes a recorded natural waveform. The naturalness of the synthesized singing voice was assessed using the mean opinion score (MOS) test method. The participants were sixteen native Japanese speakers, and each participant evaluated ten phrases randomly selected from the test data. After listening to each test sample in the MOS test, the participants were asked to score the naturalness of the sample on a five-point scale (1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; 5 = Excellent).

The results of this subjective evaluation are shown in Fig. 4.

Fig. 4: Subjective evaluation results of experiment 1.
BM1 yielded a lower MOS value than WN, indicating that it is difficult to generate high-quality singing voices from noise alone. On the other hand, BM2 achieved the same score as WN: by inputting a periodic signal, the neural vocoder can appropriately synthesize waveforms with periodicity despite the lack of an AR structure. However, the waveform of WN contains quantization noise, so the quality of BM2 was still insufficient. BM3, which takes both explicit periodic and aperiodic signals as input, reached a MOS value close to that of NAT. This indicates the effectiveness of using both explicit periodic and aperiodic signals as inputs for non-AR neural vocoders.
To compare the model structures of the non-AR neural vocoders, we conducted two further subjective evaluation experiments using BM3, PM1, PM2, and SM. In these experiments, the samples were generated by the four vocoders conditioned on two different F0 scales: the original scale and a doubled scale. In the experiment with the original F0 scale, we also used the natural waveform NAT for comparison.

The results are presented in Fig. 5 and Fig. 6.

Fig. 5: Subjective evaluation results of experiment 2.

Fig. 6: Subjective evaluation results of experiment 3.

These figures show that
PM1, PM2, and SM attained higher naturalness than BM3. This indicates that introducing a parallel or series structure is effective for non-AR neural vocoders that use explicit periodic signals. Although the differences among PM1, PM2, and SM were negligible when conditioning on the original F0, as shown in Fig. 5, PM2 performed best when conditioning on the doubled F0, as shown in Fig. 6. The waveform samples generated by BM3, PM1, and SM tended to contain more aperiodic waveforms than those generated by PM2. In BM3, the periodic and aperiodic components were not modeled separately, and speech waveforms were generated by a single generator conditioned on auxiliary features including F0. In PM1 and SM, although the networks modeling these components were separate, both aperiodic generators were conditioned on auxiliary features including F0. In particular, the aperiodic generator in SM also depended on the periodic waveform predicted by the periodic generator. Therefore, it appears that BM3, PM1, and SM could not generate aperiodic waveforms appropriately when these vocoders took an out-of-range F0 as an acoustic feature in the synthesis stage. PM2 is more robust to an unseen F0 outside the range of the training data because its aperiodic generator does not depend on the periodic signal or F0.
5. CONCLUSIONS
We introduced PeriodNet, a non-AR neural vocoder with new model structures, to appropriately model the periodic and aperiodic components in a speech waveform. Each generator in the parallel or series model structure can model the periodic or aperiodic waveform without the use of decomposition techniques. The experimental results showed that the proposed methods can generate high-fidelity speech waveforms and improve the ability to generate waveforms with a pitch outside the range of the training data. Future work includes investigating the effect of the proposed methods on different datasets, such as multi-speaker and multi-singer datasets.
6. ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Numbers JP19H04136 and JP18K11163.

7. REFERENCES

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proceedings of Interspeech, 2017, pp. 1118–1122.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proceedings of ICASSP, 2018, pp. 4779–4783.
[4] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[5] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.
[6] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proceedings of ICASSP, 2019, pp. 5891–5895.
[7] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proceedings of ICML, 2018, pp. 3918–3926.
[8] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.
[9] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[10] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proceedings of ICASSP, 2019, pp. 3617–3621.
[11] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," arXiv preprint arXiv:1811.02155, 2018.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[13] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in Proceedings of ICML, vol. 48, 2016, pp. 1060–1069.
[14] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2018.
[15] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in Proceedings of ICASSP, 2019, pp. 6955–6959.
[16] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proceedings of ICASSP, 2020, pp. 6199–6203.
[17] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019, pp. 14910–14921.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[19] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[20] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter waveform models for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2019.
[21] K. Oura, K. Nakamura, K. Hashimoto, Y. Nankaku, and K. Tokuda, "Deep neural network based real-time speech vocoder with periodic and aperiodic inputs," in Proceedings of ISCA SSW10, 2019, pp. 13–18.
[22] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, and T. Toda, "Quasi-periodic Parallel WaveGAN: A non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network," arXiv preprint arXiv:2007.12955, 2020.
[23] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on deep neural networks," in Proceedings of Interspeech, 2016, pp. 2478–2482.
[24] Y. Hono, S. Murata, K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Recent development of the DNN-based singing voice synthesis system – Sinsy," in Proceedings of APSIPA, 2018, pp. 1003–1009.
[25] X. Serra and J. Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.
[26] P. Zubrycki and A. Petrovsky, "Accurate speech decomposition into periodic and aperiodic components based on discrete harmonic transform," in Proceedings of European Signal Processing Conference, 2007, pp. 2336–2340.
[27] "REAPER: Robust epoch and pitch estimator," https://github.com/google/REAPER.
[28] "Pulse code modulation (PCM) of voice frequencies," ITU-T Recommendation G.711, 1988.
[29] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[30] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," in Proceedings of ICLR, 2020.