LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation
Zhen Zeng, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗ Corresponding author: Jianzong Wang, [email protected]
ABSTRACT
In this paper, we propose a novel conditional convolution network, named location-variable convolution, to model the dependencies of the waveform sequence. Different from the unified convolution kernels used in WaveNet to capture the dependencies of arbitrary waveforms, the location-variable convolution uses convolution kernels with different coefficients to perform convolution operations on different waveform intervals, where the coefficients of the kernels are predicted according to conditioning acoustic features, such as mel-spectrograms. Based on location-variable convolutions, we design LVCNet for waveform generation, and apply it in Parallel WaveGAN to build a more efficient vocoder. Experiments on the LJSpeech dataset show that our proposed model achieves a four-fold increase in synthesis speed compared to the original Parallel WaveGAN without any degradation in sound quality, which verifies the effectiveness of location-variable convolutions.
Index Terms — speech synthesis, waveform generation, vocoder, location-variable convolution
1. INTRODUCTION
In recent years, deep generative models have achieved extraordinary success in waveform generation, which promotes the development of speech synthesis systems with human-parity sound. Early research on autoregressive models for waveform synthesis, such as WaveNet [1] and WaveRNN [2], showed far superior performance over traditional parametric vocoders. However, the low inference efficiency of autoregressive neural networks limits their application in real-time scenarios.

In order to address this limitation and improve generation speed, many non-autoregressive models have been studied to generate waveforms in parallel. One family relies on knowledge distillation, including Parallel WaveNet [3] and ClariNet [4], where a parallel feed-forward network is distilled from an autoregressive WaveNet model based on inverse autoregressive flows (IAF) [5]. Although IAF models are capable of generating high-fidelity speech in real time, the requirement for a well-trained teacher model and the intractable density distillation lead to a complicated training process. The other family is flow-based generative models, including WaveGlow [6] and WaveFlow [7]. They are implemented with an invertible network and trained using only a single likelihood loss on the training data. While inference is fast on high-performance GPUs, their large model size limits their application in memory-constrained scenarios. Meanwhile, as another family of generative models, Generative Adversarial Networks (GANs) [8] are also applied in waveform generation, such as MelGAN [9], Parallel WaveGAN [10] and Multi-Band MelGAN [11], in which a generator is designed to produce samples as close as possible to real speech, and a discriminator is implemented to distinguish generated speech from real speech. These models have a very small number of parameters and achieve synthesis speeds far exceeding real time. Impressively, Multi-Band MelGAN [11] runs more than 10x faster than real time on CPU. In addition, WaveGrad [12] and DiffWave [13] apply diffusion probabilistic models [14] to waveform generation, converting a white noise signal into a structured waveform in an iterative manner.

These models are almost all implemented with a WaveNet-like network, in which dilated causal convolutions are applied to capture the long-term dependencies of the waveform, and the mel-spectrum is used as the local conditioning input of the gated activation unit. In order to efficiently capture time-dependent features, a large number of convolution kernels are required in a WaveNet-like network. In this work, we propose the location-variable convolution to model time-dependent features more efficiently. In detail, the location-variable convolution uses convolution kernels with different coefficients to perform convolution operations on different waveform intervals, where the coefficients of the kernels are predicted by a kernel predictor according to conditioning acoustic features, such as mel-spectrograms. Based on location-variable convolutions, we design LVCNet for waveform generation, and apply it in Parallel WaveGAN to achieve a more efficient vocoder. Experiments on the LJSpeech dataset [15] show that our proposed model achieves a four-fold increase in synthesis speed without any degradation in sound quality. Audio samples are available at https://github.com/ZENGZHEN-TTS/LVCNet. The main contributions of our work are as follows:
• A novel convolution method, named location-variable convolution, is proposed to efficiently model time-dependent features; it uses different convolution kernels to perform convolution operations on different waveform intervals.
• Based on location-variable convolutions, we design a network for waveform generation, named LVCNet, and apply it in Parallel WaveGAN to achieve a more efficient vocoder.
• A comparative experiment is conducted to demonstrate the effectiveness of location-variable convolutions in waveform generation.
2. PROPOSED METHODS
In order to model the long-term dependencies of waveforms more efficiently, we design a novel convolution network, named location-variable convolution, and apply it to Parallel WaveGAN to verify its performance. The design details are described in this section.
2.1. Location-variable convolution

In the traditional linear prediction vocoder [16], a simple all-pole linear filter is used to generate the waveform in an autoregressive way, where the linear prediction coefficients are calculated according to the acoustic features. This process is similar to the autoregressive WaveNet vocoder, except that the coefficients of the linear predictor vary from frame to frame, while the coefficients of the convolution kernels in WaveNet are the same for all frames. Inspired by this, we design a novel convolution network with variable convolution kernel coefficients in order to improve the ability to model long-term dependencies for waveform generation.

Denote the input sequence of the convolution as x = {x_1, x_2, ..., x_n}, and the local conditioning sequence as h = {h_1, h_2, ..., h_m}. Each element in the local conditioning sequence is associated with a continuous interval of the input sequence. In order to effectively exploit the local correlation to model the features of the input sequence, the location-variable convolution uses a novel convolution method, where different intervals of the input sequence are convolved with different convolution kernels. In detail, a kernel predictor is designed to predict multiple sets of convolution kernels according to the local conditioning sequence. Each element in the local conditioning sequence corresponds to one set of convolution kernels, which is used to perform convolution operations on the associated interval of the input sequence. In other words, elements in different intervals of the input sequence use their corresponding convolution kernels to extract features, and the output sequence is spliced from the convolution results on each interval.
Fig. 1. An example of the convolution process in the location-variable convolution. According to the conditioning sequence, the kernel predictor generates multiple sets of convolution kernels, which are used to perform convolution operations on the associated intervals of the input sequence. Each element in the conditioning sequence corresponds to 4 elements in the input sequence.

Similar to WaveNet, the gated activation unit is also applied, and the location-variable convolution can be expressed as

    {x^(i)}_{i=1}^{m} = split(x)                                  (1)
    {W_f^(i), W_g^(i)}_{i=1}^{m} = KernelPredictor(h)             (2)
    z^(i) = tanh(W_f^(i) ∗ x^(i)) ⊙ σ(W_g^(i) ∗ x^(i))            (3)
    z = concat(z^(1), ..., z^(m))                                  (4)

where x^(i) denotes the interval of the input sequence associated with h_i, W_f^(i) and W_g^(i) denote the filter and gate convolution kernels for x^(i), ∗ denotes convolution, ⊙ denotes element-wise multiplication, and σ is the sigmoid function.

For a more visual explanation, Figure 1 shows an example of the location-variable convolution. In our opinion, since location-variable convolutions can generate different kernels for different conditioning sequences, they have a more powerful capability of modeling long-term dependencies than a traditional convolutional network. We also analyze their performance for waveform generation experimentally in the next section; a code sketch of Eqs. (1)-(4) is given below.
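For concreteness, the following PyTorch sketch implements Eqs. (1)-(4). The tensor layout, the hop parameter (the interval length associated with each conditioning frame), and the per-sample loop are our own illustrative assumptions rather than the authors' released code; dilation and the handling of interval boundaries are omitted for brevity, and an efficient implementation would batch the per-interval convolutions (e.g., via grouped convolutions).

    import torch
    import torch.nn.functional as F

    def location_variable_convolution(x, kernels_f, kernels_g, hop):
        # x:         (batch, in_ch, n) input sequence
        # kernels_f: (batch, m, out_ch, in_ch, k) filter kernels, one set per
        #            conditioning frame h_i, produced by the kernel predictor
        # kernels_g: gate kernels with the same shape
        # hop:       length of the waveform interval associated with each h_i
        batch, m = kernels_f.shape[0], kernels_f.shape[1]
        k = kernels_f.shape[-1]
        pad = (k - 1) // 2                  # keep each interval's length after conv
        outputs = []
        for i in range(m):                  # Eq. (1): split x into intervals
            seg = F.pad(x[:, :, i * hop:(i + 1) * hop], (pad, pad))
            z_i = []
            for b in range(batch):          # each sample has its own kernels
                f = F.conv1d(seg[b:b + 1], kernels_f[b, i])   # filter branch, Eq. (3)
                g = F.conv1d(seg[b:b + 1], kernels_g[b, i])   # gate branch, Eq. (3)
                z_i.append(torch.tanh(f) * torch.sigmoid(g))
            outputs.append(torch.cat(z_i, dim=0))
        return torch.cat(outputs, dim=-1)   # Eq. (4): splice the interval outputs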
2.2. LVCNet

By stacking multiple layers of location-variable convolutions with different dilations, we design LVCNet for waveform generation, as shown in Figure 2(a). LVCNet is composed of multiple LVCNet blocks, and each LVCNet block contains multiple location-variable convolution (LVC) layers whose dilation coefficients increase exponentially (d = 1, 2, ..., 2^(n-1)) to enlarge the receptive field. A linear layer is applied on the input and output sides of the network to perform channel conversion. The residual connection is deployed in each LVCNet block instead of each LVC layer, which achieves more stable and satisfactory results according to our experimental analysis. In addition, the conditioning sequence is fed into the kernel predictor to predict the coefficients of the convolution kernels in each LVC layer. The kernel predictor consists of multiple residual linear layers with a leaky ReLU activation function, whose output channels are determined by the number of convolution kernel coefficients to be predicted.

Fig. 2. (a) The architecture of LVCNet, which is composed of multiple LVCNet blocks; each LVCNet block contains multiple location-variable convolution (LVC) layers with exponentially increasing dilation coefficients to enlarge the receptive field. (b) The architecture of Parallel WaveGAN with LVCNet. The generator is implemented using LVCNet.

2.3. Parallel WaveGAN with LVCNet

In order to verify the performance of location-variable convolutions, we choose Parallel WaveGAN as the baseline model and use LVCNet to implement the network of the generator, as shown in Figure 2(b). The generator is also conditioned on the mel-spectrum, and transforms the input noise into the output waveform. For a fair comparison, the discriminator of our model keeps the same structure as that of Parallel WaveGAN, and the same loss function and training strategy as Parallel WaveGAN are used to train our model.

Note that the design of the kernel predictor is based on the correspondence between the mel-spectrum and the waveform. A 1D convolution with a kernel size of 5 and zero padding is first used to adjust the alignment between the conditioning mel-spectrum sequence and the input waveform sequence, and multiple stacked 1 × 1 convolutional layers subsequently output the convolution kernels of the location-variable convolutions. In addition, we remove the residual connection of the first LVCNet block in the generator for better model performance. A sketch of such a kernel predictor is given below.
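The sketch below wires up a kernel predictor matching this description: a kernel-size-5 alignment convolution followed by stacked 1 × 1 residual convolutions whose output channels equal the number of kernel coefficients to predict. The layer widths, the leaky-ReLU slope, and the reshaping of the flat output into per-frame kernels are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class KernelPredictor(nn.Module):
        # Predicts filter and gate kernel coefficients for every LVC layer
        # from the conditioning mel-spectrum.
        def __init__(self, n_mels=80, hidden=64, in_ch=8, out_ch=8,
                     kernel_size=3, n_layers=10):
            super().__init__()
            # coefficients per conditioning frame: filter + gate kernels per LVC layer
            self.kernel_dim = n_layers * 2 * out_ch * in_ch * kernel_size
            self.input_conv = nn.Conv1d(n_mels, hidden, kernel_size=5)  # alignment conv
            self.res_convs = nn.ModuleList(
                [nn.Conv1d(hidden, hidden, kernel_size=1) for _ in range(3)])
            self.out_conv = nn.Conv1d(hidden, self.kernel_dim, kernel_size=1)
            self.act = nn.LeakyReLU(0.1)    # slope value is an assumption

        def forward(self, mel):             # mel: (batch, n_mels, frames)
            h = self.act(self.input_conv(mel))
            for conv in self.res_convs:     # residual 1x1 convolutions
                h = h + self.act(conv(h))
            # flat coefficients per frame; the caller reshapes them into
            # (batch, m, out_ch, in_ch, kernel_size) filter/gate kernels
            return self.out_conv(h)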
3. EXPERIMENTS

3.1. Experimental setup
We train our model on the LJSpeech dataset [15]. The entire dataset is randomly divided into three sets: 12,600 utterances for training, 400 utterances for validation, and 100 utterances for testing. The sample rate of the audio is 22,050 Hz. The mel-spectrograms are computed through a short-time Fourier transform with Hann windowing, with an FFT size of 1024, a window size of 1024 and a hop size of 256. The STFT magnitude is transformed to the mel scale using an 80-channel mel filter bank spanning 80 Hz to 7.6 kHz.
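For reproducibility, feature extraction matching these settings can be written with librosa as below; the paper does not name the extraction toolkit or a log-compression step, so both are assumptions here, and the file name is a hypothetical LJSpeech example.

    import librosa
    import numpy as np

    # hypothetical LJSpeech file; 22,050 Hz sampling rate as in the setup above
    wav, sr = librosa.load("LJ001-0001.wav", sr=22050)

    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=1024, win_length=1024, hop_length=256,  # STFT sizes from the setup
        window="hann",                                # Hann windowing
        n_mels=80, fmin=80.0, fmax=7600.0,            # 80-channel mel bank, 80 Hz - 7.6 kHz
        power=1.0,                                    # use the STFT magnitude
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))        # log compression (an assumption)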
In our proposed model, the generator is implemented by LVCNet, and the network structure of the discriminator is consistent with that of the original Parallel WaveGAN. The generator is composed of three LVCNet blocks, where each block contains 10 LVC layers and the number of residual channels is set to 8. The kernel size of the location-variable convolution is set to 3, and the dilation coefficients double with each successive LVC layer in an LVCNet block. The kernel predictor consists of one 1 × 1 convolutional layer and three 1 × 1 residual convolutional layers, where the hidden residual channel is set to 64. Weight normalization is applied to all convolutional layers.

We choose Parallel WaveGAN [10] as the baseline model, and use the same strategy to train our model and the baseline model for a fair comparison. In addition, in order to verify the efficiency of the location-variable convolution in modeling waveform dependencies, we conduct a set of detailed comparison experiments: our proposed model is trained and compared with Parallel WaveGAN under different numbers of residual channels (4, 6, 8 for our model and 32, 48, 64 for Parallel WaveGAN).

In order to evaluate the performance of these vocoders, we use the mel-spectrograms extracted from the test utterances as input to obtain synthesized audio, which is rated together with the ground truth audio (GT) by 20 testers with headphones in a conventional mean opinion score (MOS) evaluation. Audio generated by the Griffin-Lim algorithm [17] is also rated for reference.

The scoring results of our proposed model and Parallel WaveGAN with different residual channels are shown in Table 1, where the real-time factor (RTF) on CPU is also reported (the RTF computation is sketched below). We find that our proposed model achieves almost the same results as Parallel WaveGAN, while the inference speed is increased by a factor of four. The reason for the speed increase is that the small number of residual channels greatly reduces the amount of convolution operations in our model. Meanwhile, our unoptimized model can synthesize multiple utterances at approximately 300 MHz on an NVIDIA V100 GPU, which is much faster than the 35 MHz of Parallel WaveGAN.

In addition, as the number of residual channels decreases, the performance of our model degrades significantly more slowly than that of Parallel WaveGAN. In our opinion, even if the number of residual channels is very small (such as 4 or 8), the convolution coefficients are still adjusted according to the mel-spectrums in our model, which guarantees effective feature modeling.
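As a reference for the numbers in Table 1, the real-time factor can be measured as synthesis time divided by the duration of the generated audio; the sketch below is generic, and vocoder is a placeholder for any mel-to-waveform model (Parallel WaveGAN's generator also takes a noise input, which is elided here).

    import time
    import torch

    def real_time_factor(vocoder, mel, sample_rate=22050, hop_size=256):
        # mel: (1, 80, frames) conditioning input; each frame covers hop_size samples
        with torch.no_grad():
            start = time.perf_counter()
            wav = vocoder(mel)                    # placeholder mel-to-waveform call
            elapsed = time.perf_counter() - start
        audio_seconds = mel.shape[-1] * hop_size / sample_rate
        return elapsed / audio_seconds            # RTF < 1 is faster than real time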
To verify the effectiveness of the proposed model as the vocoder in a TTS framework, we combine it with Transformer TTS [18] and AlignTTS [19] for testing. In detail, according to the texts in the test dataset, Transformer TTS and AlignTTS respectively predict mel-spectrums, which are used as the input of our model (with 8 residual channels) and Parallel WaveGAN to generate waveforms for MOS evaluation. The results are shown in Table 2. Compared with Parallel WaveGAN, our model significantly improves the speed of speech synthesis without degradation of sound quality in feed-forward TTS systems.

In our opinion, due to the mutual independence of the acoustic features (such as mel-spectrums), we can use different convolution kernels to perform convolution operations on different time intervals to obtain more effective feature modeling capabilities. In this work, we just use LVCNet to design a new generator for waveform generation, and obtain a faster vocoder without degradation of sound quality. Considering our previous experiments [20], the effectiveness of the location-variable convolution has been sufficiently demonstrated, and there is still potential for optimization.
Table 1. The comparison between our proposed model (LVCNet) and Parallel WaveGAN (PWG) with different residual channels.

    Method        Size      MOS         RTF (CPU)
    GT            −         . ± .       −
    Griffin-Lim   −         . ± .       −
    PWG-32        0.44 M    . ± 0.23    2.
    PWG-48        0.83 M    . ± 0.10    3.
    PWG-64        1.35 M    . ± 0.08    3.
    LVCNet-4      0.47 M    . ± 0.15    0.
    LVCNet-6      0.84 M    . ± 0.12    0.
    LVCNet-8      1.34 M    . ± 0.10    0.

Table 2. The comparison between our proposed model (LVCNet) and Parallel WaveGAN (PWG) in TTS systems.
    Method                 MOS         Time (s)
    GT                     −           −
    Transformer + PWG      . ± 0.14    2. ± .
    AlignTTS + PWG         . ± 0.09    0. ± .
    Transformer + LVCNet   . ± 0.15    2. ± .
    AlignTTS + LVCNet      . ± .       . ± .
4. CONCLUSION
In this work, we propose the location-variable convolution for time-dependent feature modeling, which uses different kernels to perform convolution operations on different intervals of the input sequence. Based on it, we design LVCNet and implement it as the generator of the Parallel WaveGAN framework to achieve a more efficient waveform generation model. Experiments on the LJSpeech dataset show that our proposed model is four times faster than the baseline Parallel WaveGAN model in inference speed without any degradation in sound quality, which verifies the effectiveness of the location-variable convolution.
5. ACKNOWLEDGMENT
This paper is supported by the National Key Research and Development Program of China under grants No. 2017YFB1401202, No. 2018YFB1003500, and No. 2018YFB0204400. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

6. REFERENCES

[1] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning (ICML), 2018.
[3] Aaron van den Oord, Yazhe Li, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning (ICML), 2018.
[4] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in International Conference on Learning Representations (ICLR), 2018.
[5] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems (NIPS), 2016.
[6] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[7] Wei Ping, Kainan Peng, Kexin Zhao, and Zhao Song, "WaveFlow: A compact flow-based model for raw audio," in International Conference on Machine Learning (ICML), 2020.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014.
[9] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems (NIPS), 2019.
[10] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[11] Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, and Lei Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," in IEEE Spoken Language Technology Workshop (SLT), 2021.
[12] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan, "WaveGrad: Estimating gradients for waveform generation," in International Conference on Learning Representations (ICLR), 2021.
[13] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro, "DiffWave: A versatile diffusion model for audio synthesis," arXiv preprint arXiv:2009.09761, 2020.
[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel, "Denoising diffusion probabilistic models," arXiv preprint arXiv:2006.11239, 2020.
[15] Keith Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[16] Bishnu S. Atal and Suzanne L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, 1971.
[17] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[18] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and M. Zhou, "Neural speech synthesis with transformer network," in AAAI Conference on Artificial Intelligence (AAAI), 2019.
[19] Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, and Jing Xiao, "AlignTTS: Efficient feed-forward text-to-speech system without explicit alignment," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[20] Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, and Jing Xiao, "MelGlow: Efficient waveform generative network based on location-variable convolution," in IEEE Spoken Language Technology Workshop (SLT), 2021.