HooliGAN: Robust, High Quality Neural Vocoding
Ollie McCarthy*, Zohaib Ahmed
Resemble AI, Toronto, Canada
[email protected], [email protected]
(* Corresponding author)
Abstract
Recent developments in generative models have shown that deep learning combined with traditional digital signal processing (DSP) techniques can successfully generate convincing violin samples [1], that source excitation combined with WaveNet yields high-quality vocoders [2, 3], and that generative adversarial network (GAN) training can improve naturalness [4, 5]. By combining the ideas in these models we introduce HooliGAN, a robust vocoder that achieves state-of-the-art results, fine-tunes very well to smaller datasets (< 30 minutes of speech data) and generates audio at 2.2 MHz on GPU and 35 kHz on CPU. We also show a simple modification to Tacotron-based models that allows seamless integration with HooliGAN. Results from our listening tests show the proposed model's ability to consistently output high-quality audio with a variety of datasets, big and small. We provide samples at the following demo page: https://resemble-ai.github.io/hooligan_demo/
Keywords: neural vocoder, text to speech, DDSP, GAN, NSF
1. Introduction
Since the introduction of WaveNet [6], deep neural network based vocoders have been shown to be vastly superior in naturalness compared to traditional parametric vocoders. Unfortunately, the original WaveNet suffers from slow generation due to its high complexity and auto-regressive generation. Other works address this issue by reducing parameters and complexity (WaveRNN [7], LPCNet [8], FFTNet [9]) and/or replacing the auto-regressive generation with parallel generation (Parallel WaveNet [10], WaveGlow [11], ClariNet [12], MelGAN [4], Parallel WaveGAN [5]).

Most of these vocoders consume log mel spectrograms that are predicted by a text-to-acoustic model such as Tacotron [13, 14]. However, if there is sufficient noise in these predicted features, entropy increases in the vocoder, creating artifacts in the output signal. Phase is also an issue. If a vocoder is trained directly with discrete samples (e.g., with a cross-entropy loss between predicted and ground-truth samples), the result can be a characteristic "smeared" sound quality. This is because a periodic signal composed of different harmonics can have an infinite amount of variation in its discrete waveform while sounding identical. In this training scenario, the vocoder is forced to "solve for phase".

Differentiable Digital Signal Processing (DDSP) [1], while not strictly a vocoder, showed that it is possible to leverage traditional DSP components like oscillators, filters, amplitude envelopes and convolutional reverb to generate convincing violin sounds, and that it can do so without needing to explicitly "solve for phase". Meanwhile, the Neural Source Filter (NSF) [2, 3] family of vocoders shows that an F0 (fundamental frequency) driven source excitation combined with neural filter blocks (i.e., simplified WaveNet blocks) generates outputs with naturalness competitive with vocoders that only take log mel spectrograms as input.

In this work we take the ideas behind DDSP and NSF and combine them into an efficient and robust model, whereby the source excitation is inspired by DDSP and the filtering is inspired by NSF's neural filter blocks. To improve naturalness we also utilise the discriminator module and adversarial training outlined in MelGAN. What we arrive at is a model with impressive sound quality, inference speed and robustness.

In the next section we review some key ideas from the DDSP, NSF and MelGAN models that we use in this work. In Section 3 we outline the HooliGAN model, and in Section 4 we evaluate the proposed model's naturalness, robustness and performance. We then discuss our findings in Section 5.
2. Background
DDSP [1] comprises three main parts: an encoder that takes in log mel spectrograms; a decoder that predicts the sequences of parameters driving an additive oscillator, namely noise filter coefficients (via spectral predictions), loudness distributions for the oscillator harmonics and amplitude envelopes for both the waveform and the noise; and finally a learned impulse response with which the signal is convolved, in effect applying reverberation to the output.

While this model excels at generating realistic violin samples, when it comes to modelling speech we found that it cannot model highly detailed transients at the sample level, since the primary waveshaping components, i.e., the filter and envelopes, operate at the frame level. We also found that the sinusoidal oscillator cannot generate convincing speech waveforms on its own. Our interest in DDSP primarily concerns the additive oscillator and the model's ability to learn time-varying amplitude envelopes.
NSF [2, 3] comprises two excitation sources, namely a fundamental pitch (F0) driven sinusoidal signal and a constant Gaussian noise signal, each of which is gated by an unvoiced/voiced (UV) switch, filtered by a simplified WaveNet (which the authors refer to as a neural filter) and passed through a Finite Impulse Response (FIR) filter. While the sound quality of these vocoders is competitive, we surmise that some of the heavy lifting of the 6 WaveNet stacks could be offloaded to computationally lighter DSP components such as the additive oscillator and loudness envelopes, as described in Section 3.2.

Figure 1: Schematic diagram of HooliGAN.
MelGAN [4] and Parallel WaveGAN [5] are the first competitive GAN-based vocoder models. While Parallel WaveGAN has superior naturalness in its output, MelGAN's discriminator has widely-strided, large-kernel convolutions that are particularly well suited to a phase-agnostic training objective, and we incorporate that into our proposed model.
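To give a flavour of what such a discriminator looks like, the sketch below is a simplified, single-scale example of a widely-strided, large-kernel waveform discriminator. It is not the published MelGAN architecture: the channel widths, kernel sizes, strides and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StridedWaveDiscriminator(nn.Module):
    """Simplified sketch of a MelGAN-style waveform discriminator:
    large-kernel, widely-strided 1-d convolutions whose coarse receptive
    field makes the objective largely phase-agnostic. All sizes here are
    illustrative, not the published configuration."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(1, 16, kernel_size=15, stride=1, padding=7),
            nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4),
            nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=16),
            nn.Conv1d(256, 256, kernel_size=5, stride=1, padding=2),
        ])
        self.out = nn.Conv1d(256, 1, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # x: (batch, 1, samples). Returns per-patch scores and the
        # intermediate feature maps used for feature matching.
        feats = []
        for layer in self.layers:
            x = self.act(layer(x))
            feats.append(x)
        return self.out(x), feats
```

In MelGAN several such discriminators are applied at different audio resolutions; the adversarial losses in Section 3 reuse that multi-discriminator setup.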
3. HooliGAN
In the proposed model, as shown in Figure 1, we take as input log mel spectrograms $X_{\hat{i}j}$, an F0 pitch sequence $f_{\hat{i}}$ and a UV voicing sequence $v_{\hat{i}}$, which we extract from $f_{\hat{i}}$ with:

$$v_{\hat{i}} = \begin{cases} 1 & \text{if } f_{\hat{i}} > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

Note that, for all equations, $\hat{i}$ is the time axis at the frame level and $i$ is the time axis at the sample level, while $j$ is used interchangeably for the channel/harmonic axes.

$X_{\hat{i}j}$, $f_{\hat{i}}$ and $v_{\hat{i}}$ are concatenated into $C_{\hat{i}j}$ and put through two 1-d convolutional layers with 256 channels, a kernel size of 5 and LeakyReLU non-linearity [15]. A fully connected layer then outputs a sequence of vectors that is split in two along the channel axis, $[H_{osc}, H_{noise}]$, for controlling the oscillator and the noise generator.

The additive oscillator module takes in $H_{osc}$ from the encoder, transforms it with a fully connected layer and applies the modified sigmoid non-linearity described in [1]. We then use linear interpolation to upsample from the frame level to the sample level. Here the parameters are split into a sequence of distributions $A_{ij}$ controlling the loudness of each harmonic and an overall amplitude envelope $\alpha_i$ for all harmonics.

Before generating the harmonics we create frequency values $F_{ij}$ for $k$ harmonics by taking the input F0 sequence $f_{\hat{i}}$, upsampling it to the sample level with linear interpolation to get $f_i$, and multiplying by the integer sequence:

$$F_{ij} = j f_i \quad \forall j \in [1, 2, 3, \ldots, k]. \qquad (2)$$

Note that before we upsample $f_{\hat{i}}$ we interpolate across the unvoiced segments, in order to avoid glissando artifacts resulting from quick jumps from 0 Hz to the voiced frequency.

We then create a mask $M_{ij}$ for all frequency values in $F_{ij}$ above 3.3 kHz. We found that if we did not mask out higher frequencies, the WaveNet would become an identity function early in training and sound quality would not improve with further training. If $s$ is the sampling rate:

$$M_{ij} = \begin{cases} 0 & \text{if } F_{ij} > 3300/s \\ 1 & \text{otherwise.} \end{cases} \qquad (3)$$

We get the oscillator phase for each time step and harmonic with the cumulative summation operator along the time axis:

$$\theta_{ij} = 2\pi \sum_{n=1}^{i} F_{nj}. \qquad (4)$$

We then generate all harmonics with:

$$P_{ij} = \alpha_i M_{ij} A_{ij} \sin(\theta_{ij} + \phi_j), \qquad (5)$$

where $\phi_j$ is randomised with values in $[-\pi, \pi]$. Finally, we zero out the unvoiced part of the oscillator output $P_{ij}$ by upsampling $v_{\hat{i}}$ to $v_i$ with nearest-neighbour upsampling and broadcast-multiplying:

$$O_{ij} = v_i P_{ij}. \qquad (6)$$
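To make the oscillator concrete, the following is a minimal PyTorch sketch of Eqs. (2)-(6). It is an illustration only: the function name, the tensor layout and the assumption that the per-sample F0 is already normalised by the sampling rate are ours rather than a reference implementation.

```python
import math
import torch

def harmonic_oscillator(f0, amps, alpha, voiced, sample_rate=22050, max_hz=3300.0):
    """Sketch of the additive oscillator, Eqs. (2)-(6).
    f0:     (T,)   per-sample fundamental frequency, normalised by sample_rate
    amps:   (T, k) per-harmonic loudness distribution A_ij
    alpha:  (T,)   overall amplitude envelope
    voiced: (T,)   0/1 voicing decision upsampled to the sample level
    Returns the stacked harmonics O_ij with shape (T, k)."""
    k = amps.shape[1]
    j = torch.arange(1, k + 1, dtype=f0.dtype)                  # harmonic indices 1..k
    freqs = f0.unsqueeze(1) * j                                  # Eq. (2): F_ij = j * f_i
    mask = (freqs <= max_hz / sample_rate).to(f0.dtype)          # Eq. (3): drop harmonics above 3.3 kHz
    phase = 2.0 * math.pi * torch.cumsum(freqs, dim=0)           # Eq. (4): cumulative-sum phase
    phi = (2.0 * torch.rand(k) - 1.0) * math.pi                  # random initial phase in [-pi, pi]
    partials = alpha.unsqueeze(1) * mask * amps * torch.sin(phase + phi)  # Eq. (5)
    return voiced.unsqueeze(1) * partials                        # Eq. (6): silence unvoiced regions
```

In the model itself, the loudness distribution, the amplitude envelope and the voicing decision come from the encoder outputs described above, upsampled from the frame level to the sample level.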
Table 1: MOS test results with 95% confidence intervals for inverting ground-truth acoustic features from LJSpeech.

Model         MOS
MelGAN        3.02 ± 0.08
WaveGlow      3.77 ± 0.07
WaveRNN       3.77 ± 0.07
HooliGAN      4.07 ± 0.06
Ground Truth  4.29 ± 0.06
The noise generator takes in $H_{noise}$ from the encoder and transforms it with a fully connected layer with the modified sigmoid [1] to a sequence $\beta_{\hat{i}}$. This is upsampled with linear interpolation to the sample level to get $\beta_i$, the amplitude envelope for the noise. We then get the output with:

$$z_i = a \beta_i n_i, \qquad (7)$$

where $a$ is a learned parameter initialised with $(2\pi)^{-1}$ and $n_i \sim \mathcal{N}(0, 1)$. We then convolve $z_i$ with a 257-length impulse response $h_{noise}$ with learnable parameters:

$$z_i = z_i \ast h_{noise}. \qquad (8)$$

We concatenate the stacked harmonics from the oscillator $O_{ij}$ with the shaped noise $z_i$ to get $I_{ij}$, the direct input to the WaveNet module. $C_{\hat{i}j}$ is upsampled with linear interpolation to $C_{ij}$ and used as side-conditioning in the residual blocks. We remove the gating function in favour of a simple tanh activation, similar to [2, 3], and also remove the residual skip collection. However, unlike NSF, where the WaveNet channels are reduced to 1 dimension at the output of each stack, we keep the number of channels constant throughout.

For the WaveNet hyperparameters we use 3 stacks, each with 10 layers. The convolutional layers have 64 channels and the kernel size is 5 with a dilation exponent of 2. We note that this is approximately 2 times more computationally efficient than the NSF WaveNet, which has a total of 6 stacks.

Finally, we convolve the output of the WaveNet $w_i$ with a 257-length learned impulse response $h_{output}$ to get the predicted output:

$$\hat{y}_i = w_i \ast h_{output}. \qquad (9)$$

While this component was originally designed to model reverberation via a long 1-2 second impulse response, early experiments had unwanted echo artifacts. With further tweaking we found that a much shorter impulse helps shape the frequency response of the output, so we kept this component during the development of the model.
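Similarly, a minimal sketch of the noise branch in Eqs. (7)-(8) is shown below. The kernel flip (PyTorch's conv1d is strictly a cross-correlation) and the "same" padding are implementation choices of ours, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def noise_branch(beta, h_noise, a):
    """Sketch of the noise generator, Eqs. (7)-(8).
    beta:    (T,)   noise amplitude envelope at the sample level
    h_noise: (257,) learnable impulse response
    a:       scalar learned gain (initialised as described in Section 3)."""
    n = torch.randn_like(beta)                       # n_i ~ N(0, 1)
    z = a * beta * n                                 # Eq. (7)
    # Eq. (8): convolve with the 257-tap impulse response, keeping the length.
    kernel = h_noise.flip(0).view(1, 1, -1)
    z = F.conv1d(z.view(1, 1, -1), kernel, padding=h_noise.numel() // 2)
    return z.view(-1)
```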
Table 2: MOS test results with 95% confidence intervals for inverting acoustic features predicted by Tacotron2 trained on small datasets.

Dataset   Vocoder   MOS
SpeakerA  WaveRNN   4.10 ± 0.07
SpeakerA  HooliGAN  4.49 ± 0.05
SpeakerB  WaveRNN   3.18 ± 0.09
SpeakerB  HooliGAN  4.05 ± 0.06
We use a multi-STFT (Short-Time Fourier Transform) loss similar to DDSP [1] for both the output of the WaveNet $\hat{y}_i$ and the WaveNet input $I_{ij}$, where we sum along the stacked axis (harmonics plus noise) to get a 1-d signal $o_i$:

$$o_i = \sum_j I_{ij}. \qquad (10)$$

The full multi-STFT loss is then:

$$L_{stft} = L_{mag}(y_i, o_i) + L_{mag}(y_i, \hat{y}_i), \qquad (11)$$

$$L_{mag}(y, \hat{y}) = \frac{1}{n} \sum_n \Big( \lVert S_n(y) - S_n(\hat{y}) \rVert + \lVert \log S_n(y) - \log S_n(\hat{y}) \rVert \Big), \qquad (12)$$

where $y_i$ is the ground-truth audio and $S_n$ computes the magnitude of the STFT at six resolutions, with FFT sizes $n$ ranging downwards from 2048 and 75% overlap.

We use adversarial training similar to that described in [4, 5], adapting unofficial open-source code (https://github.com/kan-bayashi/ParallelWaveGAN), with multiple MelGAN discriminators $D_k$, $\forall k \in [1, 2, 3]$, of the exact same architecture. The generator's adversarial loss $L_{adv}$ is:

$$L_{adv} = \frac{1}{k} \sum_k \lVert 1 - D_k(\hat{y}_i) \rVert. \qquad (13)$$

The generator's feature-matching loss $L_{fm}$, where $l$ denotes each convolutional layer of the discriminator model, is:

$$L_{fm} = \frac{1}{kl} \sum_k \sum_l \lVert D_k^l(y_i) - D_k^l(\hat{y}_i) \rVert. \qquad (14)$$

Our final generator loss $L_G$, with $\tau = 4$ and $\lambda = 25$ to prevent the multi-STFT loss overpowering the adversarial and feature-matching losses, is:

$$L_G = L_{stft} + \tau (L_{adv} + \lambda L_{fm}). \qquad (15)$$

The discriminator loss $L_D$ is calculated with:

$$L_D = \frac{1}{k} \sum_k \Big( \lVert 1 - D_k(y_i) \rVert + \lVert D_k(\hat{y}_i) \rVert \Big). \qquad (16)$$
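For illustration, a minimal sketch of the multi-resolution magnitude loss of Eq. (12) follows. The exact set of FFT sizes is partly garbled in our source, so the DDSP-style sizes used as defaults are an assumption, and the element-wise errors are averaged rather than reproducing the paper's norms exactly.

```python
import torch

def stft_mag(x, n_fft):
    # Magnitude spectrogram with 75% overlap, as used in Eq. (12).
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft=n_fft, hop_length=n_fft // 4, win_length=n_fft,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_stft_loss(y, y_hat, fft_sizes=(2048, 1024, 512, 256, 128, 64)):
    """Sketch of L_mag in Eq. (12): linear plus log magnitude distances
    averaged over several STFT resolutions. fft_sizes is an assumption."""
    loss = 0.0
    for n_fft in fft_sizes:
        s, s_hat = stft_mag(y, n_fft), stft_mag(y_hat, n_fft)
        loss = loss + (s - s_hat).abs().mean() + (s.log() - s_hat.log()).abs().mean()
    return loss / len(fft_sizes)
```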
We first pretrain the generator with only the multi-STFT loss $L_{stft}$ for 100k steps, similar to [5], after which we switch to the full adversarial objective ($L_G$ and $L_D$). We use the RAdam optimiser [16] with separate fixed learning rates for the generator and the discriminator, and no weight decay for either optimiser. We train with a batch size of 16, with $y_i$ having a duration of 11,000 samples. The mel spectrograms have 80 frequency bins, a hop size of 12.5 ms, a window of 50 ms and an FFT size of 2048. We alter the F0 frames from Parselmouth/Praat such that they are centered like Librosa's spectrograms and have an equal hop size. We use a sampling rate of 22.05 kHz. A single Nvidia 2080Ti RTX card is utilised for all experiments and we train for 500k-1.5M steps.

Table 3: MOS test results with 95% confidence intervals for inverting acoustic features predicted by Tacotron2 trained on the LJSpeech dataset.

Model         MOS
WaveRNN       3.65 ± 0.08
HooliGAN      4.18 ± 0.07
Ground Truth  4.28 ± 0.06
Our text-to-acoustic model is Tacotron2 [14], which we modify as follows:

1. During training, we predict an F0 value along with each mel-spectrogram frame and re-scale the F0 to the range [0, 1] by dividing by the maximum-frequency parameter of the F0 estimation algorithm. During inference, we re-scale back to hertz and divide by the sample rate to get the correct frequency value for the oscillator.

2. The predicted F0 bypasses the prenet; it is instead concatenated to the prenet output and then reshaped with a fully connected layer before entering the decoder RNN.

3. The training objective is altered to become:

$$L_{tts} = \lVert S_{\hat{i}j} - \hat{S}_{\hat{i}j} \rVert + \kappa \lVert f_{\hat{i}} - \hat{f}_{\hat{i}} \rVert, \qquad (17)$$

where $S_{\hat{i}j}$ is the mel spectrogram and $\kappa = 2$.
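As a small illustration, the modified objective in Eq. (17) could be implemented along the following lines; the use of an L1 distance is our assumption, since Eq. (17) only specifies a norm.

```python
import torch
import torch.nn.functional as F

def tacotron_loss(mel_true, mel_pred, f0_true, f0_pred, kappa=2.0):
    # Eq. (17): mel-spectrogram reconstruction plus a weighted F0 term.
    # L1 is an assumed choice of norm; kappa = 2 as given in the text.
    return F.l1_loss(mel_pred, mel_true) + kappa * F.l1_loss(f0_pred, f0_true)
```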
4. Experiments
We use four English-language datasets in total for testing sound quality:

1. LJSpeech [17]: a 24-hour single-speaker female dataset featuring Linda Johnson from LibriVox, with 13,100 transcribed utterances.

2. VCTK [18]: a multi-speaker dataset with 109 unique speakers. Each speaker reads approximately 400 sentences. We downsample from 48 kHz to 22.05 kHz for our experiments.

3. SpeakerA: a 30-minute proprietary single-speaker female dataset with 582 utterances.

4. SpeakerB: a 30-minute proprietary single-speaker male dataset with 500 utterances.
For each experiment we recruit 30 workers from Amazon Mechanical Turk (AMT) for a Mean Opinion Score (MOS) study. We require that the workers have AMT's high-performance "Master" status and live in English-speaking countries. We evaluate each model in each experiment with 20 samples. In order to avoid the "louder sounds better" perceptual bias, we normalise the loudness of all samples following the EBU R128 recommendation (https://tech.ebu.ch/docs/r/r128-2014.pdf).
Table 4: Inference speeds for all models in our experiments, with default PyTorch settings on an Intel Core i9-7920X CPU @ 2.90 GHz and an Nvidia 2080Ti RTX GPU.

Model      Parameters   CPU       GPU
WaveGlow   87.9M        6 kHz     155 kHz
WaveRNN    4.5M         20 kHz    43 kHz
MelGAN
HooliGAN                35 kHz    2.2 MHz

To test the proposed model's sound quality we design four experiments.
We compare the ability of the proposed model to invert ground-truth acoustic features against WaveRNN, MelGAN and WaveGlow. For MelGAN and WaveGlow we use the pretrained models released by Descript and Nvidia on PyTorch Hub (https://pytorch.org/hub/). For WaveRNN we use the pretrained Mixture of Logistics (MOL) model available in our github repository (https://github.com/fatchord/WaveRNN). All pretrained models are trained on the LJSpeech dataset [17]. Since we cannot fully control the train/validation/test split of all these pretrained models, we need to ensure that the evaluation data we use was not seen by the models during training. To this end, we gather some recordings of Linda Johnson that were published after the release of the original LJSpeech dataset (https://librivox.org/the-great-events-by-famous-historians-volume-3-by-charles-f-horne/). We note that these newer recordings have an almost identical recording quality to those in the original dataset.

In Table 1 we see that HooliGAN achieves a leading MOS score of 4.07. While the output is clear and high-quality, we do notice that the transients can sometimes be too short, with a click-like quality. We also note that both WaveRNN and WaveGlow have a "smeared" characteristic in their outputs, most likely because those models are trained directly on the discrete waveform. We surmise that MelGAN performed poorly mainly because its generator network is under-powered and its receptive field too small.

In the second experiment, we test the ability to invert acoustic features predicted by Tacotron2 trained on LJSpeech. We pick sentences from the same evaluation data as in Section 4.3.1. Table 3 summarises the results, with HooliGAN outperforming WaveRNN by a wide margin. We also note that the HooliGAN MOS is quite close to that of the ground truth in this experiment.
In the third experiment, we compare the combination of Tacotron2 with the proposed model and with WaveRNN when fine-tuning on datasets with only 30 minutes of data. We pretrain Tacotron2 and the vocoder models with VCTK and fine-tune afterwards, picking the best-performing checkpoints for all models. In Table 2 we see again that HooliGAN outperforms WaveRNN. While there is an improvement for both speakers, we note that the improvement is larger for the male speaker. We conclude that having an explicit pitch signal feeding the vocoder is more important for male voices than for female voices, as the harmonics tend to be too compressed in mel spectrograms of male voices.

Finally, we test the inference speed of all models in this paper on both CPU and GPU with standard, non-optimised PyTorch [19] code. In Table 4 we see that MelGAN clearly outperforms all models. However, HooliGAN still performs respectably, and while it has fewer parameters than MelGAN, the deeply-stacked nature of WaveNet limits its overall speed. WaveRNN is slowed down by its auto-regressive generation, and WaveGlow's theoretically fast parallel generation is limited by the high complexity of its large parameter count.
5. Conclusions and Future Work
As we can see from the outlined experiments, HooliGAN outperforms all tested models by a large margin in a variety of testing scenarios. We conclude that the source-excitation method combined with traditional DSP techniques not only reduces the complexity of the model, but also improves the overall sound quality. In future work we will explore ways to better model transients, investigate other methods of source excitation to further reduce complexity, explicitly model background noise with DSP components and raise the sampling rate to CD-quality 44.1 kHz.
6. Acknowledgements
We would like to thank Jeremey Hsu, Corentin Jemine, John Meade, Zihan Jin, Zak Semenov, Aditya Tirumala Bukkapatnam, Haris Khan, Tedi Papajorgji and Saqib Muhammad from Resemble AI for their feedback and support.
7. References

[1] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing," 2020.
[2] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter waveform models for statistical parametric speech synthesis," 2019.
[3] X. Wang and J. Yamagishi, "Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis," 2019.
[4] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," 2019.
[5] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," 2019.
[6] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," 2016.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," 2018.
[8] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," 2018.
[9] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.
[10] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," 2017.
[11] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," 2018.
[12] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," 2018.
[13] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," 2017.
[14] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," 2017.
[15] A. L. Maas, "Rectifier nonlinearities improve neural network acoustic models," 2013.
[16] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," 2019.
[17] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[18] C. Veaux, J. Yamagishi, and K. MacDonald, English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. [Online]. Available: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019.