Universal Neural Vocoding with Parallel WaveNet
Yunlong Jiao*, Adam Gabryś*, Georgi Tinchev†, Bartosz Putrycz*, Daniel Korzekwa*, Viacheslav Klimkov*
* Amazon.com   † University of Oxford

Corresponding author email: [email protected]. Work done while Georgi Tinchev was an intern at Amazon. We would like to thank Alexis Moinet and Vatsal Aggarwal for insightful research discussions.
ABSTRACT
We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse ages and genders, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We show that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall. We also show that the proposed vocoder outperforms several existing neural vocoder architectures in terms of naturalness and universality. These findings are consistent when we further test on more than 300 open-source voices.
Index Terms: Neural vocoder, Text-to-speech, Scalability
1. INTRODUCTION
As voice-based human-machine interaction becomes an increasingly crucial part of artificial intelligence, text-to-speech (TTS) remains an important yet challenging problem. While some TTS systems synthesise speech from normalised text or phonemes in an end-to-end manner [1, 2], most TTS systems address the problem in a two-step approach. The first step transforms text to lower-resolution intermediate representations, such as time-aligned acoustic features [3], or spectral features such as the mel-spectrogram [4, 5]. The second step transforms the intermediate representations into a high-fidelity audio signal using a model referred to as a vocoder.

State-of-the-art vocoders are neural network-based generative models [6, 7, 8, 9, 10, 11]. Neural vocoders are capable of synthesising natural-sounding speech, but are typically prone to overfitting to the training data and do not generalise well to unseen voices [12]. Training speaker-dependent vocoders requires significant computational resources and large amounts of audio data for each target speaker [13]. A high-quality speaker-independent vocoder, or so-called universal vocoder, is therefore key to scaling up production of TTS systems designed to support many voices.

A few recent studies investigated the possibility of building universal vocoders. There were some early reports
that speaker-independent vocoders underperform speaker-dependent vocoders [14, 15]. Deep Voice 2 [16] modelled speaker identities by using trainable speaker embeddings in its vocoder as part of a multi-speaker TTS system. These systems require speaker identities to be modelled explicitly, and hence cannot handle unseen speakers out-of-the-box. There are also reports of neural vocoders that are capable of synthesising unseen speakers or styles without having to explicitly model speaker identities [17, 10, 9, 11, 18]. However, none of these vocoders was thoroughly evaluated to claim universality. In particular, it was not clear how well a speaker-independent vocoder performs on a target voice compared to a dedicated vocoder built specifically for that voice. The closest setting to ours is the work of Lorenzo-Trueba et al. [19], where a WaveRNN-based universal vocoder is capable of synthesising a wide range of speakers, styles, and conditions. Unfortunately, Universal WaveRNN is autoregressive, and thus inherently slow in sample generation, posing significant difficulties for most real-time applications. To the best of our knowledge, it remains unclear whether any non-autoregressive neural vocoder can be universal.

The contributions of this work are: 1) We present a universal neural vocoder based on Parallel WaveNet [8]. The key component of our universal vocoder is an additional conditioning network, called Audio Encoder, which auto-encodes reference waveforms into utterance-level global conditioning. 2) Based on a large-scale evaluation, we show that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall. It is capable of synthesising a wide range of in-domain and out-of-domain voices, speaking styles, and languages. 3) We perform extensive benchmark studies on internal and open-source voices, comparing several existing neural vocoder architectures in terms of naturalness and universality. Results show that our universal vocoder has a clear advantage over the other candidates.
2. SYSTEM DESCRIPTION

2.1. Parallel WaveNet
Parallel WaveNet (PW) [8] is a non-autoregressive neural vocoder architecture that transforms a sequence of input noise into audio waveforms in parallel. It can synthesise samples very efficiently by fully exploiting the computational power of modern deep learning hardware.

Fig. 1: Universal Parallel WaveNet with Audio Encoder. During training, the reference audio is processed by three Audio Encoding Layers on successively average-pooled time-scales, each built from ReflectionPad1D, weight-normalised (strided) Conv1D + Leaky ReLU blocks, and global max pooling; the concatenated outputs are projected by a dense layer to μ and σ, with z ~ N(0, 1). The mel-spectrogram of the target audio feeds a BiLSTM conditioner, and the two streams are broadcast-concatenated and upsampled before conditioning Parallel WaveNet.

In our early experiments, we found that PW trained on a multi-speaker dataset underperforms speaker-dependent PW. We conjecture that it is empirically difficult to obtain a non-autoregressive vocoder able to faithfully reconstruct the phase structure of speech signals from speakers of diverse ages, genders, speaking styles, and languages.
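To make the notion of parallel sampling concrete, below is a minimal conceptual sketch, in PyTorch, of how a Parallel-WaveNet-style student maps input noise to a waveform in a single parallel pass. This is a schematic under our own naming, not the paper's implementation: "flows" stands in for the causal WaveNet blocks that emit per-sample scales and shifts.

# Conceptual sketch (our own simplification): stacked flows transform input
# noise into a waveform with no sample-by-sample recursion, which is what
# makes Parallel-WaveNet-style generation fast on parallel hardware.
import torch

def parallel_generate(flows, noise: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """noise: (batch, samples); cond: (batch, samples, channels)."""
    x = noise
    for flow in flows:
        # Each flow sees the whole sequence at once and returns a per-sample
        # scale and shift, so every timestep is computed in parallel.
        scale, shift = flow(x, cond)
        x = x * scale + shift
    return x

# Toy usage with a trivial "flow" returning constant scale/shift:
toy_flow = lambda x, c: (torch.full_like(x, 0.5), torch.zeros_like(x))
wav = parallel_generate([toy_flow] * 4, torch.randn(1, 24000), torch.zeros(1, 24000, 8))
print(wav.shape)  # torch.Size([1, 24000])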
2.2. Universal Parallel WaveNet

Our universal vocoder is based on PW, which generates speech by conditioning on the mel-spectrogram. In order to make PW universal, we propose an additional conditioning network, called Audio Encoder, designed to explicitly model aspects of speech signals that are not provided by the mel-spectrogram conditioning. The Audio Encoder encodes a reference waveform into a fixed-dimensional feature vector, which is then fed as utterance-level global conditioning into PW. In the rest of the paper, we refer to PW with the additional Audio Encoder conditioning as Universal Parallel WaveNet (UPW).

The block diagram of the conditioning networks of the proposed UPW is shown in Figure 1. The vocoder model has two conditioning networks: an Audio Encoder and a mel-spectrogram conditioner. First, the Audio Encoder is a multi-scale audio feature extractor, heavily inspired by the design of MelGAN's discriminator [10]. It consists of 3 identical audio encoding layers that operate on different time-scales of the reference waveform, obtained by average pooling between layers. Each audio encoding layer uses a sequence of strided convolutional layers with a large kernel size [10, Appendix A], where each convolutional layer is weight-normalised and activated by Leaky ReLU. Each encoding layer outputs 16 channels in its last layer, followed by global max pooling and a dense layer. To prevent information leakage to the vocoder, we applied amortised variational encoding [20] to the output of the Audio Encoder. In the end, we obtain a 48-dimensional audio feature vector (16 channels from each of the 3 encoding layers) as utterance-level global conditioning. Second, we adopt the mel-spectrogram conditioner proposed by [3], consisting of 2 bidirectional LSTMs with a hidden size of 128 channels. The mel-spectrogram is extracted from ground-truth audio with 80 coefficients and frequencies ranging from 50 Hz to 12 kHz. Finally, the outputs of the two conditioning networks are broadcast-concatenated and upsampled by repetition from frame level (80 Hz) to sample level (24 kHz).
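The sketch below renders the Audio Encoder described above in PyTorch. Only the three time-scales separated by average pooling, the weight-normalised strided Conv1D + Leaky ReLU blocks, the 16-channel outputs with global max pooling, and the 48-dimensional variational embedding come from the text; the specific kernel sizes, strides, channel widths, and pooling parameters are illustrative assumptions loosely following MelGAN's discriminator [10], and the per-branch dense layer is folded into the μ/σ heads for brevity.

# A minimal sketch of the Audio Encoder, under the assumptions stated above.
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class AudioEncodingLayer(nn.Module):
    """One time-scale branch: weight-normalised (strided) Conv1D stack with
    Leaky ReLU, ending in 16 channels followed by global max pooling."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReflectionPad1d(7),
            weight_norm(nn.Conv1d(1, 16, kernel_size=15)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(16, 64, kernel_size=41, stride=4,
                                  padding=20, groups=4)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(64, 16, kernel_size=5, padding=2)),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):                  # x: (batch, 1, samples)
        h = self.net(x)                    # (batch, 16, frames)
        return h.max(dim=-1).values        # global max pool -> (batch, 16)

class AudioEncoder(nn.Module):
    """Three identical branches on average-pooled time-scales, combined into
    the variational parameters of a 48-dim utterance-level embedding."""
    def __init__(self, dim=48):
        super().__init__()
        self.branches = nn.ModuleList(AudioEncodingLayer() for _ in range(3))
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)
        self.to_mu = nn.Linear(dim, dim)          # dense heads for mu / log-sigma
        self.to_log_sigma = nn.Linear(dim, dim)

    def forward(self, wav):                # wav: (batch, 1, samples)
        feats, x = [], wav
        for branch in self.branches:
            feats.append(branch(x))        # (batch, 16) per time-scale
            x = self.pool(x)               # coarser time-scale for next branch
        h = torch.cat(feats, dim=-1)       # (batch, 48)
        mu = self.to_mu(h)
        log_sigma = self.to_log_sigma(h)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterisation
        return z, mu, log_sigma

At training time z would be fed to PW as global conditioning, with the encoder regularised towards the N(0, I) prior following [20]; this is what allows the prior centroid to be substituted at inference, as described next.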
During training, the target waveform is naturally used as the reference waveform. This way, the entire architecture can be viewed as a conditioned Variational Auto-Encoder (VAE) on the audio waveform, where the Audio Encoder network is an encoder and the PW network is a decoder conditioned on the mel-spectrogram. Once the model is trained, doing inference with the proposed UPW requires either a reference waveform input to the encoder, or pre-generated audio features in place of the Audio Encoder output. We investigated several inference strategies (e.g. using a speaker-specific or style-specific centroid embedding in place of e), and found that using e = 0 generates high-quality audio (Figure 1). In fact, e = 0 corresponds to the speaker-agnostic centroid embedding of the VAE prior distribution, and using it for inference is shown to improve generalisation of UPW to unseen voices. This simple yet effective approach results in a UPW with little computational overhead compared to a basic PW, so both share the same real-time factor in production.

Following the "teacher-student" training paradigm [8], we first train a Universal WaveNet teacher, and then train a UPW student from it. For the teacher network, we use 24 layers with 4 dilation-doubling cycles, 128 residual/gating/skip channels, kernel size 3, and an output distribution of a 10-component mixture of logistics. For the student network, we use [10, 10, 10, 30] flow layers with dilation reset every 10 layers, 64 residual channels, and no skip connections. Both models were trained on sliced mel-spectrogram conditioning corresponding to short audio clips, with the Adam optimizer [21] and a constant learning rate until convergence. The teacher uses batch size 64 and 0.3625 s audio clips; the student uses batch size 16 and 0.85 s audio clips. During distillation, the student reuses the pre-trained conditioning networks of the teacher (including the Audio Encoder). We found that training the conditioning networks from scratch often leads to a worse student, a phenomenon also observed in [1].
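To make the inference-time conditioning concrete, here is a minimal sketch of how the two conditioning streams could be combined, assuming e = 0 in place of the Audio Encoder output, an 80 Hz frame rate, and 24 kHz audio (an upsampling factor of 300, i.e. 24000 / 80). Tensor shapes and names are our own, not the paper's.

# Sketch of inference-time conditioning with the prior centroid e = 0.
import torch

def build_conditioning(mel_features: torch.Tensor, e_dim: int = 48,
                       upsample_factor: int = 300) -> torch.Tensor:
    """mel_features: (batch, frames, channels), e.g. the BiLSTM output."""
    batch, frames, _ = mel_features.shape
    e = torch.zeros(batch, 1, e_dim)            # speaker-agnostic prior centroid
    e = e.expand(batch, frames, e_dim)          # broadcast to every frame
    cond = torch.cat([mel_features, e], dim=-1)             # broadcast-concatenate
    return cond.repeat_interleave(upsample_factor, dim=1)   # frame -> sample level

For example, a (1, 100, 256) BiLSTM output yields a (1, 30000, 304) sample-level conditioning tensor.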
3. EXPERIMENTAL PROTOCOL
Our training and evaluation protocol was inspired by [19], and adapted with a particular focus on universal vocoding of speech. We collected a multi-speaker multi-lingual training set for the proposed universal vocoder. It consists of 78 different internal, high-quality voices (20 male and 58 female) with approximately 3,000 utterances per speaker, covering a total of 28 languages (including dialects) in 16 unique speaking styles (e.g. neutral, long-form reading, and several emotional styles at different degrees of intensity). This training set was designed with the expectation that vocoders should be capable of synthesising a variety of voices, styles, and languages.

We perform analysis-synthesis on natural recordings, and design two types of evaluations on re-synthesised samples. (We also performed experiments on TTS samples using spectrograms generated by a Tacotron2-based architecture; conclusions drawn from re-synthesised samples and TTS samples remain consistent.)

• In Section 4.1, we compare the proposed UPW with speaker-dependent PW (SDPW) on internal voices for which we have trained a high-quality SDPW. We show that the proposed vocoder is universal, in the sense that it shows no degradation when compared to speaker-dependent vocoders specific to each voice.

• In Section 4.2, we benchmark UPW against several other popular neural vocoder architectures in terms of universality. The competing vocoders are Universal WaveRNN (UWRNN) [19], Parallel WaveGAN (PWGAN) [11], and WaveGlow (WGlow) [9]. All of these systems were retrained on the same training set as our UPW, using an open-source implementation or a reimplementation of the default setup of each paper. We evaluate these vocoders on internal high-quality voices as well as external voices recorded in vastly different conditions.

Table 1 summarises the statistics of the test sets in our experiments. Note that test voices are selected such that they are balanced according to gender and age.

Table 1: Summary of test sets. Note that "seen/unseen" refers to whether they were exposed during training.

Section   Test sets                               Recording quality
4.1       24 internal voices (seen and unseen)    High (studio)
4.2       19 internal voices (seen and unseen)    High (studio)
4.2       LibriTTS-clean [22]                     High
4.2       LibriTTS-other [22]                     Medium
4.2       Common Voice [23]                       Low

The naturalness perceptual evaluation was designed as a MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) test [24], where participants were presented with the systems being evaluated side by side and asked to rate them in terms of naturalness and audio quality (glitches, clicks, noise, etc.) from 0 (poorest) to 100 (best). Each test utterance is evaluated by 10 to 15 listeners who are either native or educated speakers of the target language. We also include recordings in all MUSHRA tests as the hidden upper-anchor system, and we do not force listeners to rate at least one system at 100.

Paired two-sided Student t-tests with Holm-Bonferroni correction were used to validate the statistical significance of the differences between two systems at a p-value threshold of 0.05. We refer to the ratio between the mean MUSHRA score of a system and that of natural recordings as relative MUSHRA (denoted by Rel.). Relative MUSHRA illustrates the gap between the system being evaluated and the reference.
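As an illustration of this methodology, the sketch below computes relative MUSHRA and runs paired two-sided t-tests with a hand-rolled Holm-Bonferroni step-down. This is not the authors' evaluation code, and the toy data at the end is purely for demonstration.

# Sketch of the statistical methodology described above.
import numpy as np
from scipy.stats import ttest_rel

def relative_mushra(system_scores, recording_scores):
    """Ratio of a system's mean MUSHRA score to that of natural recordings."""
    return np.mean(system_scores) / np.mean(recording_scores)

def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: returns a rejection decision per test."""
    p = np.asarray(p_values)
    order = np.argsort(p)                       # handle smallest p-value first
    reject = np.zeros(len(p), dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - rank):   # threshold relaxes step by step
            reject[idx] = True
        else:
            break                               # stop at the first failure
    return reject

# Toy example: paired per-utterance scores for two systems on 5 voices.
rng = np.random.default_rng(0)
p_vals = [ttest_rel(rng.normal(60, 5, 50), rng.normal(58, 5, 50)).pvalue
          for _ in range(5)]                    # paired two-sided t-tests
print(holm_bonferroni(p_vals, alpha=0.05))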
4. RESULTS

4.1. Comparison with speaker-dependent vocoders
In this section, we compare the proposed universal vocoder UPW with speaker-dependent vocoders (SDPW). MUSHRA evaluations are carried out on 24 high-quality internal voices in distinct speaking styles and languages, using Amazon's internal evaluation platform, where listeners are professionally trained for voice quality assessment.

Results show that UPW significantly outperforms SDPW overall, with a relative MUSHRA of 84.24% vs 83.12% (p-value = 0). Breaking the results down by voice, UPW shows a statistically significant improvement over SDPW on 7 voices, and is comparable to SDPW on the remaining 17 voices without a statistically significant difference. Table 2 lists some voices for which the relative MUSHRA achieved by UPW is evenly distributed within the range from highest (94.45%) to lowest (65.62%). This is a very strong result for the proposed universal vocoder: not only does it avoid degradation compared to speaker-specific vocoders on all tested voices, it also improves the vocoding quality on many voices by using information contained in the speech signals of related speakers. Moreover, Figure 2 shows that UPW consistently outperforms SDPW on voices and speaking styles that were either exposed during training (in-domain) or not (out-of-domain).

Fig. 2: MUSHRA evaluation for comparison with speaker-dependent vocoders.

Table 2: MUSHRA scores on internal voices (unseen marked by *). The p-value signifies the difference between UPW and SDPW. Note that the voices are selected so that the relative MUSHRA of UPW is evenly distributed within the range from highest to lowest.

Voice (Sex)           Age      Rec.    SDPW    UPW    UPW Rel.   p-val.
British Eng. (F)      Adult    71.64   65.69   –      –          –
Australian Eng. (M)   Adult    73.52   –       –      –          –
*US Eng. (M)          Senior   70.40   57.65   –      –          –
US Spanish (F)        Adult    73.71   48.07   –      –          –

We now focus our evaluation on different speaking styles, as we found that it is usually challenging to vocode highly expressive speech even for a well-trained SDPW. Table 3 summarises some typical styles in our evaluation. We find that UPW is comparable to SDPW on neutral, emotional (e.g. excited, disappointed), and long-form reading styles. For some expressive styles, such as conversational, news briefing, and singing, UPW statistically significantly outperforms SDPW. In particular, the most challenging style we consider in our evaluation is singing, which we believe is the most expressive type of speech. While both UPW and SDPW indeed achieve their lowest relative MUSHRA on this style (79.06% vs 69.46%), UPW sees the greatest improvement over SDPW here. Notably, UPW outperforms SDPW on the singing style by 6.91 MUSHRA points on average, closing the gap between recordings and SDPW by 31.44%. This strongly evidences the superiority of the proposed universal vocoder.

Table 3: MUSHRA scores on typical styles. The p-value signifies the difference between SDPW and UPW.

Style               Rec.    SDPW    UPW    UPW Rel.   p-val.
Emotional           71.59   60.74   –      –          –
Long-form reading   68.60   –       –      –          –
Singing             71.94   49.96   –      –          –
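As a consistency check, the 31.44% gap-closure figure follows directly from the singing row of Table 3 and the stated 6.91-point gain: the gap between recordings and SDPW is 71.94 - 49.96 = 21.98 MUSHRA points, and 6.91 / 21.98 ≈ 0.3144.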
4.2. Comparison with other neural vocoders

In this section, we compare UPW with other popular neural vocoder architectures, namely Universal WaveRNN (UWRNN) [19], Parallel WaveGAN (PWGAN) [11], and WaveGlow (WGlow) [9]. Note that the MUSHRA evaluations in this section are carried out by Clickworker [25].
Internal voices.
We first evaluate the competing systems on 19 high-quality internal voices. The results in Table 4 show that UPW is the best-performing vocoder overall (p-value = 0), achieving the highest average relative MUSHRA of 94.82% among all four competing vocoders. Table 4 lists some voices for which the relative MUSHRA achieved by UPW is evenly distributed within the range from highest (99.78%) to lowest (81.82%). Compared to the other non-autoregressive candidates, UPW statistically significantly outperforms WGlow on all 19 tested voices; it statistically significantly outperforms PWGAN on 16 voices, and the two systems are comparable on the remaining 3 voices without a statistically significant difference. Compared to the autoregressive candidate, UPW is statistically significantly better than UWRNN on 13 voices, comparable on 2 voices, and worse on 4 voices. However, it is worth noting that, due to its autoregressive nature, UWRNN's inference is typically slower than UPW's by orders of magnitude.

Table 4: MUSHRA scores on internal voices (unseen marked by *). All speakers are adults. Note that the voices are selected so that the relative MUSHRA of UPW is evenly distributed within the range from highest to lowest.

Voice (Sex)    Rec.    PWGAN   WGlow   UWRNN   UPW    UPW Rel.
Italian (M)    65.97   55.05   50.67   59.46   –      –
External voices.
We further study the robustness of the pre-trained UPW on open-source voices. To this end, we prepared three test sets with decreasing recording quality: LibriTTS-clean [22], LibriTTS-other [22], and Common Voice [23]. Note that LibriTTS is a multi-speaker corpus of English speech in audiobook reading style, and Common Voice is a database of multi-lingual multi-speaker user recordings. The results in Table 5 show that UPW is a top-performing system across all three sets of external voices. (a) On high-quality voices (LibriTTS-clean), UPW consistently outperforms all other systems, achieving a relative MUSHRA of 98.77%. This implies that UPW can generalise well to out-of-domain voices when the recording conditions are studio-quality. (b) On medium-quality voices (LibriTTS-other), UPW has a clear advantage over the other systems. This strongly suggests that UPW is still capable of synthesising natural-sounding speech in the presence of a reasonable level of noise. (c) On low-quality voices (Common Voice), WGlow and UPW are in fact comparably good, without a statistically significant difference, while UWRNN is the least robust to recording conditions where a significant amount of background noise is present in low-quality speech.

Table 5: MUSHRA scores on external voices.

Dataset          Rec.    PWGAN   WGlow   UWRNN   UPW    UPW Rel.
LibriTTS-clean   70.42   67.40   66.72   68.30   –      98.77%
LibriTTS-other   –       –       –       –       –      –
Common Voice     –       –       –       –       –      –
5. CONCLUSION
In this work, we presented a universal neural vocoder based on Parallel WaveNet, trained on a multi-speaker multi-lingual speech dataset. It is capable of synthesising a wide range of voices, styles, and languages, and is particularly suitable for scaling up production of real-time TTS. The key component of the proposed universal vocoder is an additional conditioning network called Audio Encoder, which auto-encodes reference waveforms into utterance-level global conditioning. Based on a large-scale evaluation, our universal vocoder outperforms speaker-dependent vocoders overall. We have also reported extensive studies benchmarking several existing neural vocoder architectures in terms of naturalness and universality, and showed that our universal vocoder has the clear advantages of being non-autoregressive and superior in quality in the vast majority of cases.

There are still interesting research directions that we leave to future work. First, it would be interesting to generalise the proposed Audio Encoder to encode reference waveforms into local conditioning, which can better represent the localised features of speech signals. Second, we could study whether the proposed Audio Encoder would also benefit multi-speaker training of other neural vocoders such as WaveGlow [9] and Parallel WaveGAN [11]. Third, it is worth investigating how our universal vocoder performs in challenging vocoding scenarios, such as overlapping voices and non-speech vocalisations (e.g. shouts, breath, reverberation).

6. REFERENCES
[1] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in International Conference on Learning Representations (ICLR), 2019.
[2] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," CoRR, vol. abs/2006.03575, 2020.
[3] S. Ö. Arik, M. Chrzanowski, A. Coates, et al., "Deep Voice: Real-time neural text-to-speech," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 195–204.
[4] Y. Wang, R. J. Skerry-Ryan, D. Stanton, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," CoRR, vol. abs/1703.10135, 2017.
[5] J. Shen, R. Pang, R. J. Weiss, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP, 2018, pp. 4779–4783.
[6] A. van den Oord, S. Dieleman, H. Zen, et al., "WaveNet: A generative model for raw audio," in The 9th ISCA Speech Synthesis Workshop, 2016, p. 125.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan, et al., "Efficient neural audio synthesis," in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 2415–2424.
[8] A. van den Oord, Y. Li, I. Babuschkin, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 3915–3923.
[9] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP, 2019, pp. 3617–3621.
[10] K. Kumar, R. Kumar, T. de Boissiere, et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems 32, 2019, pp. 14881–14892.
[11] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP, 2020, pp. 6199–6203.
[12] S. Ö. Arik, H. Jun, and G. F. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2019.
[13] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in ICASSP, 2018, pp. 4804–4808.
[14] O. Barbany, A. Bonafonte, and S. Pascual, "Multi-speaker neural vocoder," in Fourth International Conference on IberSPEECH, 2018, pp. 30–34.
[15] E. Song, J. Kim, K. Byun, and H. Kang, "Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems," CoRR, vol. abs/1811.03311, 2018.
[16] A. Gibiansky, S. Ö. Arik, G. F. Diamos, et al., "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems 30, 2017, pp. 2962–2970.
[17] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2017, pp. 712–718.
[18] J. Rohnke, T. Merritt, J. Lorenzo-Trueba, et al., "Parallel WaveNet conditioned on VAE latent vectors," CoRR, vol. abs/2012.09703, 2020.
[19] J. Lorenzo-Trueba, T. Drugman, J. Latorre, et al., "Towards achieving robust universal neural vocoding," in Interspeech, 2019, pp. 181–185.
[20] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[22] H. Zen, V. Dang, R. Clark, et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Interspeech, 2019, pp. 1526–1530.
[23] Common Voice Database, https://voice.mozilla.org/, March 2020.
[24] ITU-R Recommendation BS.1534-1, "Method for the subjective assessment of intermediate sound quality (MUSHRA)."
[25] Clickworker, https://www.clickworker.com/.