Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
Dipjyoti Paul, Yannis Pantazis and Yannis Stylianou
Computer Science Department, University of Crete
Inst. of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas
[email protected], [email protected], [email protected]
Abstract
Recent advancements in deep learning led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems into multiple-speaker models, especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN). We target the development of an efficient universal vocoder even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Using publicly available data for training, SC-WaveRNN achieves significantly better performance than baseline WaveRNN on both subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of about 23% for seen speakers and seen recording conditions and up to 95% for unseen speakers and unseen conditions. Finally, we extend our work by implementing multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation. In terms of performance, our system is preferred over the baseline TTS system by 60% versus 15.5% and by 60.9% versus 32.6%, for seen and unseen speakers, respectively.
Index Terms: Universal Vocoder, Speech Synthesis, WaveRNN, Text-to-Speech, Zero-shot TTS.
1. Introduction
Speech synthesis has received attention in the research community as voice interaction systems have been implemented in various applications, such as personalized Text-to-Speech (TTS) systems, voice conversion, dialogue systems and navigation [1, 2, 3, 4]. In the past, conventional statistical parametric speech synthesis (SPSS) exhibited high naturalness under best-case conditions [5, 6]. Hybrid synthesis was also proposed as a way to take advantage of both the SPSS and the unit-selection approach [7, 8]. Most of these TTS systems consist of two modules: the first module converts textual information into acoustic features while the second one, i.e., the vocoder, generates speech samples from the previously generated acoustic information.

Traditional vocoder approaches mostly involved source-filter models for the generation of speech parameters [9, 10, 11, 12]. The parameters were defined by voicing decisions, fundamental frequency (F0), spectral envelope or band aperiodicities. Algorithms like Griffin-Lim utilized spectral representations to generate speech [13, 14]. However, the speech quality of such vocoders was restricted by inaccuracies in parameter estimation. Recently, the naturalness of vocoders has been significantly improved by benefiting from a direct waveform modeling approach. Neural vocoders like WaveNet utilize an autoregressive generative model that can reconstruct the waveform from intermediate acoustic features [15, 16]. To overcome the time complexity at inference, parallel wave generation approaches were adopted to generate speech in real time [17, 18]. Wave Recurrent Neural Networks (WaveRNN), which employ recurrent layers, increase the efficiency of sampling without compromising quality [19]. In particular, WaveRNN can realize real-time high-quality synthesis by introducing a gated recurrent unit (GRU). Although WaveRNN was proposed with a focus on text-to-speech synthesis, our work uses it as a vocoder by changing the conditioning from linguistic to acoustic information. Other recent works can also be found in the literature, notably SampleRNN [20], WaveGlow [21], LPCNet [22] and MelNet [23].

Techniques in neural vocoders involve data-driven learning and are prone to specialize to the training data, which leads to poor generalization capabilities. Moreover, in multi-speaker scenarios, it is practically impossible to cover all possible in-domain (or seen) and out-of-domain (or unseen) cases in the training database. Previous studies also attempted to improve the adaptation capabilities of vocoders [24], either with or without providing speaker information [25, 26]. However, these studies did not address the generalization capabilities for unseen out-of-domain data. In [27], a potential universal vocoder was introduced, claiming that speaker encoding is not essential to train a high-quality neural vocoder.

Inspired by the performance and computational aspects of WaveRNN, we propose a novel approach for designing a universal WaveRNN vocoder. The proposed universal vocoder, speaker conditional WaveRNN (SC-WaveRNN), explores the effectiveness of explicit speaker information, i.e., speaker embeddings, as a condition and improves the quality of generated speech across the broadest possible range of speakers without any adaptation or retraining.
Even though conventional WaveRNN is capable of modeling good temporal structure for a single speaker, it fails to capture the dynamics of multiple speakers. We experimentally demonstrate that the proposed SC-WaveRNN overcomes this limitation by modeling temporal structure from a large variability of data, making it possible to generate high-quality synthetic voices. Our work involves independent training of a speaker-discriminative neural encoder on a speaker verification (SV) task using a state-of-the-art generalized end-to-end loss [28]. The SV model, trained on a large amount of disjoint data, can attain robust speaker representations that are independent of channel conditions and capture a large space of speaker characteristics. Coupling such speaker information with the speech synthesis training also reduces the need for ample high-quality multi-speaker training data. At the same time, it increases the model's ability to generalize. Experimental results based on both objective and subjective evaluation confirm that the proposed method achieves better speaker similarity and perceptual speech quality than baseline WaveRNN for both seen and unseen speakers.
Figure 1:
System overview of the speaker encoder [28]. Features, speaker embeddings and similarity scores from different speakers are represented by different color codes; 'spk' denotes speakers and 'emb' denotes embedding vectors.

In parallel with the above-mentioned studies on universal vocoders, there has been substantial development in multi-speaker TTS where the speaker encoder is jointly trained with the TTS model [29, 30]. Such jointly-trained speaker encoders lead to poor inference performance when applied to data not included in the training dataset. Fine-tuning a pretrained TTS model in combination with speaker embeddings was addressed in [31, 32, 33]. Such approaches always require transcribed adaptation data along with additional computational time and resources to adapt to a new speaker. To overcome this, TTS models can be adapted from a few seconds of a target speaker's voice in a zero-shot manner by solely using a speaker embedding, without retraining the entire model [34, 35, 36].

Unfortunately, limitations still exist and human-level naturalness has not yet been achieved. Additionally, prosody information is mismatched, especially for unseen speakers. To address those issues, we first train a multi-speaker Tacotron conditioned on the speaker embeddings obtained from the independently-trained speaker encoder. Tacotron [37] is a sequence-to-sequence network which predicts mel-spectrograms from text. Next, we incorporate the proposed SC-WaveRNN as a vocoder using the same speaker encoder and synthesize the temporal waveform from the sequence of Tacotron's mel-spectrograms. We compare our system with the baseline TTS method [36], which studies the effectiveness of several neural speaker embeddings in the context of zero-shot TTS. Our results demonstrate that the proposed zero-shot TTS system outperforms the baseline zero-shot TTS of [36] in terms of both speech quality and speaker similarity under both seen and unseen conditions.
2. Neural Speaker Encoder
Our work highlights the importance of the speaker encoder in universal vocoders through the application of a generalized end-to-end (GE2E) SV model trained on thousands of speakers [28]. The encoder network initially computes frame-level feature representations and then summarizes them into utterance-level fixed-dimensional speaker embeddings. The network is trained with the GE2E loss, so that embeddings from the same speaker have high cosine similarity while embeddings from different speakers are far apart in the embedding space. As depicted in Fig. 2, a Uniform Manifold Approximation and Projection (UMAP) visualization shows that the speaker embeddings are perfectly separated, with large inter-speaker distances and very small intra-speaker variance.
The speaker encoder structure is depicted in Figure 1. Log mel-spectrograms are extracted from speech utterances of arbitrary window length. The feature vectors are then assembled into a batch that contains S different speakers, with U utterances per speaker. Each feature vector x_ij (1 ≤ i ≤ S and 1 ≤ j ≤ U) represents the features extracted from utterance j of speaker i. The features x_ij are then passed to the encoder architecture. The final embedding vector e_ij is L2-normalized and is obtained by averaging over the embeddings computed on each window separately.
Figure 2:
UMAP projection of 10 utterances for each of the 10 speakers. Different colors represent different speakers.
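To make the embedding extraction concrete, below is a minimal PyTorch sketch of how the window-level encoder and the utterance-level averaging described above could be implemented. The layer sizes follow the setup reported in Section 5 (3 LSTM layers of 768 cells, a 256-dimensional projection), but the class and function names, the window and hop lengths, and all other details are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Illustrative 3-layer LSTM speaker encoder (names and sizes are assumptions)."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                  # mels: (B, T, n_mels)
        _, (h, _) = self.lstm(mels)           # final hidden state of the last layer
        e = self.proj(h[-1])                  # (B, emb_dim)
        return F.normalize(e, p=2, dim=-1)    # L2-normalize each window embedding

def utterance_embedding(encoder, mel, win=160, hop=80):
    """Average the L2-normalized embeddings of sliding windows over one utterance."""
    starts = range(0, max(1, len(mel) - win + 1), hop)
    windows = [torch.as_tensor(mel[s:s + win], dtype=torch.float32) for s in starts]
    with torch.no_grad():
        embs = encoder(torch.stack(windows))  # (num_windows, emb_dim)
    e = embs.mean(dim=0)                      # frame/window averaging
    return F.normalize(e, p=2, dim=-1)        # fixed-length utterance embedding e_ij
```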
During training, the embedding of every utterance from a particular speaker should be close to the centroid of that speaker's embeddings, while being far from the centroids of all other speakers. The similarity matrix SM_{ij,k} is defined as the scaled cosine similarity between each embedding vector e_{ij} and all speaker centroids c_k (1 ≤ i, k ≤ S and 1 ≤ j ≤ U):

$$
SM_{ij,k} =
\begin{cases}
w \cdot \cos(e_{ij}, c_i^{(-j)}) + b & \text{if } k = i \\
w \cdot \cos(e_{ij}, c_k) + b & \text{otherwise}
\end{cases}
$$

where

$$
c_i^{(-j)} = \frac{1}{U-1} \sum_{u=1,\, u \neq j}^{U} e_{iu}
\quad \text{and} \quad
c_k = \frac{1}{U} \sum_{u=1}^{U} e_{ku}.
$$

Here, w and b are trainable parameters. The ultimate GE2E loss L is the accumulated loss over the similarity matrix (1 ≤ i ≤ S and 1 ≤ j ≤ U) for each embedding vector e_{ij}:

$$
L(\mathbf{x}; \mathbf{w}) = \sum_{i,j} L(e_{ij}) = \sum_{i,j} \Big( -SM_{ij,i} + \log \sum_{k=1}^{S} \exp(SM_{ij,k}) \Big).
$$

The softmax applied over the similarity matrix drives the output towards 1 when k = i and towards 0 otherwise.
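A compact sketch of the GE2E similarity matrix and loss defined above is given below, assuming a batch of already L2-normalized embeddings arranged as a (speakers, utterances, dimension) tensor; the function name and tensor layout are assumptions, not the reference implementation of [28].

```python
import torch
import torch.nn.functional as F

def ge2e_loss(emb, w, b):
    """emb: (S, U, D) L2-normalized embeddings; w, b: trainable scalars (w typically kept positive)."""
    S, U, D = emb.shape
    centroids = emb.mean(dim=1)                                   # c_k, shape (S, D)
    # Leave-one-out centroids c_i^(-j), used for the same-speaker (k = i) entries.
    loo = (emb.sum(dim=1, keepdim=True) - emb) / (U - 1)          # (S, U, D)

    # Cosine similarity of every e_ij with every centroid c_k -> (S, U, S).
    sim_all = F.cosine_similarity(emb.unsqueeze(2),
                                  centroids.view(1, 1, S, D), dim=-1)
    # Cosine similarity of e_ij with its own leave-one-out centroid -> (S, U).
    sim_pos = F.cosine_similarity(emb, loo, dim=-1)

    # SM_{ij,k}: take the leave-one-out value on the diagonal k = i.
    eye = torch.eye(S, dtype=torch.bool).unsqueeze(1)             # (S, 1, S)
    sm = w * torch.where(eye, sim_pos.unsqueeze(-1), sim_all) + b

    # L(e_ij) = -SM_{ij,i} + log sum_k exp(SM_{ij,k}), summed over the batch.
    idx = torch.arange(S)
    loss = -sm[idx, :, idx] + torch.logsumexp(sm, dim=-1)
    return loss.sum()
```

In practice, emb would hold the utterance embeddings of U utterances from each of S speakers in a training batch, and w and b would be registered as learnable parameters of the encoder.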
3. Speaker conditional WaveRNN
In the literature, convolutional models have been thoroughly explored and have achieved excellent performance in speech synthesis [15, 18], yet they are prone to instabilities. Recurrent neural networks (RNNs) are expected to provide more stable, high-quality speech due to the persistence of the hidden state.
Our WaveRNN implementation is based on the repository at https://github.com/fatchord/WaveRNN, which is heavily inspired by WaveRNN training [19]. The architecture is a combination of residual blocks and an upsampling network, followed by GRU and FC layers, as depicted in Fig. 3. It can be divided into two major networks: a conditioning network and a recurrent network. The conditioning network consists of a residual network and an upsampling network with three scaling factors. At the input, we first map the acoustic features, i.e., mel-spectrograms, to a latent representation with the help of multiple residual blocks. The latent representation is then split into four parts which are later fed as input to the recurrent network. The upsampling network is implemented to match the desired temporal size of the input signal. The outputs of these two convolutional networks, i.e., the residual and upsampling networks, along with the speech samples, are fed into the recurrent network. As part of the recurrent network, two uni-directional GRUs are employed with a few fully-connected (FC) layers at the end. By design, this reduces the overhead complexity with fewer parameters and takes advantage of temporal context for better prediction.

Figure 3:
Block diagram of proposed SC-WaveRNN training.
The above autoregressive model can generate state-of-the-art natural-sounding speech; however, it needs large amounts of training data to train a stable high-quality model, and scarcity of data remains a core issue. Moreover, a key challenge is its generalization ability. We observe degradation in speech quality and speaker similarity when the model generates waveforms from speakers that are not seen during training.

In order to assist the development of a stable universal vocoder and remove the data dependency, we propose an alternative training module referred to as speaker conditional WaveRNN (SC-WaveRNN). In SC-WaveRNN, the output of the speaker encoder is used as additional information to control the speaker characteristics during both training and inference. This additional information plays a pivotal role in generating more stable, high-quality speech across all speaker conditions. The direct estimation of the raw audio waveform y = {y_1, y_2, ..., y_N} is described by the conditional probability distribution

$$
p_{\text{SC-WaveRNN}}(\mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}; h_t; \mathbf{e}; \lambda),
$$

where e is the 256-dimensional speaker embedding vector and h_t denotes the conditioning acoustic features. The speaker encoder is independently trained on a large diversity of multi-speaker data so that it generalizes sufficiently to produce meaningful embeddings. The embedding vector e is computed in an utterance-wise manner: for each utterance, the final embedding vector is averaged over all frames and is hence fixed for the whole utterance. The embedding vector is concatenated with the output of the conditioning network and the speech samples before being fed to the recurrent network. The SC-WaveRNN training scheme is presented in Figure 3. In addition, we apply a continuous univariate distribution constituting a mixture of logistic distributions [17], which allows us to easily calculate the probability of the observed discretized value y_t. Finally, a discretized mixture-of-logistics loss is applied to the discretized speech.
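The following sketch illustrates how the fixed speaker embedding e could be broadcast over time and concatenated with the upsampled conditioning features and the previous samples before the recurrent network, as described above. The module is deliberately simplified (a single GRU and a generic FC head), and all names and sizes are assumptions; it is not the exact SC-WaveRNN architecture of Figure 3.

```python
import torch
import torch.nn as nn

class SCWaveRNNSketch(nn.Module):
    """Toy illustration of speaker-conditional autoregressive modelling
    p(y_t | y_{t-1}, h_t, e); structure and sizes are assumptions."""
    def __init__(self, cond_dim=128, emb_dim=256, rnn_dim=512, n_logistics=10):
        super().__init__()
        self.rnn = nn.GRU(1 + cond_dim + emb_dim, rnn_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(rnn_dim, rnn_dim), nn.ReLU(),
                                  nn.Linear(rnn_dim, 3 * n_logistics))

    def forward(self, y_prev, cond, spk_emb):
        # y_prev:  (B, T, 1)        previous waveform samples
        # cond:    (B, T, cond_dim) upsampled acoustic conditioning features h_t
        # spk_emb: (B, emb_dim)     fixed per-utterance speaker embedding e
        e = spk_emb.unsqueeze(1).expand(-1, cond.size(1), -1)  # broadcast e over time
        x = torch.cat([y_prev, cond, e], dim=-1)               # concatenate before the GRU
        out, _ = self.rnn(x)
        # Per-sample mixture-of-logistics parameters (weight, mean, log-scale per component).
        return self.head(out)
```

At training time the predicted mixture parameters would be scored against the observed samples with the discretized mixture-of-logistics loss; at inference, samples are drawn one step at a time and fed back as y_prev.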
4. Zero-shot Text-to-Speech
The use of the auxiliary speaker encoder enables us to propose a TTS system capable of generating high-fidelity synthetic voices for unseen speakers without retraining the Tacotron and vocoder models. Such speaker adaptation to completely new speakers is called zero-shot learning. This speaker-aware TTS system mimics the voice characteristics of a completely unseen speaker from only a few seconds of a speech sample.
Figure 4:
Block diagram of the proposed zero-shot TTS.
Our proposed system is composed of three separately trained networks, illustrated in Figure 4: (a) a neural speaker encoder, based on GE2E training, (b) a multi-speaker Tacotron architecture [37], which predicts a mel-spectrogram from text, conditioned on the speaker embedding vector, and (c) the proposed speaker conditional WaveRNN, which converts the spectrogram into a time-domain waveform. First, the speaker embedding is extracted from the target speaker's utterance using the speaker encoder. At each time step, the embedding vector of the target speaker is then concatenated with the character embeddings before being fed into the encoder-decoder module. The final output is a mel-spectrogram. To convert the predicted mel-spectrograms into audio, we use SC-WaveRNN, which is independently trained by conditioning on the additional speaker embeddings. Due to the generalization capabilities of the models, combining multi-speaker Tacotron with SC-WaveRNN can achieve efficient zero-shot adaptation to unseen speakers. We compare the proposed zero-shot system with a recently proposed zero-shot TTS [36] as the baseline system. There, the best performing system uses a multi-speaker Tacotron with gender-dependent WaveNet vocoders as the TTS system, and x-vectors with learnable dictionary encoding as the speaker encoder network.
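A high-level sketch of this three-stage inference pipeline is shown below. All class and method names (SpeakerEncoder.embed_utterance, Tacotron.infer, SCWaveRNN.generate) are placeholders standing in for the three trained models, not an actual released API.

```python
import torch

def zero_shot_tts(text, reference_wav, speaker_encoder, tacotron, sc_wavernn):
    """Synthesize `text` in the voice of `reference_wav` without retraining any model."""
    with torch.no_grad():
        # (a) Fixed-length speaker embedding from a few seconds of reference speech.
        spk_emb = speaker_encoder.embed_utterance(reference_wav)        # e.g. (256,)
        # (b) Multi-speaker Tacotron: text + speaker embedding -> mel-spectrogram.
        mel = tacotron.infer(text, speaker_embedding=spk_emb)
        # (c) SC-WaveRNN conditioned on the same embedding -> time-domain waveform.
        wav = sc_wavernn.generate(mel, speaker_embedding=spk_emb)
    return wav
```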
5. Experimental Setup
The speaker encoder training has been conducted on three public datasets: LibriSpeech, VoxCeleb1 and VoxCeleb2, containing utterances from over 8k speakers [34]. Log mel-spectrograms are first extracted from audio frames of width 25 ms and step 10 ms. Voice Activity Detection (VAD) and a sliding window approach are used. The GE2E model consists of 3 LSTM layers of 768 cells followed by a projection to 256 dimensions. During training, each batch contains S = 64 speakers and U = 10 utterances per speaker.

The Tacotron and WaveRNN models are trained on the VCTK English corpus [38] with 109 different speakers. To evaluate generalization performance, we consider three scenarios: seen speakers-seen sound quality (SS-SSQ), unseen speakers-seen sound quality (UNS-SSQ) and unseen speakers-unseen sound quality (UNS-USQ). Seen speakers refers to speakers already present in training, while unseen speakers are new speakers at test time. Sound quality refers to the recording conditions, such as recording equipment, reverberation etc. We train the network using 100 speakers, leaving 9 speakers for the UNS-SSQ scenario; these are chosen to be a mix of genders and to have enough unique utterances per speaker. The CMU-ARCTIC database [39] is used for the UNS-USQ scenario, with 2 male and 2 female speakers. Moreover, to overcome the limited linguistic variability in the VCTK data, we initially train the Tacotron model on the LJSpeech database as a "warm-start" training approach, similar to [36]. Code and sound samples can be found at https://dipjyoti92.github.io/SC-WaveRNN/.
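For illustration, the 25 ms / 10 ms log mel-spectrogram extraction described above could be performed along the following lines with librosa; the sampling rate, the number of mel bands and the flooring constant are assumptions not specified in the text.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_mels=40):
    """Log mel-spectrogram with 25 ms frames and 10 ms hop (sr and n_mels are assumptions)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame step
        n_mels=n_mels)
    return np.log(mel + 1e-6).T       # (frames, n_mels)
```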
6. Results and Discussion
In this section, we evaluate the performance of the vocoded speech, as shown in Table 1. To assess the effectiveness of speaker embeddings in SC-WaveRNN, the PESQ and STOI objective measures are computed from 50 random samples. We carry out evaluations on three conditions: SS-SSQ, UNS-SSQ and UNS-USQ. The purpose of each condition is to evaluate the proposed vocoder not only on seen or unseen speakers but also with respect to the quality of the recordings. As expected, seen scenarios perform better than unseen ones. However, we observe that SC-WaveRNN significantly improves both objective scores compared to baseline WaveRNN for all scenarios.

Table 1: Objective evaluation tests.

Methods        SS-SSQ           UNS-SSQ          UNS-USQ
               PESQ     STOI    PESQ     STOI    PESQ     STOI
WaveRNN        2.2575   0.8173  2.1497   0.7586  1.4850   0.8620
SC-WaveRNN
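For reference, objective scores of this kind can be computed with the openly available pesq and pystoi packages, roughly as sketched below; the file names, the wide-band (16 kHz) assumption and the trimming to equal length are illustrative choices, not the exact evaluation script used here.

```python
import soundfile as sf
from pesq import pesq    # ITU-T P.862 wrapper (python 'pesq' package)
from pystoi import stoi  # short-time objective intelligibility ('pystoi' package)

# Hypothetical file names; in practice these would be natural/vocoded pairs.
ref, fs = sf.read("natural.wav")
deg, _ = sf.read("vocoded.wav")

# Trim both signals to the same length before scoring.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

pesq_score = pesq(fs, ref, deg, "wb")            # wide-band mode assumes fs = 16 kHz
stoi_score = stoi(ref, deg, fs, extended=False)  # classical (non-extended) STOI
print(f"PESQ: {pesq_score:.4f}  STOI: {stoi_score:.4f}")
```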
Concerning the perceptual assessment of speech quality and speaker similarity, two separate listening tests are reported: a mean opinion score (MOS) test and an 'ABX' preference test. The subjects are asked to rate the naturalness of generated utterances on a five-point scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). In the ABX test, subjects have to decide whether a given reference sentence X is closer in speaker identity to sentence A or sentence B, which are samples obtained from the proposed and the baseline method, not necessarily in that order. Fifteen native and non-native English listeners participated in our listening tests. The evaluation results of both the MOS and 'ABX' tests are shown in Figure 5. Error bars represent 95% confidence intervals. For all seen and unseen scenarios, the MOS scores for the proposed SC-WaveRNN are much higher than those of the baseline WaveRNN (between 14% and 95% relative improvement). Under the same sound quality conditions (SS-SSQ and UNS-SSQ), although the proposed technique is preferred in terms of speaker similarity, the majority of responses fall on the 'same preference' option, which indicates similar speaker characteristics for both methods. In contrast, the analysis shows a significant preference score (92%) for the proposed SC-WaveRNN under unseen sound quality. We conclude that additional speaker information in the form of embeddings is effective for improving naturalness and speaker similarity, especially for unseen data, and is capable of achieving a truly universal vocoder. This is attributed to the fact that unseen scenarios are handled more efficiently by the model, since the additional embeddings are able to capture a broad spectrum of speaker characteristics. Moreover, SC-WaveRNN does not compromise the performance in seen conditions.
Figure 5:
Vocoder subjective listening test (MOS) for speech quality and preference test (%) for speaker similarity.

To evaluate the performance of the proposed zero-shot TTS, MOS and 'ABX' tests are employed, as depicted in Figure 6. We subjectively evaluate both the baseline [36] and our method by synthesizing sample utterances from seen and unseen speakers. Different sound qualities are not considered in the evaluation experiments of zero-shot TTS. As expected, a gap between seen and unseen speakers is visible: synthetic speech from seen speakers has slightly higher quality than that from unseen speakers. MOS scores indicate that the proposed TTS is superior in quality, with 19.2% and 14.5% relative improvements for seen and unseen speakers, respectively. We also found that the proposed TTS mimics speaker characteristics better and shows significant improvement under both conditions. With regard to speaker similarity, the proposed TTS obtains the majority of preferences, with 60% and 60.9%, compared to 15.5% and 32.6% for the baseline TTS, for seen and unseen speakers, respectively.
Figure 6:
Zero-shot TTS subjective listening test (MOS) for speech quality and preference test (%) for speaker similarity.
7. Conclusions
In this paper, we proposed a robust universal SC-WaveRNN vocoder that is capable of synthesizing high-quality speech. The system is conditioned on extracted speaker embeddings which cover a very diverse range of seen and unseen conditions. The main advantage of SC-WaveRNN is its high controllability, since it improves multi-speaker vocoder training along with better generalization ability by allowing reliable transfer to unseen speaker characteristics. Furthermore, speaker conditioning is typically more data-efficient and computationally less expensive than training separate models for each speaker. Subjective and objective evaluations revealed that the proposed method generates higher sound quality and speaker similarity than the baseline method. In addition, we extended our approach by devising an efficient zero-shot TTS system. We demonstrated that the proposed zero-shot TTS with a universal vocoder can improve the speaker similarity and naturalness of synthetic speech for seen and unseen speakers. In the future, we plan further experimentation on speaker embeddings and their effectiveness on unseen data.
Acknowledgements:

References

[1] T. Dutoit, An Introduction to Text-to-Speech Synthesis. Springer Science & Business Media, 1997, vol. 3.
[2] P. Taylor, Text-to-Speech Synthesis. Cambridge University Press, 2009.
[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[4] D. Paul, Y. Pantazis, and Y. Stylianou, "Non-parallel voice conversion using weighted generative adversarial networks," in Proc. Interspeech, 2019, pp. 659–663.
[5] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[6] S. King, "An introduction to statistical parametric speech synthesis," Sadhana, vol. 36, no. 5, pp. 837–852, 2011.
[7] Y. Qian, F. K. Soong, and Z.-J. Yan, "A unified trajectory tiling approach to high quality speech rendering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 280–290, 2012.
[8] T. Merritt, R. A. Clark, Z. Wu, J. Yamagishi, and S. King, "Deep neural network-guided unit selection synthesis," in Proc. ICASSP, 2016, pp. 5145–5149.
[9] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[10] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, no. 5-6, pp. 453–467, 1990.
[11] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[12] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[13] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[14] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[15] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in ISCA Speech Synthesis Workshop, pp. 125–125.
[16] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118–1122.
[17] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning, 2018, pp. 3918–3926.
[18] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in International Conference on Learning Representations, 2019.
[19] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning, 2018, pp. 2410–2419.
[20] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," in International Conference on Learning Representations, 2017.
[21] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. ICASSP, 2019, pp. 3617–3621.
[22] J. M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, 2019, pp. 5891–5895.
[23] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," arXiv preprint arXiv:1906.01083, 2019.
[24] B. Sisman, M. Zhang, and H. Li, "A voice conversion framework with tandem feature sparse representation and speaker-adapted WaveNet vocoder," in Proc. Interspeech, 2018, pp. 1978–1982.
[25] L. J. Liu, Z. H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. Interspeech, 2018, pp. 1983–1987.
[26] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 712–718.
[27] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, "Towards achieving robust universal neural vocoding," in Proc. Interspeech, 2019, pp. 4879–4883.
[28] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in Proc. ICASSP, 2018, pp. 4879–4883.
[29] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie et al., "Sample efficient adaptive text-to-speech," in International Conference on Learning Representations, 2018.
[30] J. Park, K. Zhao, K. Peng, and W. Ping, "Multi-speaker end-to-end speech synthesis," arXiv preprint arXiv:1907.04462, 2019.
[31] Y. Deng, L. He, and F. Soong, "Modeling multi-speaker latent space to improve neural TTS: Quick enrolling new speaker and enhancing premium voice," arXiv preprint arXiv:1812.05253, 2018.
[32] Q. Hu, E. Marchi, D. Winarsky, Y. Stylianou, D. Naik, and S. Kajarekar, "Neural text-to-speech adaptation from low quality public recordings," in Speech Synthesis Workshop, vol. 10, 2019.
[33] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," in Advances in Neural Information Processing Systems, 2018, pp. 10019–10029.
[34] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems, 2018, pp. 4480–4490.
[35] M. Chen, M. Chen, S. Liang, J. Ma, L. Chen, S. Wang, and J. Xiao, "Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding," in Proc. Interspeech, 2019, pp. 2105–2109.
[36] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings," in Proc. ICASSP, 2020, pp. 6184–6188.
[37] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[38] V. Christophe, Y. Junichi, and M. Kirsten, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,"