Online Speaker Adaptation for WaveNet-based Neural Vocoders
Qiuchen Huang, Yang Ai, Zhenhua Ling
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
E-mail: {qchuang, ay8067}@mail.ustc.edu.cn, [email protected]

Abstract—In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder is first constructed using a large speaker-verification dataset, which can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker. At the training stage, a speaker-aware WaveNet vocoder is then built using a multi-speaker dataset, which adopts both acoustic feature sequences and speaker embedding vectors as conditions. At the generation stage, we first feed the acoustic feature sequence from a test speaker into the speaker encoder to obtain the speaker embedding vector of the utterance. Then, both the speaker embedding vector and the acoustic features pass through the speaker-aware WaveNet vocoder to reconstruct speech waveforms. Experimental results demonstrate that our method can achieve better objective and subjective performance on reconstructing waveforms of unseen speakers than the conventional speaker-independent WaveNet vocoder.
Index Terms—WaveNet, neural vocoder, speech synthesis, speaker adaptation, speaker embedding vector
I. INTRODUCTION
In recent years, speech synthesis has become an essential technique for intelligent speech applications, such as audiobooks, customer service, speech translation, etc. At present, speech synthesis also faces more and more challenges, such as high quality, high efficiency, and better generalization ability toward multiple speakers.

Statistical parametric speech synthesis (SPSS) is one of the mainstream speech synthesis approaches, which is achieved by acoustic modeling and vocoder-based waveform generation. It has the advantages of smoothness, flexibility and coherence. Acoustic models predict acoustic features from input linguistic features and can be built based on hidden Markov models (HMM) [1], neural networks or other deep learning methods. Then, vocoders reconstruct speech waveforms from the predicted acoustic features. Traditional vocoders usually adopt the source-filter signal processing model, i.e., passing a spectrally flat excitation (impulse train or noise) through a linear vocal tract filter, to reconstruct speech waveforms. Represented by STRAIGHT [2] and WORLD [3], these vocoders are convenient and practical but have some deficiencies. For example, the process of real speech production contains nonlinear effects, which cannot be reflected by linear filtering. In addition, spectral details and phase information are lost in these vocoders.

Recently, deep learning models have been widely applied to various signal processing tasks. In the field of speech synthesis, vocoders based on neural networks have also been studied. WaveNet [4], a non-linear autoregressive waveform generation model, has been proposed, and WaveNet-based neural vocoders [5] outperformed traditional vocoders on the naturalness of generated speech. Some variants, including WaveRNN [6], FloWaveNet [7], ClariNet [8] and WaveGlow [9], have also been proposed to improve the performance and efficiency of WaveNet vocoders. However, all of the above neural vocoders rely on speaker-dependent model training. For some applications such as personalized and expressive speech synthesis, the training data of a target speaker is usually limited. Besides, separate models need to be stored for different speakers, which increases the footprint of speech synthesis systems when adding speakers.

The methods of acoustic model adaptation have been well studied in traditional SPSS [10]. For building neural vocoders, speaker adaptation methods have also been proposed [11], [12] to avoid the demand for large speaker-dependent training datasets. Liu et al. [11] initialized the WaveNet model with a multi-speaker corpus and then fine-tuned it with a small amount of data from the target speaker. In this method, speaker embedding vectors were learnt simultaneously with WaveNet parameters at the initialization stage. Besides, a quasi-periodic WaveNet vocoder (QPNet) was also proposed [12], whose dilated convolution structure is adjusted according to the fundamental frequency to enhance pitch controllability for better speaker adaptation.

These speaker adaptation methods of neural vocoders still require a certain amount of adaptation data and an extra adaptation process for each target speaker. Thus, they cannot achieve fully speaker-independent generation of speech waveforms, which is the advantage of traditional source-filter-based vocoders [2], [3]. Aiming at avoiding this deficiency, the methods of building speaker-independent neural vocoders have been studied using WaveNet [13] and WaveRNN [6] or combining speech production mechanisms [14].
Fig. 1. The training and synthesis procedures of our proposed model: (a) the training stage; (b) the synthesis stage.

The speaker-independent WaveNet vocoder [13] used a multi-speaker corpus to train a conditional WaveNet directly. However, its performance degraded significantly on unseen speakers compared with its speaker-dependent counterpart [5]. The reason is that we are unable to cover the voice characteristics of all possible unseen speakers with the training set of vocoders, which makes the built vocoder models prone to overfit to the speaker characteristics in the training set.
On the other hand, the techniques of representing speaker-specific information using speaker identity embeddings have also been investigated. There are mainly three types of speaker identity embeddings: a speaker-code vector (e.g., a one-hot vector), an acoustic-driven vector extracted using external models such as i-vectors [15] or d-vectors [16], [17], and an acoustic-driven vector based on encoders jointly trained with acoustic models [17], [18]. These speaker identity embeddings have been widely used in speaker identification and speech recognition tasks [19]–[22]. They have also achieved good performance on building acoustic models for multi-speaker speech synthesis [15], [16], [23]–[25].

In this paper, we integrate d-vectors into neural vocoders and propose an online speaker adaptation method in order to improve the performance of speaker-independent WaveNet vocoders when dealing with unseen speakers. First, a speaker encoder is trained using a large speaker-verification dataset. For each utterance in the multi-speaker vocoder training set, a speaker embedding vector, i.e., a d-vector, is extracted using the built speaker encoder. Then, these speaker embedding vectors are utilized as auxiliary features to train a speaker-aware WaveNet vocoder. At the generation stage, we send the acoustic feature sequence of a test utterance into the speaker encoder to extract its speaker embedding vector, which is combined with the acoustic features and passed through the speaker-aware WaveNet vocoder for waveform reconstruction. Experimental results show that our proposed method can synthesize speech with better objective and subjective quality than the traditional speaker-independent WaveNet vocoder.

Our paper is organized as follows. Section II introduces the details of our proposed method. Section III describes the vocoders we built for comparison and the evaluation results. Section IV is the conclusion.

II. METHODS
As shown in Fig. 1, our proposed model is composed of two separately trained neural networks: a recurrent speaker encoder that computes the speaker embedding vector from the input acoustic features of each utterance, and an autoregressive WaveNet vocoder that utilizes the concatenation of the speaker embedding and acoustic features as the condition for waveform reconstruction.
A. Speaker Encoder
The speaker encoder is employed to extract speaker embedding vectors from the acoustic features of target speakers for assisting the vocoder network to generate waveforms. The extracted speaker embedding vectors are expected to reflect the speaker characteristics of the input acoustic features rather than text contents or background noise.

For building the speaker encoder, we refer to a previous study [21] which proposed an efficient and accurate text-independent speaker verification model based on the generalized end-to-end (GE2E) loss. This model mapped the acoustic features of an utterance to a fixed-dimensional speaker embedding vector, known as a d-vector [16], [17]. By optimizing the GE2E loss, the d-vectors of training utterances from the same speaker achieved high cosine similarity, while those of utterances from different speakers became far apart in the embedding space.

In our implementation, the speaker encoder is a 3-layer LSTM network with 768 units and a projection size of 256. The embedding vector (d-vector) is defined as the network output at the last frame, and its dimension is the same as the projection size of the LSTM network. At the inference stage, for each utterance we apply a sliding window of a fixed 160 frames with 50% overlap. The d-vector of each window is first computed, and the final utterance-wise d-vector is generated by averaging the window-wise ones with L2 normalization.

Different from previous work for speaker verification [21], the acoustic features used to train the speaker encoder here are consistent with the input features of our WaveNet vocoder, including 40-dimensional mel-cepstra, an energy, an F0 and a voiced/unvoiced (V/UV) flag for each frame. STRAIGHT [2] is used for natural acoustic feature extraction. The window size is 400 samples (25 ms) and the window shift is 80 samples (5 ms).
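As an illustration of this inference procedure, the following is a minimal PyTorch sketch (not the authors' implementation) of utterance-level d-vector extraction with the 160-frame sliding window and 50% overlap described above; the class and function names are hypothetical, and applying L2 normalization both per window and after averaging is one reasonable reading of the description.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """3-layer LSTM with 768 units and a 256-dimensional projection (cf. [21])."""
    def __init__(self, feat_dim=43, hidden_size=768, proj_size=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_size, num_layers=num_layers,
                            proj_size=proj_size, batch_first=True)

    def forward(self, feats):              # feats: (1, n_frames, feat_dim)
        out, _ = self.lstm(feats)
        return out[:, -1, :]               # d-vector = output at the last frame

def utterance_dvector(encoder, feats, win=160, hop=80):
    """Embed 160-frame windows (50% overlap), then L2-normalize and average."""
    n_frames = feats.shape[0]
    starts = range(0, max(n_frames - win, 0) + 1, hop)
    dvecs = []
    with torch.no_grad():
        for s in starts:
            window = torch.as_tensor(feats[s:s + win], dtype=torch.float32).unsqueeze(0)
            dvecs.append(F.normalize(encoder(window), dim=-1))
    d = torch.cat(dvecs, dim=0).mean(dim=0)
    return F.normalize(d, dim=0)           # final 256-dimensional utterance-wise d-vector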
B. Speaker-Aware WaveNet Vocoder
WaveNet [4] is a deep autoregressive convolutional neural network which can directly generate high-fidelity audio signals sample by sample. The WaveNet vocoder [5] models the joint distribution of waveform samples given auxiliary acoustic features, which is factorized as a product of conditional probabilities as

p(\mathbf{x} | \mathbf{h}) = \prod_{t=1}^{T} p(x_t | x_1, \ldots, x_{t-1}, \mathbf{h}),   (1)

where x_t is the waveform value at the t-th sample, T is the waveform length, and the condition \mathbf{h} is the sequence of acoustic features. For modeling the conditional probabilities, WaveNet employs a stack of dilated causal convolution layers. The history waveforms and condition features pass through these convolution layers with gated activation functions, and the posterior probability of the current waveform sample with µ-law quantization [26] is predicted using a softmax output layer.

In our proposed method, a speaker-aware WaveNet vocoder is built, which means that the condition \mathbf{h} contains not only the acoustic features of the input utterance but also the speaker embedding vector extracted from these acoustic features by the speaker encoder. By introducing speaker embedding vectors, we expect to improve the speaker independency and the speaker generalization ability of the WaveNet model. In our implementation, the speaker embedding vector is concatenated with the acoustic features at each frame. Then, they pass through a conditional network consisting of a 1 × 1 convolution layer, a stack of 4 dilated convolution layers and an upsampling layer before acting as the sample-wise local conditions of the WaveNet model.

As shown in Fig. 1 (a), a multi-speaker dataset is adopted to train the speaker-aware WaveNet vocoder. Acoustic features are first extracted from all training utterances by STRAIGHT [2]. Then, a speaker embedding vector is extracted from each utterance using the built speaker encoder. It is expected that the speaker embedding vectors can capture speaker-related information in the training set. Finally, the parameters of the speaker-aware WaveNet model are estimated under the cross-entropy criterion using the waveforms, acoustic features and speaker embedding vectors of the training set.

At the synthesis stage, given the acoustic features of a test utterance, we first extract its speaker embedding vector as shown in Fig. 1 (b). Then, the extracted speaker embedding vector is concatenated with the acoustic features for waveform reconstruction. Here, the online adaptation of the WaveNet vocoder is achieved because the speaker embedding vector is calculated for each input utterance from an arbitrary speaker, and it is not necessary to conduct model adaptation offline using pre-collected data.

III. EXPERIMENTS
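To make the conditioning pipeline concrete, the following is a rough PyTorch sketch, under our own assumptions, of how the 256-dimensional speaker embedding can be repeated per frame, concatenated with the 43-dimensional acoustic features, and turned into sample-wise local conditions. The 1 × 1 convolution, the four dilated layers, the 80-dimensional output and the repetition-based upsampling follow the description above, while the kernel sizes, dilation rates and residual connections are assumptions rather than the authors' exact configuration.

import torch
import torch.nn as nn

class ConditionNetwork(nn.Module):
    def __init__(self, acoustic_dim=43, dvec_dim=256, cond_dim=80,
                 n_dilated=4, frame_shift=80):
        super().__init__()
        self.frame_shift = frame_shift     # 80 samples per frame (5 ms at 16 kHz)
        self.pre = nn.Conv1d(acoustic_dim + dvec_dim, cond_dim, kernel_size=1)  # 1x1 convolution
        # Four dilated (non-causal) convolution layers over frames; dilation rates are assumed here.
        self.dilated = nn.ModuleList(
            [nn.Conv1d(cond_dim, cond_dim, kernel_size=3, dilation=2 ** i, padding=2 ** i)
             for i in range(n_dilated)])

    def forward(self, acoustic, dvector):
        # acoustic: (B, n_frames, acoustic_dim), dvector: (B, dvec_dim)
        n_frames = acoustic.size(1)
        dvec = dvector.unsqueeze(1).expand(-1, n_frames, -1)      # repeat the embedding per frame
        h = torch.cat([acoustic, dvec], dim=-1).transpose(1, 2)   # (B, 299, n_frames)
        h = self.pre(h)                                           # (B, 80, n_frames)
        for conv in self.dilated:
            h = torch.tanh(conv(h)) + h                           # assumed residual connections
        # Upsample to the sample rate by repeating each frame's vector within the frame.
        return h.repeat_interleave(self.frame_shift, dim=2)       # (B, 80, n_frames * 80)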
A. Datasets
The VCTK corpus [27] was adopted to build the vocoders in our experiments. This dataset was downsampled to 16 kHz and contained 44 hours of utterances recorded from 109 native speakers of English with various accents. The corpus was split into three disjoint sets for our experiments. 34,977 utterances from 99 speakers and 288 utterances from 10 unseen speakers were chosen to construct the training set and the test set of the vocoders, respectively. The remaining 3,028 utterances from the 10 unseen speakers were used as the offline adaptation set.

For building the speaker encoder, a dataset with more speakers is necessary in order to deal with diversified speakers. Thus, the subsets of the Librispeech and Voxceleb1 corpora used by a previous study on speaker diarization [22] were adopted here. The train-other-500 subset of Librispeech [28] contained 148,688 utterances from 1,166 speakers, and the dev subset of Voxceleb1 [29] contained 147,935 utterances from 1,211 speakers. We used the same test set as that of the vocoders to evaluate the performance of the speaker encoders.
B. Model Construction
To investigate the effectiveness of our proposed method, we built three types of WaveNet-based vocoders for comparison using the same VCTK training set. Their configurations are described as follows.
1) Speaker Independent (SI) Vocoder:
This vocoder was built by training a unified WaveNet model without speaker embeddings [13] and acted as the baseline in our experiments. Its WaveNet configurations were the same as those of our proposed models.
TABLE I
SPEAKER VERIFICATION EERs (%) OF TWO SPEAKER ENCODERS ON UNSEEN SPEAKERS.

System   Speaker Encoder Training Datasets   EER (%)
OSA1     Librispeech, Voxceleb1              2.96
OSA2     Librispeech, Voxceleb1, VCTK        1.07
2) Offline Speaker Adaptation (SA) Vocoders:
For each speaker in the test set, five speaker-dependent WaveNet vocoders were built by fine-tuning the SI vocoder using 20%, 40%, 60%, 80%, and 100% of the adaptation data of this speaker, respectively.
3) Online Speaker Adaptation (OSA) Vocoders:
As shown in Table I, two proposed vocoders were built using speaker encoders estimated with different training sets. For the OSA1 vocoder, we used the Librispeech and Voxceleb1 datasets introduced above to train the speaker encoder. For the OSA2 vocoder, we further added the 99 speakers in the VCTK training set to train the speaker encoder. The outputs of both speaker encoders were 256-dimensional speaker embedding vectors.

The built WaveNet model had 4 convolutional blocks. Each block had 10 dilated causal convolution layers whose filter width was 2 and whose dilation coefficients were {1, 2, 4, ..., 512}. In the gated activation units, the number of gate channels was 100. The numbers of residual channels and skip channels were 100 and 256 in the residual architectures, respectively. The waveform samples were quantized by 8-bit µ-law. In the condition network, the 299-dimensional condition first passed through a 1 × 1 convolution layer and the stack of 4 dilated convolution layers. Finally, the 80-dimensional output was connected to the gated activation units after upsampling, and the upsampling was performed by repeating the output within each frame. The training target was to minimize the cross-entropy, and an Adam optimizer [30] was adopted to update the model parameters. The initial learning rate was 0.0001 and the learning rate was halved every 100,000 steps. The model was trained for 400,000 steps in total. Models were trained and evaluated on a single Nvidia 1080Ti GPU.
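As a side note, the 8-bit µ-law quantization mentioned above (µ = 255) maps waveform samples in [-1, 1] to 256 classes for the softmax output layer; a minimal NumPy sketch of the standard companding transform and its inverse is given below (the rounding convention is our assumption).

import numpy as np

MU = 255  # 8-bit mu-law, i.e. 256 quantization levels

def mulaw_encode(x, mu=MU):
    """Compand waveform samples in [-1, 1] and quantize them to integers in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companded value in [-1, 1]
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, mu=MU):
    """Map class indices in [0, mu] back to approximate waveform values in [-1, 1]."""
    y = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu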
C. Performance of Speaker Encoders
To measure the performance of the built speaker encoders, we computed the equal error rates (EERs) of speaker verification on the VCTK test set, and the results are shown in Table I. We enrolled the utterances from the 10 speakers in the VCTK test set. For both speaker encoders, the enrolled and verification speakers were unseen at the training stage. The EER was calculated by pairing each test utterance with each enrolled speaker.

From Table I, we can see that the EERs of both speaker encoders were relatively low, which indicates that both speaker encoders can extract speaker-related information from acoustic feature sequences effectively. Besides, the speaker encoder of OSA2 achieved a lower EER than that of OSA1 because the VCTK training set was also utilized to train the speaker encoder of OSA2, which may reduce the mismatch between the training and test data of the encoder.
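The EER computation itself is standard; the sketch below (our own, using scikit-learn) scores every test utterance against every enrolled speaker centroid by cosine similarity and reads off the operating point where the false acceptance and false rejection rates are equal.

import numpy as np
from sklearn.metrics import roc_curve

def speaker_verification_eer(test_dvecs, test_labels, centroids, enrolled_ids):
    """test_dvecs: (N, 256) d-vectors; test_labels: true speaker id per utterance;
    centroids: (S, 256) enrolled speaker centroids; enrolled_ids: ids of the S speakers."""
    scores, targets = [], []
    for d, label in zip(test_dvecs, test_labels):
        for c, sid in zip(centroids, enrolled_ids):
            scores.append(np.dot(d, c) / (np.linalg.norm(d) * np.linalg.norm(c)))
            targets.append(int(label == sid))          # 1 for a target trial, 0 for an impostor trial
    fpr, tpr, _ = roc_curve(targets, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))              # operating point where FAR is closest to FRR
    return (fpr[idx] + fnr[idx]) / 2.0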
D. Objective Evaluation Results
Five objective metrics comparing the waveforms generated by the vocoders with natural references were used in our objective evaluation, including signal-to-noise ratio (SNR), root mean square error of log amplitude spectra (RMSE-LAS), mel-cepstral distortion (MCD), root mean square error of F0 (RMSE-F0), and voiced/unvoiced error rate (V/UV error). The detailed formulae for calculating these metrics can be found in previous studies [5], [13].

The test set evaluation results are shown in Table II. Regarding the five SA vocoders, we can see that their performance on SNR, RMSE-F0 and V/UV error rate was improved when using more adaptation data, while their spectral distortions (i.e., RMSE-LAS and MCD) were almost the same. Comparing the proposed OSA vocoders with the SA vocoders, we can see that the spectral distortions of the OSA models were comparable with or better than those of the SA models, while the SNRs of the OSA models were worse than those of the SA models, no matter how much adaptation data was used by the SA models. The F0-RMSEs of the OSA models were close to that of the SA model using 40% adaptation data, and the V/UV error rate of OSA1 was comparable with that of the SA model using 20% adaptation data. Furthermore, although OSA2 achieved a lower EER than OSA1 as shown in Table I, its objective performance was not clearly better than that of OSA1, except on the metric of F0-RMSE.
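The exact formulae follow [5], [13]; for orientation only, below are hedged NumPy sketches of two of the metrics (waveform SNR and MCD) in their commonly used forms, which may differ in details from the definitions used in the paper.

import numpy as np

def snr_db(natural, generated):
    """Waveform signal-to-noise ratio in dB for time-aligned signals of equal length."""
    noise = natural - generated
    return 10.0 * np.log10(np.sum(natural ** 2) / np.sum(noise ** 2))

def mcd_db(mcep_ref, mcep_gen):
    """Mel-cepstral distortion in dB between frame-aligned mel-cepstra (n_frames x order),
    excluding the 0-th (energy) coefficient, as is common practice."""
    diff = mcep_ref[:, 1:] - mcep_gen[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))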
E. Subjective Evaluation Results
For comparing the subjective performance of the speech generated by the SI, SA and OSA2 vocoders, four groups of ABX preference tests were conducted on the crowdsourcing platform of Amazon Mechanical Turk with anti-cheating considerations. In each test, 20 utterances generated by two comparative vocoders were randomly picked out from the test set. Each pair of voices was evaluated in random order. At least 30 English native speakers were asked to judge which utterance in each pair had better naturalness or sounded more similar to the natural reference. In addition to average preference scores, the p-value of a t-test was also calculated to measure the significance of the difference between two comparative vocoders.

The subjective evaluation results are shown in Table III. We can see that the OSA2 vocoder achieved significantly better naturalness of reconstructed speech than the SI vocoder. The results also show that the subjective performance of OSA1 was still not as good as that of OSA2. One possible reason is the mismatch between the training and test data of the speaker encoder used by OSA1, which led to the higher EER and F0-RMSE of OSA1 as shown in Tables I and II.

TABLE II
OBJECTIVE EVALUATION RESULTS ON TEST SET. HERE, SA x% STANDS FOR THE VOCODER USING x% ADAPTATION DATA OF EACH SPEAKER.

Model     SNR (dB)  RMSE-LAS (dB)  MCD (dB)  RMSE-F0 (cent)  V/UV error rate (%)
SI        3.05      8.60           2.03      69.88           7.71
SA 20%    3.45      8.49           2.00      62.90           6.27
SA 40%    3.56      8.51           2.00      53.97           6.04
SA 60%    3.69      8.47           2.00      49.82           5.90
SA 80%    3.76      8.48           2.00      47.27           5.96
SA 100%
OSA1      3.29
OSA2

TABLE III
AVERAGE PREFERENCE SCORES (%) ON NATURALNESS AND SIMILARITY AMONG SYSTEMS, WHERE N/P STANDS FOR "NO PREFERENCE" AND p DENOTES THE p-VALUE OF A T-TEST BETWEEN TWO VOCODERS.

          SI    SA    OSA1  OSA2  N/P   p

TABLE IV
THE PEARSON'S CORRELATION COEFFICIENTS BETWEEN UTTERANCE DURATIONS AND EVALUATION METRICS, WHERE C STANDS FOR THE COEFFICIENTS AND p STANDS FOR THE p-VALUE OF THE SIGNIFICANCE TEST.

     SNR    RMSE-LAS  MCD    RMSE-F0  V/UV Error Rate
C    0.23   0.01      0.06   -0.08    0.09
p

F. Correlation Analysis between Model Performance and Utterance Duration
A correlation analysis was conducted to investigate the relationship between the objective performance of the OSA2 vocoder and the duration of test utterances. 288 utterances were randomly selected from the VCTK test set and their durations varied from less than 2 seconds to more than 8 seconds. These utterances were reconstructed into waveforms using the OSA2 vocoder, and the five metrics used in Section III-D were calculated for each utterance. The scatter diagrams and the Pearson's correlation coefficients between utterance durations and evaluation metrics are shown in Fig. 2 and Table IV, respectively. We can see that although there were no correlations between utterance durations and most of the evaluation metrics, the SNR metric had a weak but significant correlation with duration. This indicates that the performance of our model is somewhat influenced by the duration of the test utterances, and that longer generated utterances are more likely to achieve better quality.
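Such a correlation analysis can be reproduced with SciPy; the sketch below computes Pearson's r and its p-value for the duration-SNR pair, with illustrative placeholder values rather than the actual measurements.

from scipy.stats import pearsonr

# Per-utterance statistics collected over the reconstructed test utterances;
# the values below are placeholders, not the measurements reported in Table IV.
durations = [1.8, 3.2, 4.5, 6.1, 8.3]    # utterance durations in seconds
snrs = [2.4, 3.0, 3.3, 3.6, 3.9]         # per-utterance SNR (dB) of the OSA2 vocoder

r, p_value = pearsonr(durations, snrs)
print(f"Pearson's r = {r:.2f}, p = {p_value:.3g}")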
IV. CONCLUSIONS
In this paper, we have proposed an online speaker adaptation method based on a discriminatively trained speaker encoder and a speaker-aware WaveNet vocoder to improve the performance of traditional speaker-independent neural vocoders. To demonstrate the effectiveness of our proposed model, we also built a speaker-independent (SI) vocoder and offline speaker adaptation (SA) vocoders for comparison in our experiments. Experimental results have demonstrated that our method can achieve lower distortion and better naturalness of reconstructed waveforms than the SI vocoder when dealing with unseen speakers. Although this paper focuses on WaveNet vocoders, it is also possible to apply our proposed online adaptation method to other neural vocoders, which will be a task of our future work.
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (Grant No. 61871358).
REFERENCES

[1] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[2] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[3] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in Proc. 9th ISCA Speech Synthesis Workshop (SSW), 2016, p. 125.
[5] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Interspeech, 2017, pp. 1118–1122.
[6] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in ICML, 2018.
[7] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," arXiv preprint arXiv:1811.02155, 2018.
[8] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in International Conference on Learning Representations, 2018.
[9] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
[10] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 66–83, 2009.
[11] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Interspeech, 2018, pp. 1983–1987.
[12] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, "Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation," arXiv preprint arXiv:1907.00797, 2019.
[13] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in IEEE ASRU. IEEE, 2017, pp. 712–718.
[14] L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, P. Alku et al., "Speaker-independent raw waveform model for glottal excitation," in Interspeech, 2018.
[15] Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King, "A study of speaker adaptation for DNN-based speech synthesis," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] R. Doddipatla, N. Braunschweiler, and R. Maia, "Speaker adaptation in DNN-based speech synthesis using d-vectors," in Interspeech, 2017, pp. 3404–3408.
[17] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[19] O. Abdel-Hamid and H. Jiang, "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," in ICASSP. IEEE, 2013, pp. 7942–7946.
[20] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.
[21] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.
[22] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6301–6305.
[23] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems, 2018, pp. 4480–4490.
[24] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems, 2017, pp. 2962–2970.
[25] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," in International Conference on Learning Representations, 2018.
[26] ITU-T Recommendation, "Pulse code modulation (PCM) of voice frequencies," ITU, 1988.
[27] C. Veaux, J. Yamagishi, K. MacDonald et al., "Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[29] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. Interspeech 2017, pp. 2616–2620, 2017.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.