Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
Giuseppe Ruggiero*, Enrico Zovato*, Luigi Di Caro*, Vincent Pollet†
* Università degli Studi di Torino   † Cerence Inc.
ABSTRACT
Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. To this end, a deep neural network is usually trained using a corpus of several hours of recorded speech from a single speaker. Producing the voice of a speaker other than the one learned is expensive and requires a large effort, since it is necessary to record a new dataset and retrain the model. This is the main reason why TTS models are usually single-speaker. The proposed approach aims to overcome these limitations by building a system able to model a multi-speaker acoustic space. This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.
Index Terms — text-to-speech, deep learning, multi-speaker speech synthesis, speaker embedding, transfer learning
1. INTRODUCTION
Text-to-Speech (TTS) synthesis, the process of generating natural speech from text, remains a challenging task despite decades of investigation. Nowadays there are several TTS systems able to produce synthetic voices very close to human ones. Unfortunately, many of these systems learn to synthesize text only with a single voice. The goal of this work is to build a TTS system which can generate, in a data-efficient manner, natural speech for a wide variety of speakers, not necessarily seen during the training phase. The activity that allows the creation of this type of model is called Voice Cloning and has many applications, such as restoring the ability to communicate naturally to users who have lost their voice or customizing digital assistants such as Siri.

Over time, there has been significant interest in end-to-end TTS models trained directly from text-audio pairs. Tacotron 2 [1] used WaveNet [2] as a vocoder to invert spectrograms generated by a sequence-to-sequence with attention [3] model architecture that encodes text and decodes spectrograms, obtaining a naturalness close to the human one. It only supported a single speaker. Gibiansky et al. [4] proposed a multi-speaker variation of Tacotron able to learn a low-dimensional speaker embedding for each training speaker. Deep Voice 3 [5] introduced a fully convolutional encoder-decoder architecture which supports thousands of speakers from LibriSpeech [6]. However, these systems only support synthesis of voices seen during training, since they learn a fixed set of speaker embeddings. VoiceLoop [7] proposed a novel architecture which can generate speech from voices unseen during training, but it requires tens of minutes of speech and transcripts of the target speaker. In recent extensions, only a few seconds of speech per speaker are needed to generate new speech in that speaker's voice. Nachmani et al. [8], for example, extended VoiceLoop with a target speaker encoding network that predicts the speaker embedding directly from a spectrogram. This network is jointly trained with the synthesis network to ensure that embeddings predicted from utterances by the same speaker are closer than embeddings computed from different speakers. Jia et al. [9] proposed a speaker encoder model similar to [8], except that they used an independently-trained network, exploring transfer learning from a pre-trained speaker verification model towards the synthesis model.

This work is similar to [9], but it introduces different architectures and uses a new transfer learning technique still based on a pre-trained speaker verification model, exploiting utterance embeddings rather than speaker embeddings. In addition, we use a different strategy to condition the speech synthesis on the voice of speakers not observed before, and we compare several neural architectures for the speaker encoder model. The paper is organized as follows: Section 2 describes the model architecture and its formal definition; Section 3 reports the experiments carried out to evaluate the proposed solution; finally, conclusions are reported in Section 4.
2. MODEL ARCHITECTURE
Following [9], the proposed system consists of three components: a speaker encoder, which computes a fixed-dimensional embedding vector from a few seconds of reference speech of a target speaker; a synthesizer, which predicts a mel spectrogram from an input text and an embedding vector; and a neural vocoder, which infers time-domain waveforms from the mel spectrograms generated by the synthesizer. At inference time, the speaker encoder takes as input a short reference utterance of the target speaker and generates an embedding vector according to its internal learned space of speaker characteristics. The synthesizer takes as input a phoneme (or grapheme) sequence and generates a mel spectrogram, conditioned on the speaker encoder embedding vector. Finally, the vocoder takes the output of the synthesizer and generates the speech waveform. This is illustrated in Figure 1.
Fig. 1. High level overview of the three components of the system.

Consider a dataset of N speakers, each of which has M utterances in the time domain. Let us denote the j-th utterance of the i-th speaker as u_{ij} and the features extracted from the j-th utterance of the i-th speaker as x_{ij} (1 ≤ i ≤ N and 1 ≤ j ≤ M). We chose the mel spectrogram as feature vector x_{ij}.

The speaker encoder E has the task of producing meaningful embedding vectors that characterize the voices of the speakers. It computes the embedding vector e_{ij} corresponding to the utterance u_{ij} as:

e_{ij} = E(x_{ij}; w_E)    (1)

where w_E represents the encoder model parameters. We call e_{ij} the utterance embedding. In addition to defining an embedding at the utterance level, we can also define the speaker embedding:

c_i = \frac{1}{n} \sum_{j=1}^{n} e_{ij}    (2)

In [9], the synthesizer S predicts x_{ij} given c_i and t_{ij}, the transcript of the utterance u_{ij}:

\hat{x}_{ij} = S(c_i, t_{ij}; w_S)    (3)

where w_S represents the synthesizer model parameters. In our approach, we propose to use the utterance embedding rather than the speaker embedding:

\hat{x}_{ij} = S(e_{ij}, t_{ij}; w_S)    (4)

We will motivate this choice in Paragraph 2.4. Finally, the vocoder V generates u_{ij} given \hat{x}_{ij}. So we have:

\hat{u}_{ij} = V(\hat{x}_{ij}; w_V)    (5)

where w_V represents the vocoder model parameters.

This system could be trained in an end-to-end mode, trying to optimize the following objective function:

\min_{w_E, w_S, w_V} L_V(u_{ij}, V(S(E(x_{ij}; w_E), t_{ij}; w_S); w_V))    (6)

where L_V is a loss function in the time domain. However, this requires training the three models on the same dataset; moreover, the convergence of the combined model could be hard to reach. To overcome this drawback, the synthesizer can be trained independently to directly predict the mel spectrogram x_{ij} of a target utterance u_{ij}, optimizing the following objective function:

\min_{w_S} L_S(x_{ij}, S(e_{ij}, t_{ij}; w_S))    (7)

where L_S is a loss function in the time-frequency domain. It is necessary to have a pre-trained speaker encoder model available to compute the utterance embedding e_{ij}.

The vocoder can be trained either on the groundtruth mel spectrograms or on the mel spectrograms predicted by the synthesizer:

\min_{w_V} L_V(u_{ij}, V(x_{ij}; w_V))    or    \min_{w_V} L_V(u_{ij}, V(\hat{x}_{ij}; w_V))    (8)

where L_V is a loss function in the time domain. In the second case, a pre-trained synthesizer model is needed.

While the definition of the objective function is quite simple for both the synthesizer and the vocoder, this is unfortunately not the case for the speaker encoder. The encoder has no labels to be trained on, because its task is only to create the space of characteristics needed to produce the embedding vectors. The Generalized End-to-End (GE2E) [10] loss brings a solution to this problem and allows the speaker encoder to be trained independently. Consequently, we can define the following objective function:

\min_{w_E} L_G(S; w_E) = \sum_{j,i} L(e_{ji})    (9)

where S represents a similarity matrix and L_G is the GE2E loss function.
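To make Equations (1), (4) and (5) concrete, the following is a minimal sketch of the inference-time pipeline, assuming each component is available as a pre-trained PyTorch module with the interfaces described above; module names, tensor shapes and signatures are illustrative and not the authors' code.

```python
import torch

@torch.no_grad()
def clone_voice(speaker_encoder, synthesizer, vocoder, ref_mel, phonemes):
    """Generate speech in the voice of a reference utterance.

    ref_mel:  (frames, 40) mel spectrogram of the reference utterance x_ij
    phonemes: (length,) integer phoneme ids of the input text t_ij
    """
    # Eq. (1): utterance embedding e_ij = E(x_ij; w_E)
    e = speaker_encoder(ref_mel.unsqueeze(0))          # (1, 256)
    # Eq. (4): mel prediction conditioned on the utterance embedding
    mel_hat = synthesizer(phonemes.unsqueeze(0), e)    # (1, T, 80)
    # Eq. (5): waveform generation from the predicted mel spectrogram
    wav_hat = vocoder(mel_hat)                         # (1, samples)
    return wav_hat.squeeze(0)
```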
The speaker encoder must be able to produce an embedding vector that meaningfully represents speaker characteristics in the transformed space, starting from a target speaker's utterance. Furthermore, the model should identify these characteristics using a short speech signal, regardless of its phonetic content and background noise. This can be achieved by training a neural network model on a text-independent speaker verification task that optimizes the GE2E loss, so that embeddings of utterances from the same speaker have high cosine similarity, while embeddings of utterances from different speakers are far apart in the embedding space.

The network maps a sequence of mel spectrogram frames to a fixed-dimensional embedding vector, known as d-vector [11, 12]. Input mel spectrograms are fed to a network consisting of one Conv1D [13] layer of 512 units followed by a stack of 3 GRU [14] layers of 512 units, each followed by a linear projection of 256 dimensions. Following [9], the final embedding dimension is 256 and it is created by L2-normalizing the output of the top layer at the final frame. This is shown in Figure 2. We found that this architecture was the best among those tried and tested, as we will see in Section 3.

Fig. 2. Speaker encoder model architecture. Input is composed of a time sequence of dimension 40. The last linear layer takes the hidden state of the last GRU layer as input.

During the training phase, all the utterances are split into partial utterances that are 1.6 seconds long (160 frames). At inference time as well, the input utterance is split into segments of 1.6 seconds with 50% overlap and the model processes each segment individually. Following [9, 10], the final utterance-wise d-vector is generated by L2-normalizing the window-wise d-vectors and taking their element-wise average.
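A minimal PyTorch sketch of this encoder and of the windowed inference described above, assuming 40-channel mel inputs; the convolution kernel size and any other unstated details are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdvancedGRUEncoder(nn.Module):
    """One Conv1D layer (512 filters) followed by three GRU layers (512 units),
    each with a 256-dim linear projection; the d-vector is the L2-normalized
    projection of the last frame."""

    def __init__(self, n_mels=40, hidden=512, proj=256, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size, padding=kernel_size // 2)
        self.grus = nn.ModuleList(
            [nn.GRU(hidden if i == 0 else proj, hidden, batch_first=True) for i in range(3)]
        )
        self.projs = nn.ModuleList([nn.Linear(hidden, proj) for _ in range(3)])

    def forward(self, mels):                                  # mels: (batch, frames, n_mels)
        x = self.conv(mels.transpose(1, 2)).transpose(1, 2)   # (batch, frames, hidden)
        for gru, proj in zip(self.grus, self.projs):
            x, _ = gru(x)
            x = proj(x)
        # d-vector: L2-normalized projection of the final frame
        return F.normalize(x[:, -1, :], p=2, dim=1)

def embed_utterance(encoder, mel, win=160, hop=80):
    """Split a (frames, 40) mel spectrogram into 1.6 s windows with 50% overlap,
    embed each window, then average and re-normalize the window-wise d-vectors."""
    mel = torch.as_tensor(mel, dtype=torch.float32)
    starts = range(0, max(1, mel.size(0) - win + 1), hop)
    windows = torch.stack([mel[s:s + win] for s in starts])
    with torch.no_grad():
        d = encoder(windows)                                  # (n_windows, 256)
    return F.normalize(d.mean(dim=0, keepdim=True), p=2, dim=1).squeeze(0)
```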
The synthesizer component of the system is a sequence-to-sequence model with attention [1, 3] which is trained on pairs of text-derived token sequences and audio-derived mel spectrogram sequences. Furthermore, the network is trained in a transfer learning configuration (see Paragraph 2.4), using an independently-trained speaker encoder to extract the embedding vectors used to condition the outcomes of this component. For the sake of reproducibility, the adopted vocoder component of the system is a PyTorch github implementation of the neural vocoder WaveRNN [15] (https://github.com/fatchord/WaveRNN). This model is not directly conditioned on the output of the speaker encoder but only on the input mel spectrogram. The multi-speaker vocoder is simply trained by using data from many speakers (see Section 3).

The conditioning of the synthesizer via the speaker encoder is the fundamental part that makes the system multi-speaker: the embedding vectors computed by the speaker encoder condition the mel spectrograms generated by the synthesizer so that they incorporate the new speaker's voice. In [9], the embedding vectors are speaker embeddings obtained by Equation 2. We used the utterance embeddings computed by Equation 1. In fact, at inference time only one utterance of the target speaker is fed to the speaker encoder, which therefore produces a single utterance-level d-vector. Thus, in this case, it is not possible to create an embedding at the speaker level since the averaging operation cannot be applied. This implies that only utterance embeddings can be used during the inference phase. In addition, an averaging mechanism could cause some loss in terms of accuracy, since larger variations in pitch and voice quality often occur across utterances of the same speaker, while single utterances have lower intra-variation. Following [9], the embedding vectors computed by the speaker encoder are concatenated only with the synthesizer encoder output in order to condition the synthesis. However, we experimented with a new concatenation technique: first we passed the embedding through a single linear layer, and then we concatenated the output of this layer with the synthesizer encoder output. The goal was to exploit the weights of the linear layer to make the embedding vector more meaningful, since the layer is trained together with the synthesizer. We noticed that this method achieved good convergence of training and was about 75% faster than the former vector concatenation.
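A minimal sketch of this conditioning step, assuming the synthesizer text encoder produces a (batch, time, channels) tensor; the layer sizes and names are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EmbeddingConditioner(nn.Module):
    """Project the speaker-encoder d-vector with a trainable linear layer, then
    concatenate it to every time step of the synthesizer encoder output."""

    def __init__(self, embed_dim=256, proj_dim=256):
        super().__init__()
        self.proj = nn.Linear(embed_dim, proj_dim)

    def forward(self, enc_out, d_vector):
        # enc_out:  (batch, T, C)  synthesizer text-encoder outputs
        # d_vector: (batch, embed_dim) utterance embedding from the speaker encoder
        e = self.proj(d_vector)                             # (batch, proj_dim)
        e = e.unsqueeze(1).expand(-1, enc_out.size(1), -1)  # broadcast over time steps
        return torch.cat([enc_out, e], dim=-1)              # (batch, T, C + proj_dim)
```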
3. EXPERIMENTS AND RESULTS
We used different publicly available datasets to train and evaluate the components of the system. For the speaker encoder, different neural network architectures were tested. Each of them was trained using a combination of three public sets: LibriTTS [16] train-other and dev-other, VoxCeleb [17] dev, and VoxCeleb2 [18] dev. In this way, we obtained 8,381 speakers and 1,419,192 utterances, not necessarily all clean and noiseless. Furthermore, transcripts were not required. The models were trained using Adam [19] as optimizer with an initial learning rate of 0.001. Moreover, we experimented with different learning rate decay strategies.

During the evaluation phase, we used a combination of the test sets corresponding to the training ones, obtaining 191 speakers and 45,132 utterances. Both training and test sets were sampled at 16 kHz, and input mel spectrograms were computed from 25 ms STFT analysis windows with a 10 ms step and passed through a 40-channel mel-scale filterbank.

We separately trained the synthesizer and the vocoder using the same training set, given by the combination of the two "clean" sets of LibriTTS, obtaining 1,151 speakers, 149,736 utterances and a total of 245.14 hours of 22.05 kHz audio. We trained the synthesizer using the L1 loss [20] and Adam as optimizer. Moreover, the input texts were converted into phoneme sequences, and target mel spectrogram features were computed on 50 ms signal windows, shifted by 12.5 ms and passed through an 80-channel mel-scale filterbank. The vocoder was trained using groundtruth waveforms rather than the synthesizer outputs.
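For reference, a sketch of the two feature configurations above using librosa; the log compression and any settings beyond the stated window length, hop and channel counts are assumptions.

```python
import librosa
import numpy as np

def mel_features(wav, sr, win_ms, hop_ms, n_mels):
    """Log-mel spectrogram with the window/hop/channel settings given in the text."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=win, win_length=win, hop_length=hop, n_mels=n_mels
    )
    return np.log(mel + 1e-5).T  # (frames, n_mels)

# Speaker encoder features: 16 kHz audio, 25 ms window, 10 ms step, 40 channels
# enc_mel = mel_features(wav_16k, 16000, 25, 10, 40)
# Synthesizer targets: 22.05 kHz audio, 50 ms window, 12.5 ms step, 80 channels
# syn_mel = mel_features(wav_22k, 22050, 50, 12.5, 80)
```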
We chose as baseline for our work Corentin Jemine's real-time voice cloning system [21], a public re-implementation of the Google system [9] available on github (https://github.com/CorentinJ/Real-Time-Voice-Cloning). This system is composed of three components: a recurrent speaker encoder consisting of 3 LSTM [22] layers and a final linear layer, each of which has 256 units; a sequence-to-sequence with attention synthesizer based on [1]; and WaveRNN [15] as vocoder.

To evaluate all the speaker encoder models and choose the best one, the Speaker Verification Equal Error Rate (SV-EER) was estimated by pairing each test utterance with each enrollment speaker. The models implemented are:
• rec conv network: 5 Conv1D layers, 1 GRU layer and a final linear layer;
• rec conv 2 network: 3 Conv1D layers, 2 GRU layers each followed by a linear projection layer;
• gru network: 3 GRU layers each followed by a linear projection layer;
• advanced gru network: 1 Conv1D layer and 3 GRU layers each followed by a linear projection layer (Figure 2);
• lstm network: 1 Conv1D layer and 3 LSTM [22] layers each followed by a linear projection layer.
All layers have 512 units except the linear ones, which have 256. Moreover, a dropout rate of 0.2 was used after all layers except before the first and after the last. All the models were trained using a batch size of 64 speakers with 10 utterances per speaker (a batch-assembly sketch is given after Table 1). The results obtained are shown in Table 1.
Table 1. Speaker Verification Equal Error Rates.
Name        Step Time   Train Loss   SV-EER   LR Decay
rec conv                                      Exponential
lstm        1.08s       0.17         0.052    Exponential
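As a concrete illustration of the training configuration above (64 speakers per batch, 10 utterances each, 160-frame partial utterances), the following is a minimal batch-assembly sketch; it assumes every utterance has at least 160 mel frames and every speaker has at least 10 utterances, and it is not the authors' data pipeline.

```python
import random
import torch

def sample_ge2e_batch(utterances_by_speaker, n_speakers=64, n_utts=10, n_frames=160):
    """Assemble one GE2E-style training batch: n_speakers speakers with n_utts
    partial utterances each, every partial utterance being a random 1.6 s
    (160-frame) crop of a full (frames, 40) mel spectrogram.

    utterances_by_speaker: dict mapping speaker id -> list of mel arrays
    Returns a tensor of shape (n_speakers * n_utts, n_frames, 40)."""
    speakers = random.sample(list(utterances_by_speaker), n_speakers)
    batch = []
    for spk in speakers:
        for mel in random.sample(utterances_by_speaker[spk], n_utts):
            start = random.randint(0, max(0, len(mel) - n_frames))
            crop = torch.as_tensor(mel[start:start + n_frames], dtype=torch.float32)
            batch.append(crop)
    return torch.stack(batch)
```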
We designed the advanced gru network trying to combine the advantages of convolutional and GRU networks. In fact, looking at the table, this architecture was much faster than the gru network during training and obtained the best SV-EER on the test set. Figure 3 illustrates the projection in a two-dimensional space of the utterance embeddings computed by the advanced gru network on the basis of 6 utterances extracted from 12 speakers of the test set. In Figure 4, the 12 speakers are 6 men and 6 women. The projections were made using UMAP [23]. Both figures show that the model has created a space of internal features that is robust with respect to the speakers, forming well-separated clusters of speakers based on their utterances and nicely separating male speakers from female ones. The SV-EER obtained on the test set by the speaker encoder model of the proposed system is 0.040, vs. 0.049 for the baseline.

Fig. 3. Advanced gru network test utterance embeddings projection.

Fig. 4. Advanced gru network: six utterances for six male speakers vs. six utterances for six female speakers taken from the test set.
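A minimal sketch of how such a two-dimensional projection could be produced with the umap-learn package, assuming a matrix of 256-dimensional utterance embeddings and integer speaker labels; the function name and plotting choices are illustrative.

```python
import numpy as np
import umap                      # umap-learn package
import matplotlib.pyplot as plt

def plot_embedding_projection(embeddings, speaker_ids):
    """Project utterance d-vectors to 2-D with UMAP and color points by speaker."""
    proj = umap.UMAP(n_components=2).fit_transform(np.asarray(embeddings))
    plt.scatter(proj[:, 0], proj[:, 1], c=np.asarray(speaker_ids), cmap="tab20", s=12)
    plt.title("Utterance embedding projection")
    plt.show()
```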
To assess how similar the waveforms generated by the system were to the original ones, we transformed the produced audio signals into utterance embeddings (using the advanced gru network speaker encoder) and then projected them in a two-dimensional space together with the utterance embeddings computed on the basis of the groundtruth audio. As test speakers, we randomly chose eight target speakers: four speakers (two male and two female) were extracted from the test-set-clean of LibriTTS [16], three (two male and one female) from VCTK [24], and finally a female proprietary voice. For each speaker we randomly extracted 10 utterances and compared them with the utterances generated by the system by computing the cosine similarity. The per-speaker average cosine similarity between the generated and groundtruth utterance embeddings ranges from 0.56 to 0.76. Figure 5 shows that synthesized utterances tend to lie close to real speech from the same speaker in the embedding space.
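A small sketch of this similarity check, assuming the utterance embeddings are L2-normalized d-vectors (as produced by the speaker encoder), so cosine similarity reduces to a dot product; the pairing strategy shown here is an assumption.

```python
import numpy as np

def average_cosine_similarity(gt_embeds, gen_embeds):
    """Mean cosine similarity between groundtruth and generated utterance
    embeddings of one speaker; both arrays have shape (n_utterances, 256)
    and rows are assumed to be unit-norm d-vectors."""
    gt = np.asarray(gt_embeds)
    gen = np.asarray(gen_embeds)
    return float(np.mean(np.sum(gt * gen, axis=1)))
```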
Fig. 5. Groundtruth utterance embeddings vs. the corresponding generated ones for the 8 speakers chosen for testing.

Finally, we evaluated how similar, subjectively speaking, the generated utterances were to the original ones in terms of speech timbre. To do this, we gathered Mean Similarity Scores (MSS) based on a 5-point mean opinion score scale, where 1 stands for "very different" and 5 for "very similar". Ten utterances of the proprietary female voice were cloned using both the proposed and the baseline system, and then 12 subjects, most of them TTS experts, were asked to listen to the 20 samples, randomly mixed, and rate them. Participants were also provided with an original utterance as reference. The question asked was: "How do you rate the similarity of these samples with respect to the reference audio? Try to focus on vocal timbre and not on content, intonation or acoustic quality of the audio". The results are shown in Table 2. Although not conclusive, this experiment provides subjective evidence of the effectiveness of the proposed approach, despite the significant variance of both systems, which is largely due to the low number of test participants.
Table 2. MSS of the baseline and the proposed systems.
System     MSS
baseline   . ± .
proposed   . ± .
4. CONCLUSIONS
In this work, our goal was to build a Voice Cloning system which could generate natural speech for a variety of target speakers in a data-efficient manner. Our system combines an independently trained speaker encoder network with a sequence-to-sequence with attention architecture and a neural vocoder model. Using a transfer learning technique from a speaker-discriminative encoder model based on utterance embeddings rather than speaker embeddings, the synthesizer and the vocoder are able to generate good quality speech also for speakers not observed before. Although the experiments showed a reasonable similarity with real speech and improvements over the baseline, the proposed system does not fully reach human-level naturalness, in contrast to the single-speaker results from [1]. Additionally, the system is not able to reproduce the prosody of the target audio. These are consequences of the additional difficulty of generating speech for a variety of speakers given significantly less data per speaker, unlike when training a model on a single speaker.
5. ACKNOWLEDGEMENTS
The authors thank Roberto Esposito, Corentin Jemine, Quan Wang, Ignacio Lopez Moreno, Skjalg Lepsøy, Alessandro Garbo and Jürgen Van de Walle for their helpful discussions and feedback.

6. REFERENCES

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Y. Agiomvrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.

[2] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2015.

[4] Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 2962–2970, Curran Associates, Inc., 2017.

[5] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, "Deep Voice 3: 2000-speaker neural text-to-speech," in International Conference on Learning Representations, 2018.

[6] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.

[7] Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," in International Conference on Learning Representations, 2018.

[8] Eliya Nachmani, Adam Polyak, Yaniv Taigman, and Lior Wolf, "Fitting new speakers based on a short untranscribed sample," CoRR, vol. abs/1802.06984, 2018.

[9] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," CoRR, vol. abs/1806.04558, 2018.

[10] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez-Moreno, "Generalized end-to-end loss for speaker verification," in Proc. ICASSP, 2018, pp. 4879–4883.

[11] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in Proc. ICASSP, 2016, pp. 5115–5119.

[12] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014.

[13] Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, and Daniel J. Inman, "1D convolutional neural networks and applications: A survey," ArXiv, vol. abs/1905.03554, 2019.

[14] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in NIPS 2014 Workshop on Deep Learning, December 2014.

[15] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, "Efficient neural audio synthesis," in ICML, 2018.

[16] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in INTERSPEECH, 2019.

[17] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in INTERSPEECH, 2017.

[18] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in INTERSPEECH, 2018.

[19] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, Dec. 2014.

[20] Katarzyna Janocha and Wojciech Czarnecki, "On loss functions for deep neural networks in classification," ArXiv, vol. abs/1702.05659, 2017.

[21] Corentin Jemine, "Master thesis: Automatic multispeaker voice cloning," 2019, Unpublished master's thesis, Université de Liège, Liège, Belgique.

[22] Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and Jürgen Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017.

[23] Leland McInnes and John Healy, "UMAP: Uniform manifold approximation and projection for dimension reduction," ArXiv, vol. abs/1802.03426, 2018.