One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
Charles University, Faculty of Mathematics and Physics, Prague, Czechia [email protected], [email protected]
Abstract
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches. Our model is based on Tacotron 2 with a fully convolutional input text encoder whose weights are predicted by a separate parameter generator network. To boost voice cloning, the model uses an adversarial speaker classifier with a gradient reversal layer that removes speaker-specific information from the encoder. We arranged two experiments to compare our model with baselines using various levels of cross-lingual parameter sharing, in order to evaluate: (1) stability and performance when training on low amounts of data, and (2) pronunciation accuracy and voice quality of code-switching synthesis. For training, we used the CSS10 dataset and our new small dataset based on Common Voice recordings in five languages. Our model is shown to effectively share information across languages, and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
Index Terms: text-to-speech, speech synthesis, multilinguality, code-switching, meta-learning, domain-adversarial training
1. Introduction
Contemporary end-to-end speech synthesis systems achieve great results and produce natural-sounding, human-like speech [1, 2], even in real time [3, 4]. They make efficient training possible without high demands on the quality, amount, and preprocessing of training data. Based on these advances, researchers aim at, for example, expressiveness [5], controllability [6], or few-shot voice cloning [7]. When extending these models to support multiple languages, one may encounter obstacles such as different input representations or pronunciations, and imbalanced amounts of training data per language.

In this work, we examine cross-lingual knowledge-sharing aspects of multilingual text-to-speech (TTS). We experiment with more languages simultaneously than most previous TTS work known to us. We can summarize our contributions as follows: (1) We propose a scalable grapheme-based model that utilizes the idea of a contextual parameter generator network [8], and we compare it with baseline models using different levels of parameter sharing. (2) We introduce a new small dataset based on Common Voice [9] that includes data in five languages from 84 speakers. (3) We evaluate the effectiveness of the compared models on ten languages with three different scripts, and we show their code-switching abilities on five languages. For the purposes of the evaluation, we created a new test set of 400 bilingual code-switching sentences. Our source code, hyper-parameters, training and evaluation data, samples, pre-trained models, and interactive demos are freely available on GitHub: https://github.com/Tomiinek/Multilingual_Text_to_Speech
Figure 1: Diagram of our model. The meta-network generates parameters of language-specific convolutional text encoders. Encoded text inputs enhanced with speaker embeddings are read by the decoder. The adversarial classifier suppresses speaker-dependent information in encoder outputs.
2. Related Work
So far, several works have explored training joint multilingual models for text-to-speech, following similar experiments in the field of neural machine translation [10, 8]. Multilingual models offer a few key benefits:
• Transfer learning: We can try to make use of high-resource languages for training TTS systems for low-resource languages, e.g., via transfer learning approaches [11, 12].
• Knowledge sharing: We may think of using multilingual data for joint training of a single shared text-to-speech model. Intuitively, this enables cross-lingual sharing of patterns learned from data. The only work in this area known to us is Prakash et al.'s study [13] on TTS for related Indian languages using hand-built unified phoneme representations.
• Voice cloning: Under certain circumstances, producing speech in multiple languages with the same voice, i.e., cross-lingual voice cloning, is desired. However, audio data where a single speaker speaks several languages is scarce. That is why multilingual voice-cloning systems should be trainable using mixtures of monolingual data. Here, Zhang et al. [14] used Tacotron 2 [1] conditioned on phonemes and showed voice-cloning abilities on English, Spanish, and Chinese. Nachmani and Wolf [15] extended VoiceLoop [16] and enabled voice conversion for English, Spanish, and German. Chen et al. [17] used a phoneme-based Tacotron 2 with a ResCNN-based speaker encoder [18], enabling massively multi-speaker speech synthesis, even with fictitious voices.
• Code switching: In this task, closely related to cross-lingual voice cloning, we would like to alternate languages within sentences. This is useful for foreign names in navigation systems or news readers. In view of that, Cao et al. [19] modified Tacotron; their model uses language-specific encoders, and code-switching itself is done by combining their outputs.

Overall, all recent multilingual text-to-speech systems were only tested on 2-3 languages simultaneously, or required vast amounts of data to be trained.

3. Model Architecture

We base our experiments on Tacotron 2 [1]. We focus on the spectrogram generation part here; for vocoding, we use WaveRNN [3, 20] in all our configurations. We first explain our new model that uses meta-learning for multilingual knowledge sharing in Sec. 3.1, then describe contrastive baseline models which are based on recent multilingual TTS architectures (Sec. 3.2).

3.1. Generated Model (GEN)

We introduce a scalable multilingual text-to-speech model that follows the meta-learning approach of contextual parameter generation proposed by Platanios et al. [8] for NMT (see Fig. 1). We call the model generated (GEN) further in this text.

The backbone of our model is built on our own implementation of Tacotron 2, composed of these main components: (1) an input text encoder that includes a stack of convolutional layers and a bidirectional LSTM, (2) a location-sensitive attention mechanism [1] with the guided attention loss term [21] that supports faster convergence, (3) a decoder with two stacked LSTM layers, where the first queries the attention mechanism and the second generates outputs. We increase the tolerance of the guided attention loss exponentially during training. We propose the following changes to this basic architecture:

Convolutional encoders: We use multiple language-specific input text encoders. However, having a separate encoder with recurrent layers for each language is not practical, as it involves passing the training batches (which should be balanced with respect to languages) through multiple encoders sequentially. Therefore, we use a fully convolutional encoder from DCTTS [21]. The encoders use grouped layers and are thus processed efficiently. We enhance the encoders with batch normalization and dropout with a very low rate. The normalization layers are situated before the activations and the dropouts after them.
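For illustration, the following PyTorch sketch shows how grouped convolutions process all language-specific encoders in a single pass; the layer sizes, names, and activation are our own assumptions, not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class GroupedConvEncoderLayer(nn.Module):
    """One convolutional layer shared across L language-specific encoders.

    Setting groups=num_langs gives each language its own filter bank,
    so a single conv call processes all encoders in parallel.
    """
    def __init__(self, num_langs, channels, kernel_size=5, dropout=0.05):
        super().__init__()
        self.conv = nn.Conv1d(
            num_langs * channels, num_langs * channels,
            kernel_size, padding=kernel_size // 2, groups=num_langs)
        self.norm = nn.BatchNorm1d(num_langs * channels)  # before activation
        self.drop = nn.Dropout(dropout)                    # low rate, after it

    def forward(self, x):  # x: (batch, num_langs * channels, time)
        return self.drop(torch.relu(self.norm(self.conv(x))))

# Example: 10 languages, 64 channels each, batch of 4, 100 time steps
layer = GroupedConvEncoderLayer(num_langs=10, channels=64)
out = layer(torch.randn(4, 10 * 64, 100))
print(out.shape)  # torch.Size([4, 640, 100])
```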
Encoder parameter generation:
To enable cross-lingual knowledge sharing, the parameters of the encoders are generated by a separate network conditioned on language embeddings. The parameter generator is composed of multiple site-specific generators, each of which takes a language embedding on its input and produces the parameters of one layer of the convolutional encoder for the given language. The generators enable controllable cross-lingual parameter sharing, because reducing their size prevents the generation of highly language-specific parameters. We implement them as fully connected layers.
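A minimal sketch of one site-specific generator, with hypothetical layer sizes: a small fully connected network maps a language embedding to the flattened weights of a single convolutional layer, and shrinking its bottleneck limits how language-specific the generated parameters can become:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiteSpecificGenerator(nn.Module):
    """Generates the weights of one conv layer from a language embedding."""
    def __init__(self, emb_dim, in_ch, out_ch, kernel_size, bottleneck=8):
        super().__init__()
        n_params = out_ch * in_ch * kernel_size + out_ch  # weights + biases
        # The small bottleneck controls cross-lingual parameter sharing.
        self.net = nn.Sequential(
            nn.Linear(emb_dim, bottleneck), nn.Linear(bottleneck, n_params))
        self.shapes = (out_ch, in_ch, kernel_size)

    def forward(self, lang_emb, x):
        out_ch, in_ch, k = self.shapes
        p = self.net(lang_emb)
        weight = p[: out_ch * in_ch * k].view(out_ch, in_ch, k)
        bias = p[out_ch * in_ch * k:]
        return F.conv1d(x, weight, bias, padding=k // 2)

gen = SiteSpecificGenerator(emb_dim=10, in_ch=64, out_ch=64, kernel_size=5)
lang_emb = torch.randn(10)             # embedding of one language
y = gen(lang_emb, torch.randn(1, 64, 100))
print(y.shape)                         # torch.Size([1, 64, 100])
```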
Training with multilingual batches:
We construct unusual training batches to fully utilize the potential of this architecture. We would like to have a batch of B examples that can be reshaped into a batch of size B/L, where L is the number of encoder groups or languages. This new batch should have a new dimension that groups all examples of the same language. Thus, we use a batch sampler that creates batches where, for each l < L and i < B/L, all (l + iL)-th examples are of the same language.
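The following sketch illustrates this sampling scheme; the function name and data layout are ours, not taken from the released code:

```python
import random

def multilingual_batches(indices_per_lang, batch_size):
    """Yield batches where example (l + i*L) is in language l, so a batch of
    size B can be reshaped to (B // L, L, ...) grouping the languages.
    `indices_per_lang` maps each language id 0..L-1 to its example indices.
    """
    L = len(indices_per_lang)
    per_lang = batch_size // L
    iters = [iter(random.sample(idx, len(idx))) for idx in indices_per_lang]
    while True:
        batch = []
        try:
            for _ in range(per_lang):        # i = 0 .. B/L - 1
                for it in iters:             # l = 0 .. L - 1
                    batch.append(next(it))   # lands at position l + i*L
        except StopIteration:
            return                           # a language ran out of data
        yield batch

# Toy example: 2 languages, 4 examples each, batch size 4
batches = multilingual_batches([[0, 1, 2, 3], [10, 11, 12, 13]], 4)
print(next(batches))  # e.g. [2, 11, 0, 13]: languages alternate L0, L1, ...
```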
Speaker embedding: We extend the model with a speaker embedding which is concatenated with each element of the encoded sequence attended by the decoder while generating spectrogram frames. This makes the model multi-speaker and allows cross-lingual voice cloning.
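Concatenating the embedding amounts to broadcasting it over the time axis; a minimal sketch with assumed dimensions:

```python
import torch

def add_speaker_embedding(encoder_out, speaker_emb):
    """Concatenate a speaker embedding to every encoder time step.

    encoder_out: (batch, time, enc_dim), speaker_emb: (batch, spk_dim)
    returns:     (batch, time, enc_dim + spk_dim)
    """
    expanded = speaker_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, expanded], dim=-1)

out = add_speaker_embedding(torch.randn(4, 100, 512), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 100, 544])
```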
Adversarial speaker classifier:
We combine the model with an adversarial speaker classifier [14] to boost voice cloning. The classifier follows the principles of domain adversarial training [22] and is used to proactively remove speaker-specific information from the encoders. It includes a single hidden layer, a softmax output layer, and a gradient reversal layer that scales the gradients flowing back to the encoder by -λ. The gradients are clipped to stabilize training. The classifier is optimized to reduce the cross-entropy of the speaker predictions, and the predictions are made separately for each element of the encoders' outputs.
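The gradient reversal mechanism can be sketched as follows; the clipping threshold shown here is an assumption, not the value used in the paper:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda backwards,
    so the encoder learns to remove speaker information while the
    classifier on top still learns to predict speakers."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Clip the reversed gradients to stabilize training (0.25 assumed).
        return -ctx.lam * grad_output.clamp(-0.25, 0.25), None

x = torch.randn(4, 128, requires_grad=True)
y = GradReverse.apply(x, 1.0)   # forward pass: y == x
y.sum().backward()
print(x.grad.unique())          # tensor([-0.2500]): clipped, then reversed
```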
Table 1: Total data sizes per language (hours of audio data) in our cleaned CSS10 (CSS) and Common Voice (CV) subsets.
      DE    EL   SP    FI   FR   HU   JP    NL    RU    ZH
CSS   15.4  3.5  20.9  9.7  16.9 9.5  14.3  11.7  17.7  5.6
CV    4.8   N/A  N/A   N/A  ·    N/A  N/A   ·     ·     ·
3.2. Baseline Models

We compare GEN with baseline models called shared (SHA), separate (SEP), and single (SGL). SGL is a basic Tacotron 2 model; SHA and SEP follow the recent multilingual TTS works of Zhang et al. [14] and Cao et al. [19], respectively, but were slightly adapted to our tasks for a fairer comparison to GEN, as we use more languages and less data than the original works. In the following, we only describe their differences from GEN.

Single (SGL) represents a set of monolingual models that follow vanilla Tacotron 2 [1] with the original recurrent encoder and default settings. SGL cannot be used for code-switching.

Shared (SHA): Unlike GEN, SHA has a single encoder with the original Tacotron 2 architecture, so it fully shares all encoder parameters. This sharing implicitly leads to language-independent encoder outputs. The language-dependent processing happens in the decoder, so the speaker embeddings are explicitly factorized into speaker and language parts.

Separate (SEP) uses multiple language-specific convolutional encoders too, but their parameters are not generated. It also does not include the adversarial speaker classifier.
4. Dataset
We created a new dataset for our experiments, based on carefully cleaning and preprocessing freely available audio sources: CSS10 [23] and a small fraction of Common Voice [9]. Table 1 shows the total durations of the used audio data per language.
CSS10 consists of mono-speaker data in German, Greek, Spanish, Finnish, French, Hungarian, Japanese, Dutch, Russian, and Chinese. It was created from audiobooks and contains various punctuation styles. We applied automated cleaning to normalize the transcripts across languages, including punctuation and some spelling variants (e.g., "œ" → "oe"). We romanized Japanese with MeCab and Romkan [24, 25] and Chinese using Pinyin [26]. We further filtered the data to remove any potentially problematic transcripts: we preserved just the examples with 0.5-10.1 s of audio and 3-190 transcript characters. We computed the means µ and standard deviations σ of the audio durations of groups of examples with the same transcript length, and removed those with durations outside the interval (µ - 3σ, µ + 3σ). In total, the resulting dataset includes 125.26 hours of recordings.
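The two filtering steps can be sketched as follows; this is a simplified stand-in for the actual preprocessing, not the released scripts:

```python
from collections import defaultdict
from statistics import mean, stdev

def prefilter(examples):
    """Keep examples with 0.5-10.1 s of audio and 3-190 transcript chars.
    `examples` is a list of (transcript, duration_seconds) pairs."""
    return [(t, d) for t, d in examples
            if 0.5 <= d <= 10.1 and 3 <= len(t) <= 190]

def filter_outliers(examples):
    """Drop examples whose duration falls outside mu +/- 3*sigma of the
    group of examples sharing the same transcript length."""
    groups = defaultdict(list)
    for text, dur in examples:
        groups[len(text)].append(dur)
    kept = []
    for text, dur in examples:
        durs = groups[len(text)]
        if len(durs) < 2:              # too few samples to estimate sigma
            kept.append((text, dur))
            continue
        mu, sigma = mean(durs), stdev(durs)
        if mu - 3 * sigma <= dur <= mu + 3 * sigma:
            kept.append((text, dur))
    return kept

data = [("hello", 1.0), ("hello", 1.1), ("hello", 1.2), ("hi there", 0.4)]
# The 0.4 s clip is dropped by the absolute duration filter.
print(filter_outliers(prefilter(data)))
```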
Table 2: Left: CERs of ground-truth recordings (GT) and of recordings produced by the monolingual model and the three examined multilingual models. Right: CERs of recordings synthesized by GEN and SHA trained on just 600 or 900 training examples per language. Best results for the given language are shown in bold; "*" denotes statistical significance (established using a paired t-test; p < 0.05). Columns: GT, SGL, SHA, SEP, GEN, SHA-600, SHA-900, GEN-600, GEN-900; rows: DE, EL, SP, FI, FR, HU, JP, NL, RU, ZH. [Numeric values not recoverable from the source.]
To train code-switching models, multi-speaker data is required to disentangle the connection between languages and speakers. We thus enhanced CSS10 with data from Common Voice (CV) for languages included in both sets; the intersection covers German, French, Chinese, Dutch, Russian, Japanese, and Spanish. Since CV is mainly aimed at speech recognition and is rather noisy, we performed extensive filtering: We removed recordings with a negative rating (as provided by CV for each example) and excluded any speakers with less than 50 recordings. We checked a sample of recordings for each speaker, and we removed all their data if we considered the sample to have poor quality. This resulted in a small dataset of 39 German, 22 French, 11 Dutch, 6 Chinese, and 6 Russian speakers; Japanese and Spanish data were removed completely. A lot of recordings in CV contain artifacts at the beginning or end, so we semi-automatically cleaned the leading and trailing segments of all recordings. The dataset has 13.7 hours of audio data in total.
5. Experiments
We compare our models described in Section 3. The experiment in Section 5.1 was designed to show stability and the ability to train on lower amounts of data. We conclude that a character error rate (CER) evaluation [27] is sufficient for this experiment. In Section 5.2, we test pronunciation accuracy and voice quality of code-switching synthesis; we used a subjective evaluation test, as there are no straightforward objective metrics for this task. We used the same vocoder for all models, i.e., the WaveRNN model trained on a training subset of the cleaned CSS10 dataset.
5.1. Multilingual Training

Training: We used our cleaned CSS10 dataset for training; 64 randomly selected samples per language were reserved for validation and another 64 for testing. We did not have the ambition to clone voices in this experiment, so we switched off the speaker classifiers for SHA and GEN (i.e., SHA was reduced to the vanilla Tacotron 2 model with a language embedding). We trained the three models for 50k steps with the Adam optimizer (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁶, weight decay of 10⁻⁶). We used a stepped learning rate that starts from 10⁻³ and halves every 10k steps. In the case of SEP, we used a lower initial learning rate of 10⁻⁴; for SGL, the learning rate schedule was tuned individually per language. We stopped training early after the validation loss started increasing. SHA, SEP, and GEN used speaker embeddings of size 32, and GEN used language embeddings and parameter generators of size 10 and 8, respectively. We used language-balanced batches of size 60 for all models.
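In PyTorch terms, this optimizer and schedule configuration corresponds roughly to the following sketch; the dummy loss stands in for the real training objective:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = Adam(params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-6, weight_decay=1e-6)
# Halve the learning rate every 10k steps (7.5k/5k in the data-stress runs).
scheduler = StepLR(optimizer, step_size=10_000, gamma=0.5)

for step in range(50_000):
    loss = (params[0] * 0.0).sum()   # stand-in for the real training loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
print(scheduler.get_last_lr())       # [3.125e-05]: halved 5 times
```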
Evaluation: We synthesized the evaluation data using all the models followed by WaveRNN, and we sent the synthesized recordings to the Google Cloud Platform ASR (https://cloud.google.com/speech-to-text). Then we computed CERs between the ground-truth and ASR-produced transcripts (we used the native symbols for Chinese and Japanese).
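CER here is the Levenshtein distance between the ASR transcript and the ground truth, normalized by the reference length [27]; a minimal sketch:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length.
    Assumes a non-empty reference string."""
    m, n = len(reference), len(hypothesis)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(dist[j] + 1,       # deletion
                      dist[j - 1] + 1,   # insertion
                      prev + (reference[i - 1] != hypothesis[j - 1]))
            prev, dist[j] = dist[j], cur
    return dist[n] / m

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ~ 0.0909
```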
Results:
Table 2 summarizes the obtained CERs. The first column gives a notion of the performance of the ASR engine: the rates stay below 20% for all languages, and higher CERs are mostly caused by noisy CSS10 recordings. We were not able to train the Greek SGL model due to the low amount of training data; the decoder started to overfit before the attention could be established. The performance of SGL is similar to SHA except for Chinese, Finnish, and Greek. SEP performed noticeably worse than SHA or even SGL. This may be caused by the imbalance between the batch size of the encoder and the decoder, as the encoder's effective batch size is just B/L; our attempts to compensate for this using different encoder and decoder learning rates were not successful. Sharing of the data probably regularized the decoder, so the attention was established even in the case of Greek. GEN seems to be significantly better than SHA on most languages. This fulfills our expectations, as GEN should be more flexible.

Manual error analysis:
We manually inspected the outputs in German, French, Spanish, and Russian. In the case of Spanish, all the models work well; we noticed just differences in the treatment of punctuation. The German outputs by GEN seem to be the best: other models sometimes make unnatural pauses when reaching a punctuation mark, and right after the pauses, they often skip a few words. GEN is noticeably better on French and Russian, where the others produce obvious mispronunciations.

Data-stress training:
To further test the models in data-stress situations, we chose random subsets of 600 and 900 examples per language from the training set (i.e., about 80 or 120 minutes of recordings, respectively). We trained all models on both reduced datasets, but accomplished the training just for SHA and GEN. While training on the bigger and smaller datasets, we decayed the learning rate every 7.5k and 5k training steps, respectively. The right half of Table 2 shows that GEN works better even in data-stress situations: compared to SHA, the GEN models have significantly better CER values on six languages.

5.2. Code-Switching

Training: In this experiment, we only used the five languages where both CSS10 and CV data are available (Table 1) and trained on all data in our cleaned sets; 64 and 4 randomly selected samples for each speaker from CSS10 and CV, respectively, were reserved for validation. The SGL models are not applicable to the code-switching scenario. The SHA, SEP, and GEN models were trained for 50k steps with the same learning rate and schedule settings as in Section 5.1, this time with the adversarial speaker classifiers enabled. Based on preliminary experiments on validation data, we set λ = 1 and weighted the loss of the classifier by 0.125 and 0.5 for GEN and SHA, respectively; the classifiers include a hidden layer of size 256. We set the size of the speaker embeddings to 32 and used a language embedding of size 4 in SHA. GEN uses language embeddings of size 10 and generator layers of size 4. We used mini-batches of size 50 for all models.

Figure 2: Language abilities of the participants of our survey.

Code-switching evaluation dataset:
We created a new small-scale dataset especially for code-switching evaluation. We used bilingual sentences scraped from Wikipedia. For each language, we picked 80 sentences with a few foreign words (20 sentences for each of the 4 other languages); Chinese was romanized. We replaced foreign names with their native forms (see Fig. 3).
Figure 3: Examples of code-switching evaluation sentences.
Subjective evaluation:
We synthesized all evaluation sentences using the speaker embedding of the CSS10 speaker for the base language of the sentence. We arranged a subjective evaluation test and used a rating method that combines a five-point mean opinion score (MOS) with MUSHRA [28]. For each sample, its transcript and the systems' outputs were shown at the same time. Participants were asked to rate them on a scale from 1 to 5 with 0.1 increments and with the labels "Bad", "Poor", "Fair", "Good", "Excellent". To distinguish different error types, we asked for two ratings: (1) fluency, naturalness, and stability of the voice (speaker similarity), checking whether foreign words cause any change to the speaker's voice, and (2) accuracy, testing whether all words are pronounced and the foreign word pronunciation is correct. Participants could leave a textual note at the end of the survey.

For each language, we recruited via the Prolific platform ten native speakers who spoke at least one other language fluently (Fig. 2). They were given twelve sentences with the base language matching their native language, where each of the other languages was represented by three sentences. In 3 of these sentences, a random model output was distorted and used as a sanity check (expected to be rated lowest); all participants passed.

Results:
Table 3 summarizes the results of the survey. The rows marked "All" show the means and variances of the ratings from all 50 participants.
Fig. 4 visualizes the quantiles of the ratings (grouped by dominant language). GEN has significantly higher mean ratings on both scales. Unlike SHA or SEP, it allows cross-lingual mixing of the encoder outputs and enables smooth control over pronunciation. SEP scores consistently worst. The accuracy ratings are overall slightly higher than the fluency ratings; this might be caused by improper word stress, which several participants commented on.

Table 3: Mean (with std. dev.) ratings of fluency, naturalness, and voice stability (top) and pronunciation accuracy (middle); the bottom row shows the number of sentences with word skips. Columns: SHA, SEP, GEN; rows: German, French, Dutch, Russian, Chinese, All. [Numeric values not recoverable from the source.]

Figure 4: Distributions of fluency and accuracy ratings, grouped by the dominant language of the rated sentences.
Manual error analysis:
We found that the models sometimes skip words, especially when reaching foreign words in Chinese sentences. Therefore, we manually inspected all 400 outputs of all models and counted the sentences where any word skip occurred; see the "Word skips" row in Table 3. We found that the GEN model makes far fewer of these errors than SHA and SEP.
6. Conclusion
We presented a new grapheme-based model that uses meta-learning for multilingual TTS. We showed that it significantly outperforms multiple strong baselines on two tasks: data-stress training and code-switching, where our model was favored in both voice fluency and pronunciation accuracy. Our code is available on GitHub. For future work, we consider changes to our model's attention module to further improve accuracy.
7. Acknowledgements
This research was supported by the Charles University grant PRIMUS/19/SCI/10.

8. References

[1] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. Saurous, Y. Agiomvrgiannakis, and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, Apr. 2018, pp. 4779–4783.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," arXiv, vol. abs/1609.03499, Sep. 2016.
[3] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient Neural Audio Synthesis," in International Conference on Machine Learning (ICML), Stockholm, Sweden, Jul. 2018, pp. 2410–2419.
[4] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis," in Advances in Neural Information Processing Systems 32 (NeurIPS), Vancouver, BC, Canada, Dec. 2019, pp. 14910–14921.
[5] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in International Conference on Machine Learning (ICML), Stockholm, Sweden, Jul. 2018, pp. 5180–5189.
[6] W. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Cao, and Y. Wang, "Hierarchical Generative Modeling for Controllable Speech Synthesis," in International Conference on Learning Representations (ICLR), May 2019.
[7] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," in Advances in Neural Information Processing Systems 31 (NeurIPS), Montréal, QC, Canada, Dec. 2018, pp. 4480–4490.
[8] E. A. Platanios, M. Sachan, G. Neubig, and T. M. Mitchell, "Contextual Parameter Generation for Universal Neural Machine Translation," in Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, Oct. 2018, pp. 425–435.
[9] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A Massively-Multilingual Speech Corpus," in Language Resources and Evaluation Conference (LREC), Marseille, France, 2020.
[10] D. Sachan and G. Neubig, "Parameter Sharing Methods for Multilingual Self-Attentional Translation Models," in Third Conference on Machine Translation (WMT): Research Papers, Brussels, Belgium, Oct. 2018, pp. 261–271.
[11] Y.-J. Chen, T. Tu, C.-C. Yeh, and H.-Y. Lee, "End-to-End Text-to-Speech for Low-Resource Languages by Cross-Lingual Transfer Learning," in Interspeech, Graz, Austria, Sep. 2019, pp. 2075–2079.
[12] Y. Lee, S. Shon, and T. Kim, "Learning pronunciation from a foreign language in speech synthesis networks," arXiv, vol. abs/1811.09364, Nov. 2018.
[13] A. Prakash, A. Leela Thomas, S. Umesh, and H. A. Murthy, "Building Multilingual End-to-End Speech Synthesisers for Indian Languages," in 10th ISCA Speech Synthesis Workshop (SSW), Vienna, Austria, Sep. 2019, pp. 194–199.
[14] Y. Zhang, R. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, "Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning," in Interspeech, Graz, Austria, Sep. 2019, pp. 2080–2084.
[15] E. Nachmani and L. Wolf, "Unsupervised Polyglot Text-to-speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019.
[16] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop," in International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, Apr. 2018.
[17] M. Chen, M. Chen, S. Liang, J. Ma, L. Chen, S. Wang, and J. Xiao, "Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding," in Interspeech, Graz, Austria, Sep. 2019, pp. 2105–2109.
[18] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," arXiv, vol. abs/1705.02304, May 2017.
[19] Y. Cao, X. Wu, S. Liu, J. Yu, X. Li, Z. Wu, X. Liu, and H. M. Meng, "End-to-end Code-switched TTS with Mix of Monolingual Recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6935–6939.
[20] Fatchord, "WaveRNN Vocoder + TTS," https://github.com/fatchord/WaveRNN, 2019.
[21] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 4784–4788.
[22] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-Adversarial Training of Neural Networks," J. Mach. Learn. Res., vol. 17, no. 59, pp. 1–35, 2016.
[23] K. Park and T. Mulc, "CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages," in Interspeech, Graz, Austria, Sep. 2019, pp. 1566–1570.
[24] T. Kudo, "MeCab: Yet Another Part-of-Speech and Morphological Analyzer," https://taku910.github.io/mecab/, 2013.
[25] M. Yao, "Romkan: A Romaji/Kana conversion library for Python," https://github.com/soimort/python-romkan, 2015.
[26] L. Yu, "Pinyin," https://github.com/lxyu/pinyin, 2016.
[27] R. W. Soukoreff and I. S. MacKenzie, "Measuring errors in text entry tasks: an application of the Levenshtein string distance statistic," in CHI '01 Extended Abstracts on Human Factors in Computing Systems, Seattle, WA, USA, 2001, pp. 319–320.
[28] ITU-R Recommendation BS.1534, "Method for the subjective assessment of intermediate quality level of audio systems," International Telecommunication Union, 2015.