Phonological Features for 0-shot Multilingual Speech Synthesis
Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao
Papercup Technologies Ltd., Novoic
{marlene,tian}@papercup.com

* Raphael Lenain: work done while at Papercup Technologies Ltd.

Abstract
Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching [1–3]. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, and consonant place and manner. This allows the model topology to stay unchanged for different languages, and enables new, previously unseen feature combinations to be interpreted by the model. We show that this allows us to generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
Index Terms: speech synthesis, zero-shot, code-switching
1. Introduction
End-to-end TTS models such as Tacotron 2 are able to generate highly natural-sounding speech, mapping text inputs directly into acoustic outputs [4]. Recently, transformations of text used in traditional TTS, such as phonemes [5], have been found to improve naturalness over characters as inputs within end-to-end models [6]. Similar advantages have been found with phonological features (PFs), such as place and manner of articulation [7], when used in addition to, or in place of, phonemes as inputs to DNN TTS models [2]. Using phonemes, PFs or both has proven an essential step in training multilingual models [1–3]. For low-resource settings, performance improvements in both TTS [8] and Automatic Speech Recognition [9] were found with PFs over phonemes alone, using either multilingual/multitask or transfer learning from high-resource TTS.

Zhang and colleagues [1] first noted the ability of a multilingual, multi-speaker Tacotron 2 to code-switch, i.e. produce natural-sounding speech of one speaker in two languages, only one of which has been previously seen for that speaker. In their setup, which uses phonemes as inputs, code-switching is only possible for languages within the training data. However, training data is only readily available for a fraction of the world's 5,000–7,000 languages. The ability to generate appropriate pronunciations, minimally for foreign names, organisations or locations, without requiring large multilingual corpora, is therefore highly desirable in speech applications.

An alternative approach is presented in [10], which explores Unicode bytes as a possible input to a multi-speaker, multilingual Tacotron 2. The advantage over single-valued inputs such as characters or phonemes is that new characters can be added without changing the model topology. This makes it suitable for transfer learning across languages. However, since Unicode only encodes typographic, and not phonological, information, nothing is learned in this model about unseen byte combinations, likely requiring at least parts of the model to be relearned entirely when enrolling a new language.

Gutkin and colleagues [2, 3, 8, 11] demonstrate the possibility of synthesising a previously unseen language within multilingual models trained on 9–39 languages, partially with phylogenetic relationships between each other. They use PFs [2, 8] or a combination of PFs and phonemes [2, 3, 11] as inputs to their neural, multilingual TTS models. They show that various automatically derived phonological feature sets can be used to either replace or supplement phonemes as input features, yielding improved intelligibility over a phoneme-only baseline across a variety of trained and even untrained languages [2]. To our knowledge, they do not attempt to synthesise any phonemes completely unseen in training. Notably, models which concatenate PFs to phonemes suffer the same constraints on extending the phoneme inventory as phoneme- or character-based models, and do not allow for previously unseen phonemes to be synthesised without manual mapping or further training. Finally, these models require a substantial number of training languages to allow for the generation of a new language.

PFs offer a shared model topology across languages, similar to the "byte-like" representation used in [10], and maintain the connection to abstract, phonological categories, while also providing explanatory power on a level closer to the acoustics of an utterance [12].
While the applicability of a specific PF set to all languages is questionable, certain phonological contrasts, such as "front–back", have been shown to generalise across various language families [13]. At a lower bound, where phonological categories such as "fricatives", "rounded vowels", etc. do not share any acoustic properties amongst themselves, PF vectors can be seen as unique identifiers, i.e. the "byte-version" of phonemes. If, as we expect, phonological categories have acoustic correlates, we are able to transfer what is learnt to new, unseen or infrequent combinations of sounds. In the case where the acoustics can be disentangled into (somewhat orthogonal) PFs, they would enable us to create new sounds, even including those not present in any human language.

We extend the work in [2, 3, 8, 11] by showing that PFs enable code-switching into an untrained language within a small multilingual, or even a monolingual, model. Most importantly, we investigate the model's ability to synthesise sounds completely unseen in training (as opposed to an untrained language containing only previously encountered phonemes). While we envision the application of this research to be in code-switching, we conduct our experiments by synthesising full sentences in an untrained language, marking an extreme case where all words are "code-switched". Further applications may include TTS for low-resource languages, as our experiments simulate a zero-resource setting (irrespective of a particular choice of language).

2. Model

2.1. Phonological features

A range of PF sets have been put forward, including structured primes (nuclear units used to compose phonemes) [14], binary features such as "voice", "tense", "vocalic" [15], or multi-valued ones such as place and manner of articulation [7]. Within the latter, some features can be seen as continuous, for example the horizontal dorsum position for vowels [9].

In this paper, we use the following set of 10 categorical, multi-valued PFs, 9 of which are directly read from the IPA [16]: consonant/vowel, voicing (voiced/unvoiced), vowel frontness, vowel openness, vowel roundedness, stress on vowel, consonant place, consonant manner, and diacritic (e.g., nasalised, velarised). Features relating only to vowels are set to
NULL for consonants, and vice versa. The tenth feature, "symbol type", is used to integrate symbols that mark, e.g., silences, the end of a sentence, or word boundaries, with all phoneme symbols sharing a single value on this feature.

Each multi-valued feature is 1-hot encoded into a varying number of binary variables, making up a total of 60 binary features. Phoneme identity is not used as a separate feature, as in other work [3, 8, 11], since this inhibits the encoding of unseen phonemes in our 0-shot experiments.
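To make the encoding concrete, the sketch below (our illustration, not the authors' released code) builds the binary vector for one phoneme from an IPA lookup; the feature inventories and the lookup table are heavily abridged.

```python
# Illustrative sketch of the PF encoding: each categorical feature is
# 1-hot encoded and the blocks are concatenated. Abridged for brevity.
from typing import Dict, List

FEATURES: Dict[str, List[str]] = {
    "symbol_type": ["phoneme", "silence", "word_boundary", "end_of_sentence"],
    "cons_vowel":  ["consonant", "vowel", "NULL"],
    "voicing":     ["voiced", "unvoiced", "NULL"],
    "frontness":   ["front", "central", "back", "NULL"],
    "openness":    ["close", "mid", "open", "NULL"],
    # ... roundedness, vowel stress, consonant place/manner, diacritic
}

IPA_TABLE: Dict[str, Dict[str, str]] = {
    "i": {"symbol_type": "phoneme", "cons_vowel": "vowel", "voicing": "voiced",
          "frontness": "front", "openness": "close"},
    "t": {"symbol_type": "phoneme", "cons_vowel": "consonant",
          "voicing": "unvoiced", "frontness": "NULL", "openness": "NULL"},
}

def encode_phoneme(ipa: str) -> List[int]:
    """Concatenate one 1-hot block per categorical feature."""
    feats = IPA_TABLE[ipa]
    vec: List[int] = []
    for name, values in FEATURES.items():
        block = [0] * len(values)
        block[values.index(feats[name])] = 1
        vec.extend(block)
    return vec  # the full feature set yields a 60-dim binary vector

print(encode_phoneme("i"))  # [1,0,0,0, 0,1,0, 1,0,0, 1,0,0,0, 1,0,0,0]
```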
2.2. Baseline model

We use a (monolingual/multilingual), multi-speaker variant of Tacotron 2 [1, 4] as our baseline, mapping from phonemes to mel-filterbank features (MFBs). The input text is transformed into phonemes using a linguistic frontend (see 4.2). The Griffin-Lim vocoder [17] is used to map from MFBs to waveform.
2.3. Proposed model

In our proposed PF model, the phoneme embedding table in the baseline is replaced with a single, linear feedforward layer on top of our binary input features. The output dimensionality of this layer is the same as the embedding size of the baseline model. Since the number of binarised PFs in the input feature vector is less than the size of the phoneme inventories used in our experiments (see 4.1), the number of parameters used to represent the input is smaller than in our baseline model.

A phonemic transcription is obtained from text in the same way as in our baseline model, and then further mapped to its PF representation using a dictionary lookup based on the IPA. Some additional mappings between the IPA and resource-specific symbols were necessary, due to the lexical resources used (see 4.2). Here, we chose the IPA as it is a widely adopted resource. In principle, this framework extends to other PF sets [18–20] for which automatic extraction is also possible [2].
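A minimal sketch of this input-layer swap, assuming PyTorch and an embedding size of 512 (both our assumptions; the paper specifies neither):

```python
import torch
import torch.nn as nn

NUM_PHONEMES, NUM_PF_BITS, EMBED_DIM = 100, 60, 512  # sizes assumed for illustration

# Baseline: one learned vector per phoneme identity.
baseline_embedding = nn.Embedding(NUM_PHONEMES, EMBED_DIM)

# Proposed: a single linear layer over the 60 binary PFs. Any feature
# combination, seen or unseen in training, maps into the same space,
# and the layer has fewer parameters than a large embedding table.
pf_layer = nn.Linear(NUM_PF_BITS, EMBED_DIM)

phoneme_ids = torch.tensor([[3, 17, 42]])                      # (batch, time)
pf_vectors = torch.randint(0, 2, (1, 3, NUM_PF_BITS)).float()  # (batch, time, 60)

# Both produce (batch, time, EMBED_DIM) inputs for the Tacotron 2 encoder.
assert baseline_embedding(phoneme_ids).shape == pf_layer(pf_vectors).shape
```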
3. 0-shot TTS
In our proposed model, our method (AUTO) of synthesising unseen phonemes is straightforward: PFs of new, out-of-sample (OOS) phonemes are simply inferred from the IPA. PFs can be directly fed to the network without any modifications.

Table 1: Data set statistics, including the number of hours in the training set, number of unique phonemes and out-of-sample target phonemes in the test language German (OOS). Columns: Corpus, Hours, Speakers, Phonemes, OOS; rows: VCTK, MIX.

To demonstrate the effectiveness of our approach, we consider two baselines for "0-shot" TTS in an untrained language: 1) RANDOM, which maps OOS phonemes to a new, randomly initialised (untrained) vector in the phoneme embedding table; and 2) MANUAL, for which we manually map all OOS phonemes to their "closest" existing phoneme in the trained embedding table of the Tacotron 2 encoder. We define the closest phoneme as one with a minimal number of differing PFs (e.g., the rounded version of an unrounded vowel), which sounds close to the target sound to a native speaker. In rare cases, a perceptually closer match is found that has less overlap in PF space, e.g., when mapping [ʀ] to [ɹ] (see 4.6 for a discussion).

RANDOM serves as a lower bound, demonstrating what can be achieved without additional linguistic knowledge. The comparison of AUTO against MANUAL tests whether our model can go beyond the training data and approximate new sounds from never-seen combinations of features, thus outperforming an expert mapping. Minimally, we expect our model to be able to automatically and reliably "pick" the closest sound, thereby performing on par with MANUAL.
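The three strategies can be sketched as follows. This is our illustration only, and it approximates MANUAL purely by the minimal number of differing PFs, whereas the actual mapping also drew on native-speaker judgment:

```python
import random

def hamming(a, b):
    """Number of differing binary PF values between two phonemes."""
    return sum(x != y for x, y in zip(a, b))

def oos_input(oos_pf, strategy, trained_pfs, embedding_table):
    """trained_pfs: phoneme -> PF vector; embedding_table: phoneme -> trained vector.

    AUTO returns a PF input for the proposed model; RANDOM and MANUAL
    return vectors in the baseline's phoneme embedding space.
    """
    if strategy == "AUTO":
        return oos_pf  # IPA-derived PF vector is fed to the network unchanged
    if strategy == "MANUAL":
        nearest = min(trained_pfs, key=lambda p: hamming(trained_pfs[p], oos_pf))
        return embedding_table[nearest]  # reuse the closest trained phoneme
    if strategy == "RANDOM":
        dim = len(next(iter(embedding_table.values())))
        return [random.gauss(0.0, 1.0) for _ in range(dim)]  # untrained vector
    raise ValueError(strategy)
```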
4. Experiments
4.1. Data

We use two corpora for experimentation: VCTK [21], an open-source, multi-speaker, multi-dialect corpus in English, and Adrianex, a proprietary, multi-speaker corpus in Mexican Spanish. We compare the performance of the baseline and proposed model, using A) VCTK only or B) both corpora combined (MIX). German is used as the target language (see 4.5.1). Statistics, including the number of OOS phonemes in German, are shown in Table 1. The two data settings are used to explore the relationship between the number of unseen phonemes and target-language intelligibility. Audio was downsampled to 24 kHz, and 128-dimensional MFBs were extracted every 10 ms over a window of 50 ms.
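These extraction settings translate directly into, e.g., librosa parameters; the library choice is our assumption, as the paper does not name its feature extractor:

```python
import librosa

wav, sr = librosa.load("utterance.wav", sr=24000)  # downsample to 24 kHz
mfb = librosa.feature.melspectrogram(
    y=wav,
    sr=sr,
    n_fft=int(0.050 * sr),       # 50 ms analysis window (1200 samples)
    hop_length=int(0.010 * sr),  # 10 ms hop (240 samples)
    n_mels=128,                  # 128-dimensional MFBs
)
print(mfb.shape)  # (128, n_frames)
```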
Figure 1: t-SNE plot of phoneme representations (first encoder layer) from the baseline and proposed model in the MIX data condition: (a) baseline model embedding layer; (b) phonological features after the first layer; (c) as (b), with unseen phonemes. Red in (c): additional phonemes unseen in training. (Note: (c) is a different t-SNE view from (b) of the same representation space.)

4.2. Linguistic frontend

For English, phonemic transcriptions of the input text are obtained from the Received Pronunciation (RP) version of the Combilex dictionary [22]. For Spanish, pronunciations for each word are obtained using a set of Mexican Spanish pronunciation rules, modified from [23]. For test sentences in German, the lexicon from the German MARY-TTS voice [24] is used. For all models, including the baseline, resource-specific phoneme sets need to be mapped into a shared inventory, such as the IPA. (Mapping tables for the listed resources, as well as an IPA-PF lookup dictionary, are available at https://github.com/papercup-open-source/phonological-features.)

Diphthongs, which are represented as a single symbol in the English and German dictionaries, are split into their component vowels. Lexical stress is added to the vowel in a stressed syllable. In the baseline model, stressed vowels are added as a separate symbol in the phoneme embedding table. For diphthongs, stress is added on the first vowel [25]. Vowel length information, which is available from the German, but not the English, lexicon, is discarded.
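A toy sketch of the two adjustments just described, diphthong splitting and stress placement; the diphthong table is abridged and all names are our own:

```python
# Toy frontend normalisation; stress is shown as a prefix symbol here,
# whereas the models encode it as a separate symbol or PF value.
DIPHTHONGS = {"aɪ": ["a", "ɪ"], "aʊ": ["a", "ʊ"], "ɔʏ": ["ɔ", "ʏ"]}

def normalise(phonemes, stressed_index=None):
    """Split diphthongs into component vowels; mark stress on the first vowel."""
    out = []
    for i, p in enumerate(phonemes):
        parts = list(DIPHTHONGS.get(p, [p]))  # copy, so the table stays intact
        if i == stressed_index:
            parts[0] = "ˈ" + parts[0]  # stress lands on the first vowel [25]
        out.extend(parts)
    return out

# German "Haus": the diphthong is split, stress goes on its first vowel.
print(normalise(["h", "aʊ", "s"], stressed_index=1))  # ['h', 'ˈa', 'ʊ', 's']
```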
4.3. Training setup

We train our models for 200k iterations, with a batch size of 32 and an initial learning rate of 10^-3, which is decreased to 10^-4 after 100k iterations. Following [4], teacher-forcing in conjunction with dropout is used during training to align predicted and true MFB output sequences, ŷ and y. The L1 norm between ŷ and y is used as the loss function for maximum-likelihood training. To discourage over-reliance on teacher-forcing in the early stages of training, we start with a Prenet layer size of 64, and increase it to 256 after 40k iterations. We found this initial reduction to be essential for the attention network to train successfully across different data settings. All other training parameters are as described in [4].

4.4. Phoneme representation space

We use t-SNE plots [26] to inspect the phoneme representation spaces learned by our proposed versus the baseline model (Fig. 1). In both models, we observe a meaningful arrangement of phonemes that lends itself to phonological interpretation. For instance, vowels tend to be closer to each other than to consonants, as is the case for voiced versus unvoiced versions of a consonant, and phonemic classes such as nasals, fricatives, stops, etc. However, there are no identifiable clusters in the baseline model, and neighbouring phonemes appear more or less equidistant from each other, likely serving as unique identifiers without further meaningful axes of variation. In contrast, tight clusters emerge in the projection space of our proposed model for vowels versus consonants, nasals, stops, fricatives, etc., suggesting a richer, more meaningful space is learnt. Fig. 1(b) and 1(c) also suggest an intuitive information hierarchy, with PFs clustering by broader group (vowel/consonant/other) first, then, within consonants, by manner second, place third, and everything else (voicing, diacritic) last.

Moreover, Fig. 1(c) shows that new, unseen phonemes are projected into intuitive places within that space without further training. This supports the expectation that, even if the model is unable to interpolate between feature combinations to produce new sounds, it will be able to automatically map unseen sounds to the closest seen sound in feature space, with high accuracy.
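This inspection can be reproduced along the following lines; the sketch uses stand-in random data, whereas in practice each row of `reps` would be the first-layer output for one phoneme (its embedding in the baseline, or the linear projection of its PF vector in the proposed model):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in data: replace with the per-phoneme first-layer representations.
rng = np.random.default_rng(0)
reps = rng.normal(size=(60, 512))        # one 512-dim row per phoneme
labels = [f"p{i}" for i in range(60)]    # would be IPA symbols in practice

coords = TSNE(n_components=2, perplexity=15.0, init="pca").fit_transform(reps)
plt.scatter(coords[:, 0], coords[:, 1], s=6)
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y), fontsize=7)
plt.savefig("phoneme_tsne.png")
```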
Table 2: Test set statistics, including number of sentences (n sents), sentence length (sent. len.) and Unseen Phoneme Rate (UPR) in percent, in each data setting (where µ is the mean).

4.5. Listening tests

Listening tests were conducted to test whether 1) 0-shot German speech from AUTO is more intelligible than RANDOM, and competitive with MANUAL; 2) AUTO outperforms RANDOM, and is competitive with MANUAL in terms of listener preference; and 3) listener preference is the same for the baseline and the proposed model in the trained language (English).

4.5.1. Test sets

For 3, the test set consisted of a random subset of 96 validation set sentences from 6 selected RP English VCTK speakers (3 male, 3 female). For 1 and 2, sentences were randomly sampled from Wikipedia articles in German, subject to having seven words (or more, as required), all in-vocabulary. The same 6 VCTK speakers were used to synthesise these two test sets. Statistics on the different test sets are shown in Table 2. The Unseen Phoneme Rate (UPR) was calculated as the number of OOS phonemes in a target utterance, divided by the total number of phonemes in that utterance.
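As defined above, UPR reduces to a one-line ratio (names are ours):

```python
def unseen_phoneme_rate(utterance, trained_inventory):
    """Percentage of phonemes in the utterance that were unseen in training."""
    oos = sum(1 for p in utterance if p not in trained_inventory)
    return 100.0 * oos / len(utterance)

print(unseen_phoneme_rate(["ʃ", "ø", "n"], {"ʃ", "n"}))  # 33.33... (1 of 3 unseen)
```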
Listener preference was evaluated using A/B preference tests. For intelligibility, listeners were asked to transcribe the generated samples, and word-level accuracy was measured. A block design was used to reduce the amount of variability arising from different listeners and different sentence-model pairings. All tests were performed online by native listeners, recruited through Amazon Mechanical Turk [27]. For preference tests, language proficiency was assessed with a preliminary task of transcribing a sample in the target language. For the transcription task, we discarded responses from participants with an overall Word Error Rate (WER) above 80%, prior to analysis. Listeners in this category appeared to be non-native speakers or people who had not seriously attempted the task. After filtering, the total number of listeners was 20 for intelligibility and 30 for preference tests.

Figure 2: Results for intelligibility of 0-shot German (word accuracy, in percent, for the VCTK and MIX data conditions).
4.5.2. Results

Figure 2 shows that AUTO produces significantly more intelligible speech than RANDOM in both data conditions (using pair-wise t-test comparisons; p < .), and outperforms MANUAL in the MIX data setup (p < .). Interestingly, word-level accuracy is similar across data conditions for MANUAL, while AUTO and RANDOM improve with the additional language present in MIX, compared to VCTK alone. The degradation in intelligibility in all tested methods against vocoded human speech is likely due to the strong English accent present in the 0-shot samples. Producing accent-free 0-shot speech, perhaps relying on more sophisticated representations of typology and language [11, 28] or with adversarial losses [1], remains a challenging task for future research.

We find a weak negative correlation between UPR and word accuracy across data settings in RANDOM (r = −., p < .), but not in AUTO or MANUAL. This suggests that, while RANDOM is exposed to pronunciation (and subsequent perception) errors arising from unseen phonemes, AUTO and MANUAL are both effective strategies to overcome, or at least attenuate, them. Further research is needed to determine how and why pronunciations in AUTO improve with more languages, more data or a larger input phoneme inventory. Since we were unable to measure a clear relationship between UPR and word accuracy in AUTO, phoneme coverage alone may not fully explain this effect. Another possibility is that, rather than just exploiting a greater number of learned phonemes, AUTO is actually able to learn a more sophisticated phoneme projection space when trained on MIX.

No clear trend in listener preference was found for any of the compared methods (Fig. 3(a)). A potential explanation is that listeners' preference was heavily influenced by the strong English accent present in the 0-shot speech, which may have overshadowed the preference for any particular model. It is also conceivable that the improved phoneme representation primarily affects pronunciation quality, which manifests itself mostly in intelligibility, over other metrics.

We also did not observe a significant difference in listener preference for the baseline versus our proposed model in English, in either of the data settings (Fig. 3(b)), indicating that the ability to do 0-shot TTS in a new language in our proposed model does not come at the cost of reduced naturalness or quality in the trained language.

Figure 3: Results of the listening evaluations: (a) preference for the 0-shot systems (MIX+AUTO vs. MIX+MANUAL, MIX+AUTO vs. MIX+RANDOM, MIX+AUTO vs. VCTK+AUTO, VCTK+AUTO vs. VCTK+MANUAL, VCTK+AUTO vs. VCTK+RANDOM); (b) preference in English (MIX: baseline vs. proposed; VCTK: baseline vs. proposed).
4.6. Informal listening

Through informal listening, we find that most of the OOS phonemes collapse to neighbouring in-sample phonemes in the audio output. Most often, this produces a perceptually agreeable mapping, e.g. from [ç] to [ʃ] and [ʏ] to [ɪ]. In one case, within the monolingual (VCTK) model, the collapse is inappropriate, mapping the trill [ʀ] to the stop [g]. This does not happen for the multilingual (MIX) model, where [ʀ] sounds either like the alveolar trill [r], or the approximant [ɹ], in the generated output. When a feature is completely unobserved in training (such as "trill" in the VCTK data setting), a manual mapping may be preferable to human listeners, e.g., from [ʀ] to [ɹ] (sharing few PFs, but having a common graphemic symbol 'r').
5. Conclusion and Future Work
By replacing the character input in Tacotron 2 with a relatively small set of IPA-inspired features, we were able to create a model topology which is language-independent and allows for the automatic approximation of sounds unseen in training, exceeding or matching the performance of various baselines, including a resource-intensive expert mapping approach.

Further research is needed to validate this work for different pairs of source and target languages. This involves a more thorough investigation of the effect of the number of training languages, the amount of overlap between phoneme and PF inventories, as well as other typological and phylogenetic relationships. Another avenue of interest would be to further disentangle PFs, as well as accent, in order to generate unseen combinations of features more accurately.
6. Acknowledgements
We would like to thank our advisor, Simon King, for his input to this research, Doniyor Ulmasov, for his questions which inspired the idea for this paper, and the anonymous reviewers for their valuable feedback. Samples can be found at https://research.papercup.com/samples/phonological-features-for-0-shot.

References

[1] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning," INTERSPEECH, pp. 2080–2084, 2019.
[2] A. Gutkin, M. Jansche, and T. Merkulova, "FonBund: A library for combining cross-lingual phonological segment data," Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 2236–2240, 2018.
[3] I. Demirsahin, M. Jansche, and A. Gutkin, "A unified phonological representation of South Asian languages for multilingual text-to-speech," SLTU, pp. 80–84, 2018.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," ICASSP, pp. 4779–4783, 2018.
[5] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[6] J. Fong, J. Taylor, K. Richmond, and S. King, "A comparison between letters and phones as input to sequence-to-sequence models for speech synthesis," 10th ISCA Speech Synthesis Workshop (SSW), pp. 223–227, 2019.
[7] R. Jakobson, C. G. Fant, and M. Halle, Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. Cambridge, Massachusetts: The MIT Press, 1951.
[8] A. Gutkin, "Uniform multilingual multi-speaker acoustic model for statistical parametric speech synthesis of low-resourced languages," INTERSPEECH, pp. 2183–2187, 2017.
[9] S. Stüker and A. Waibel, "Porting speech recognition systems to new languages supported by articulatory feature models," Speech and Computer (SPECOM), 2009.
[10] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, "Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes," ICASSP, pp. 5621–5625, 2019.
[11] A. Gutkin and R. Sproat, "Areal and phylogenetic features for multilingual speech synthesis," INTERSPEECH, pp. 2078–2082, 2017.
[12] S. King and P. Taylor, "Detection of phonological features in continuous speech using neural networks," Computer Speech & Language, vol. 14, no. 4, pp. 333–353, 2000.
[13] C. Johny, A. Gutkin, and M. Jansche, "Cross-lingual consistency of phonological features: An empirical study," INTERSPEECH, pp. 1741–1745, 2019.
[14] J. Harris, English Sound Structure. Oxford, UK: Blackwell Publishers, 1994.
[15] N. Chomsky and M. Halle, The Sound Pattern of English. New York, New York: Harper & Row, 1968.
[16] International Phonetic Association, Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, 1999.
[17] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[18] S. Moran and D. McCloy, Eds., PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History, 2019. [Online]. Available: https://phoible.org/
[19] D. R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. Levin, "PanPhon: A resource for mapping IPA segments to articulatory feature vectors," Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3475–3484, 2016.
[20] D. Dediu and S. Moisik, "Defining and counting phonological classes in cross-linguistic segment databases," Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 1955–1962, 2016.
[21] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," The Centre for Speech Technology Research (CSTR), 2016.
[22] K. Richmond, R. A. Clark, and S. Fitt, "Robust LTS rules with the Combilex speech technology lexicon," INTERSPEECH, pp. 1295–1298, 2009.
[23] P. Oplustil, Multi-style Text-to-Speech using Recurrent Neural Networks for Chilean Spanish. The University of Edinburgh, 2016.
[24] M. Schröder and J. Trouvain, "The German text-to-speech synthesis system MARY: A tool for research, development and teaching," 2001.
[25] P. Ladefoged and K. Johnson, A Course in Phonetics. Nelson Education, 2014.
[26] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[27] K. Crowston, "Amazon Mechanical Turk: A research tool for organizations and information systems scholars," in Shaping the Future of ICT Research: Methods and Approaches. Springer, 2012, pp. 210–221.
[28] Y. Tsvetkov, S. Sitaram, M. Faruqui, G. Lample, P. Littell, D. Mortensen, A. W. Black, L. Levin, and C. Dyer, "Polyglot neural language models: A case study in cross-lingual phonetic representation learning," NAACL-HLT, 2016.