LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition
Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu
Institute for Interdisciplinary Information Sciences, Tsinghua University, China; Zhejiang University, China; Microsoft Research Asia; Microsoft Azure
{taoqin,sheng.zhao,tyliu}@microsoft.com
ABSTRACT
Speech synthesis (text to speech, TTS) and speech recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of paired text and speech data for model training. However, there are more than 6,000 languages in the world and most of them lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfies the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses extremely low-resource training data. We also conduct comprehensive analyses on LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
ACM Reference Format:
Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu. 2020. LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, CA, USA.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3394486.3403331

(This work was conducted at Microsoft. Correspondence to: Tao Qin.)
KDD '20, August 23–27, 2020, Virtual Event, CA, USA. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7998-4/20/08...$15.00. https://doi.org/10.1145/3394486.3403331
1 INTRODUCTION

Speech synthesis (text to speech, TTS) [28, 30, 35, 41] and speech recognition (automatic speech recognition, ASR) [6, 10, 11] are two key tasks in the speech domain, and attract a lot of attention in both the research and industry communities. However, popular commercialized speech services (e.g., Microsoft Azure, Google Cloud, Nuance, etc.) only support dozens of languages for TTS and ASR, while there are more than 6,000 languages in the world [21]. Most languages lack speech training data, which makes it difficult to support TTS and ASR for these rare languages, since a large amount of high-cost speech training data is required to ensure good accuracy for industrial deployment.

We describe the typical training data used to build TTS and ASR systems as follows:

• TTS aims to synthesize intelligible and natural speech from text sequences, and usually needs single-speaker high-quality recordings that are collected in a professional recording studio. To improve pronunciation accuracy, TTS also requires a pronunciation lexicon to convert the character sequence into a phoneme sequence as the model input (e.g., "speech" is converted into "s p iy ch"), which is called grapheme-to-phoneme conversion [36]. Additionally, TTS models use text normalization rules to convert irregular words into normalized forms that are easier to pronounce (e.g., "Sep 7th" is converted into "September seventh").

• ASR aims to generate correct transcripts (text) from speech sequences, and usually requires speech data from multiple speakers in order to generalize to unseen speakers during inference. The multi-speaker speech data in ASR do not need to be as high-quality as that in TTS, but the data amount is usually an order of magnitude bigger. We call the speech data for ASR multi-speaker low-quality data. Optionally, ASR can first recognize the speech into a phoneme sequence, and further convert it into a character sequence with the pronunciation lexicon as in TTS.
• Besides paired speech and text data, TTS and ASR models can also leverage unpaired speech and text data to further improve performance.
According to the data resources used, previous works on TTS and ASR can be categorized into rich-resource, low-resource, and unsupervised settings. As shown in Table 1, we list the data resources and the corresponding related works in each setting. (The "low quality" here does not mean the quality of the ASR data is very bad; it is just relatively low compared to the high-quality TTS recordings.)

Setting                                        | Rich-Resource     | Low-Resource      | Extremely Low-Resource | Unsupervised
-----------------------------------------------|-------------------|-------------------|------------------------|-------------
pronunciation lexicon                          | ✓                 | ✓                 | ×                      | ×
paired data (single-speaker, high-quality)     | dozens of hours   | dozens of minutes | several minutes        | ×
paired data (multi-speaker, low-quality)       | hundreds of hours | dozens of hours   | several hours          | ×
unpaired speech (single-speaker, high-quality) | ✓                 | dozens of hours   | ×                      | ×
unpaired speech (multi-speaker, low-quality)   | ✓                 | ✓                 | dozens of hours        | ✓
unpaired text                                  | ✓                 | ✓                 | ✓                      | ✓
Related Work (TTS)                             | [22, 28, 30, 35]  | [2, 12, 23, 31]   | Our Work               | /
Related Work (ASR)                             | [6, 10, 11]       | [16, 32, 33, 39]  | Our Work               | [8, 24, 45]
Table 1: The data resources used to build TTS and ASR systems and the corresponding related works in the rich-resource, low-resource, extremely low-resource, and unsupervised settings.

• In the rich-resource setting, both TTS [22, 28, 30, 35] and ASR [6, 10, 11] require a large amount of paired speech and text data to achieve high accuracy: TTS usually needs dozens of hours of single-speaker high-quality recordings, while ASR requires at least hundreds of hours of multi-speaker low-quality data. Besides, TTS in the rich-resource setting also leverages a pronunciation lexicon for accurate pronunciation. Optionally, unpaired speech and text data can be leveraged.
• In the low-resource setting, the single-speaker high-quality paired data are reduced to dozens of minutes in TTS [2, 12, 23, 31], while the multi-speaker low-quality paired data are reduced to dozens of hours in ASR [16, 32, 33, 39], compared to the rich-resource setting. Additionally, these works leverage unpaired speech and text data to ensure performance.
• In the unsupervised setting, only unpaired speech and text data are leveraged to build ASR models [8, 24, 45].

As can be seen, a large amount of data resources is leveraged in the rich-resource setting to ensure the accuracy required for industrial deployment. Considering that nearly all low-resource languages lack training data and there are more than 6,000 languages in the world, collecting such training data would incur a huge cost. Although the data resources can be reduced in the low-resource setting, it still requires 1) a certain amount of paired speech and text (dozens of minutes for TTS and dozens of hours for ASR), 2) a pronunciation lexicon, and 3) a large amount of single-speaker high-quality unpaired speech data, which still incur high data collection cost. What is more, the accuracy of the TTS and ASR models in the low-resource setting is not high enough. The purely unsupervised methods for ASR suffer from low accuracy and cannot meet the requirements of industrial deployment.
In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which supports rare languages with low data collection cost. LRSpeech aims for industrial deployment under two constraints: 1) extremely low data collection cost, and 2) high accuracy to satisfy the deployment requirements. For the first constraint, under the extremely low-resource setting shown in Table 1, LRSpeech explores the limits of data requirements. (Although we can crawl multi-speaker low-quality unpaired speech data from the web, it is hard to crawl single-speaker high-quality unpaired speech data; the latter therefore has the same collection cost (recorded by humans) as single-speaker high-quality paired data.) Specifically, LRSpeech reduces the data requirements by
1) using as little single-speaker high-quality paired data as possible (several minutes), 2) using a small amount of multi-speaker low-quality paired data (several hours), 3) using slightly more multi-speaker low-quality unpaired speech data (dozens of hours), 4) not using single-speaker high-quality unpaired data, and 5) not using a pronunciation lexicon but directly taking characters as the input of TTS and the output of ASR.

For the second constraint, LRSpeech leverages several key techniques, including transfer learning from rich-resource languages, iterative accuracy boosting between TTS and ASR through dual transformation, and knowledge distillation to further refine the TTS and ASR models for better accuracy. Specifically, LRSpeech consists of a three-stage pipeline:

• We first pre-train both the TTS and ASR models on rich-resource languages with plenty of paired data, which learns the alignment capability between speech and text and benefits alignment learning on low-resource languages.
• We further leverage dual transformation between TTS and ASR to iteratively boost the accuracy of each other with unpaired speech and text data.
• Furthermore, we leverage knowledge distillation with unpaired speech and text data to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices.
Next, we introduce the extremely low data cost and the promising accuracy achieved by LRSpeech. According to [4, 14, 15, 38, 43], the pronunciation lexicon, single-speaker high-quality paired data, and single-speaker high-quality unpaired speech data require much higher collection cost than other data such as multi-speaker low-quality unpaired speech data and unpaired text, since the latter can be crawled from the web. Accordingly, compared to the low-resource setting in Table 1, LRSpeech 1) removes the pronunciation lexicon, 2) reduces the single-speaker high-quality paired data by an order of magnitude, 3) removes the single-speaker high-quality unpaired speech data, 4) also reduces the multi-speaker low-quality paired data by an order of magnitude, 5) similarly leverages multi-speaker low-quality unpaired speech, and 6) additionally leverages paired data from rich-resource languages, which incurs no additional cost since such data are already available in the commercialized speech service. Therefore, LRSpeech can greatly reduce the data collection cost for TTS and ASR.
[Figure 1 shows the three stages — pre-training/fine-tuning, dual transformation, and knowledge distillation — over the TTS (single-speaker) and ASR (multi-speaker) models, with data types: paired data from rich-resource languages; paired data (single-speaker high-quality, multi-speaker low-quality); unpaired data (multi-speaker low-quality); synthesized data.]
Figure 1: The three-stage pipeline of LRSpeech.
To verify the effectiveness of LRSpeech under the extremely low-resource setting, we first conduct comprehensive experimental studies on English, and then verify it on a truly low-resource language, Lithuanian, which is intended for product deployment. For TTS, LRSpeech achieves a 98.08% intelligibility rate and a 3.57 MOS score, with a 0.48 gap to the ground-truth recordings, satisfying the online deployment requirements. For ASR, LRSpeech achieves 28.82% WER and 14.65% CER, demonstrating great potential under the extremely low-resource setting. Furthermore, we conduct ablation studies to verify the effectiveness of each component in LRSpeech, and analyze the accuracy of LRSpeech under different data settings, which provides valuable insights for industrial deployment. Finally, we apply LRSpeech to Lithuanian, where it also meets the online requirements for TTS and achieves promising results on ASR. We are currently deploying LRSpeech to a commercialized speech service to support TTS for rare languages.

2 LRSPEECH

In this section, we introduce the details of LRSpeech for extremely low-resource speech synthesis and recognition. We first give an overview of LRSpeech, and then introduce the formulation of TTS and ASR. We further introduce each component of LRSpeech respectively, and finally describe the model structure of LRSpeech.
To ensure the accuracy of the TTS and ASR models under extremely low-resource scenarios, we design a three-stage pipeline for LRSpeech, as shown in Figure 1:

• Pre-training and fine-tuning. We pre-train both the TTS and ASR models on rich-resource languages and then fine-tune them on low-resource languages. Leveraging rich-resource languages in LRSpeech is based on two considerations: 1) a large amount of paired data on rich-resource languages is already available in the commercialized speech service, and 2) the alignment capability between speech and text in rich-resource languages can benefit alignment learning in low-resource languages, due to the pronunciation similarity between human languages [42].
• Dual transformation. Considering the dual nature of TTS and ASR, we further leverage dual transformation [31] to boost the accuracy of each with unpaired speech and text data.
• Knowledge distillation. To further improve the accuracy of TTS and ASR and facilitate online deployment, we leverage knowledge distillation [18, 37] to synthesize paired data to train better TTS and ASR models.

(According to the requirements of a commercialized cloud speech service, the intelligibility rate should be higher than 98%, the MOS score should be higher than 3.5, and the MOS gap to the ground-truth recordings should be less than 0.5.)
TTS and ASR are usually formulated as sequence-to-sequence problems [6, 41]. Denote a text and speech sequence pair as $(x, y) \in D$, where $D$ is the paired text and speech corpus. Each element in the text sequence $x$ represents a phoneme or character, while each element in the speech sequence $y$ represents a frame of speech. To learn the TTS model $\theta$, a mean squared error loss is used:

$\mathcal{L}(\theta; D) = \sum_{(x,y)\in D} (y - f(x;\theta))^2$.  (1)

To learn the ASR model $\phi$, a negative log-likelihood loss is used:

$\mathcal{L}(\phi; D) = -\sum_{(y,x)\in D} \log P(x \mid y; \phi)$.  (2)

TTS and ASR models can be developed based on an encoder-attention-decoder framework [3, 25, 40], where the encoder transforms the source sequence into a set of hidden representations, and the decoder generates the target sequence autoregressively based on the source hidden representations obtained through an attention mechanism [3].

We introduce some notation for the data used in LRSpeech. Denote D_rich_tts as the high-quality TTS paired data in rich-resource languages, D_rich_asr as the low-quality ASR paired data in rich-resource languages, D_h as the single-speaker high-quality paired data of the target speaker, and D_l as the multi-speaker low-quality paired data. Denote X_u as the unpaired text data and Y_u as the multi-speaker low-quality unpaired speech data. Next, we introduce each component of the LRSpeech pipeline in the following subsections.

The key to the conversion between text and speech is to learn the alignment between the character/phoneme representations (text) and the acoustic features (speech). Since people from different nations speaking different languages share similar vocal organs and thus similar pronunciations, the ability of alignment learning in one language can help the alignment in another language [19, 42].
This motivates us to transfer the TTS and ASR models trained in rich-resource languages to low-resource languages, considering that there are plenty of paired speech and text data for both TTS and ASR in rich-resource languages.
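As a concrete illustration, the two training objectives in Equations (1) and (2) can be sketched with NumPy on toy shapes. This is an illustrative sketch only (the function names and shapes are ours, not from the paper), with `f` standing in for an arbitrary mel-spectrogram predictor:

```python
import numpy as np

def tts_loss(y_true, y_pred):
    """Squared error between target and predicted mel-spectrograms (Eq. 1),
    averaged over frames and mel bins for readability."""
    return float(np.mean((y_true - y_pred) ** 2))

def asr_loss(log_probs, target_ids):
    """Negative log-likelihood of the target token sequence (Eq. 2).
    log_probs: (seq_len, vocab_size) per-step log-probabilities."""
    return float(-np.sum(log_probs[np.arange(len(target_ids)), target_ids]))

# Toy example: 4 speech frames with 80 mel bins; 3-token target over a 5-symbol vocab.
y_true = np.zeros((4, 80))
y_pred = np.full((4, 80), 0.1)
mse = tts_loss(y_true, y_pred)            # 0.01
probs = np.full((3, 5), 0.2)              # uniform output distribution
nll = asr_loss(np.log(probs), [0, 3, 1])  # -3 * log(0.2)
```

In practice both losses are computed over mini-batches by the actual Transformer models; the sketch only shows the per-pair objective.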
Pre-Training.
We pre-train the TTS model θ with the data corpus D_rich_tts following Equation (1), and pre-train the ASR model φ with D_rich_asr following Equation (2).

Fine-Tuning. Considering that the rich-resource and low-resource languages have different phoneme/character vocabularies and speakers, we initialize the TTS and ASR models on the low-resource language with all the pre-trained parameters except the phoneme/character and speaker embeddings in TTS and the phoneme/character embeddings in ASR, respectively. We then fine-tune the TTS model θ and the ASR model φ on the concatenation of the corpora D_h and D_l, following Equation (1) and Equation (2) respectively. During fine-tuning, we first fine-tune the character embeddings and speaker embeddings following the practice in [1, 7], and then fine-tune all parameters. This helps prevent the TTS and ASR models from overfitting on the limited paired data of the low-resource language.

TTS and ASR are two dual tasks, and their dual nature can be explored to boost the accuracy of each other, especially in low-resource scenarios. Therefore, we leverage dual transformation [31] between TTS and ASR to improve the ability to transform between text and speech. Dual transformation shares similar ideas with back-translation [34] in machine translation and cycle-consistency [46] in image translation, which are effective ways to leverage unlabeled data in the speech, text, and image domains respectively. Dual transformation works as follows:

• For each unpaired text sequence x ∈ X_u, we transform it into a speech sequence using the TTS model θ, and construct a pseudo corpus D(X_u) to train the ASR model φ following Equation (2).
• For each unpaired speech sequence y ∈ Y_u, we transform it into a text sequence using the ASR model φ, and construct a pseudo corpus D(Y_u) to train the TTS model θ following Equation (1).

During training, we run the dual transformation process on the fly, which means the pseudo corpora are updated in each iteration and each model can benefit from the newest data generated by the other. Next, we introduce some specific designs in dual transformation to support multi-speaker TTS and ASR.

Multi-Speaker TTS Synthesis.
Different from [23, 31], which only support a single speaker in both the TTS and ASR models, we support multi-speaker TTS and ASR in the dual transformation stage. Specifically, we randomly choose a speaker ID and synthesize speech of this speaker given a text sequence, which benefits the training of the multi-speaker ASR model. In turn, the ASR model transforms multi-speaker speech into text, which helps the training of the multi-speaker TTS model.
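One on-the-fly dual transformation round with random speaker selection can be sketched as follows. The model calls here are hypothetical stubs standing in for the actual TTS and ASR Transformers, and all names are illustrative:

```python
import random

# Illustrative stubs: in LRSpeech these would be the Transformer TTS and ASR models.
def tts_synthesize(text, speaker_id):
    return f"speech({text},spk{speaker_id})"   # placeholder for a mel-spectrogram

def asr_recognize(speech):
    return f"text({speech})"                   # placeholder for a transcript

unpaired_text = ["hello world", "good morning"]
unpaired_speech = ["wav_a", "wav_b", "wav_c"]
speakers = [0, 1, 2]

def dual_transformation_round():
    """One on-the-fly round: each model pseudo-labels the other's unpaired data."""
    # Text -> (pseudo speech, text) pairs for ASR training, with a random speaker.
    asr_corpus = [(tts_synthesize(x, random.choice(speakers)), x)
                  for x in unpaired_text]
    # Speech -> (pseudo text, speech) pairs for TTS training.
    tts_corpus = [(asr_recognize(y), y) for y in unpaired_speech]
    return asr_corpus, tts_corpus

asr_corpus, tts_corpus = dual_transformation_round()
```

Because the pseudo corpora are rebuilt each round, both models always train on the other's most recent outputs, which is the on-the-fly property described above.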
Leveraging Unpaired Speech of Unseen Speakers.
Since multi-speaker low-quality unpaired speech data are much easier to obtain than high-quality single-speaker unpaired speech data, enabling the TTS and ASR models to utilize unseen speakers' unpaired speech in dual transformation makes our system more robust and scalable. Compared to ASR, it is more challenging for TTS to synthesize the voice of unseen speakers. To this end, we split dual transformation into two phases: 1) in the first phase, we only use the unpaired speech whose speakers are seen in the training data; 2) in the second phase, we also add the unpaired speech whose speakers are unseen in the training data. As the ASR model naturally supports unseen speakers, the resulting pseudo paired data can be used to train the TTS model and equip it with the capability to synthesize speech of new speakers. (The ASR model does not need speaker embeddings, and the target embeddings and the softmax matrix are usually shared in many sequence generation tasks for better accuracy [29].)
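The two-phase schedule above can be sketched minimally as follows. The phase boundary is a hypothetical training-step threshold of our own choosing, not a value from the paper:

```python
# Hypothetical schedule for which unpaired speech enters dual transformation.
seen_speech = ["wav_s1", "wav_s2"]      # speakers also appear in the paired data
unseen_speech = ["wav_u1", "wav_u2"]    # speakers never seen with transcripts

def unpaired_speech_for_step(step, phase_boundary=1000):
    """Phase 1 uses only seen-speaker speech; phase 2 adds unseen speakers,
    relying on ASR's natural generalization to pseudo-label their speech."""
    if step < phase_boundary:
        return seen_speech
    return seen_speech + unseen_speech
```

The point of the schedule is that the TTS model first stabilizes on speakers it has paired data for, and only then receives ASR-labeled speech of new speakers.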
The TTS and ASR models we have so far are still far from ready for online deployment after dual transformation. There are several issues to address: 1) while the TTS model can support multiple speakers, the speech quality of our target speaker is not good enough and needs further improvement; 2) the speech synthesized by the TTS model still has word skipping and repeating issues; 3) the accuracy of the ASR model needs to be further improved. Therefore, we further leverage knowledge distillation [18, 37], which generates target sequences given source sequences as input to construct a pseudo corpus, to customize the TTS and ASR models for better accuracy.
The knowledge distillation process for TTS consists of three steps:

• For each unpaired text sequence x ∈ X_u, we synthesize the corresponding speech of the target speaker using the TTS model θ, and construct a single-speaker pseudo corpus D(X_u).
• We filter out the utterances in the pseudo corpus D(X_u) whose synthesized speech has word skipping and repeating issues.
• We use the filtered corpus D(X_u) to train a new TTS model dedicated to the target speaker following Equation (1).

In the first step, the speech in the pseudo corpus D(X_u) is single-speaker, which is different from the multi-speaker pseudo corpus D(X_u) in Section 2.4. The TTS model (obtained by dual transformation) used in the first step has word skipping and repeating issues. Therefore, in the second step, we filter out the synthesized speech that has these issues, so that the distilled model can be trained on accurate text and speech pairs. In this way, the word skipping and repeating problem can be largely reduced. We filter the synthesized speech based on two metrics: word coverage ratio (WCR) and attention diagonal ratio (ADR).
We observe that word skipping happens when a word receives small or no attention weights from the target mel-spectrograms. Therefore, we propose the word coverage ratio (WCR):
$\mathrm{WCR} = \min_{i \in [1, N]} \big\{ \max_{t \in [1, T_i]} \max_{s \in [1, S]} A_{t,s} \big\}$,  (3)

where N is the number of words in a sentence, T_i is the number of characters in the i-th word, S is the number of frames of the target mel-spectrograms, and A_{t,s} denotes the element in the t-th row and s-th column of the attention weight matrix A. We obtain the attention weight matrix A from the encoder-decoder attention weights in the TTS model and average over the different layers and attention heads. A high WCR indicates that all words in a sentence receive high attention weights from the target speech frames, and thus word skipping is less likely.

[Figure 2: The Transformer-based TTS and ASR models in LRSpeech. Panels: (a) TTS model, (b) ASR model, (c) speaker module, (d) encoder (left) and decoder (right), (e) input/output modules for speech/text.]
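The WCR of Equation (3) can be computed directly from the averaged attention matrix. A toy NumPy sketch (the word-span representation is our own convention for mapping words to their character rows):

```python
import numpy as np

def word_coverage_ratio(A, word_spans):
    """WCR (Eq. 3): A is the (num_chars x num_frames) encoder-decoder attention
    matrix averaged over layers and heads; word_spans lists the (start, end)
    character-row indices of each word (end exclusive)."""
    per_word = []
    for start, end in word_spans:
        # Best attention any character of this word receives from any frame.
        per_word.append(A[start:end].max())
    return float(min(per_word))

# Toy example: 4 characters, 5 frames, two words covering chars [0,2) and [2,4).
A = np.array([[0.9, 0.1, 0.0, 0.0, 0.0],
              [0.1, 0.8, 0.1, 0.0, 0.0],
              [0.0, 0.1, 0.3, 0.1, 0.0],   # second word is weakly attended
              [0.0, 0.0, 0.1, 0.2, 0.1]])
wcr = word_coverage_ratio(A, [(0, 2), (2, 4)])  # -> 0.3, hinting at word skipping
```

A low WCR flags the sentence for filtering during knowledge distillation (the experiments use a threshold of 0.7).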
Attention Diagonal Ratio.
As demonstrated by previous works [30, 41], the attention alignments between text and speech are monotonic and diagonal. When the synthesized speech has word skipping and repeating issues, or is completely crashed, the attention alignments deviate from the diagonal. We define the attention diagonal ratio (ADR) as:
$\mathrm{ADR} = \frac{\sum_{t=1}^{T} \sum_{s=kt-b}^{kt+b} A_{t,s}}{\sum_{t=1}^{T} \sum_{s=1}^{S} A_{t,s}}$,  (4)

where T and S are the numbers of characters and speech frames in a text and speech pair, k = S/T, and b is a hyperparameter that determines the width of the diagonal. ADR measures how much attention lies in the diagonal area with a width of b. A higher ADR indicates that the synthesized speech has good attention alignment with the text and thus has fewer word skipping, repeating, or crashing issues.

Since both the unpaired text and the low-quality multi-speaker unpaired speech are available for ASR, we leverage both the ASR and TTS models to synthesize data during knowledge distillation for ASR:

• For each unpaired speech sequence y ∈ Y_u, we generate the corresponding text using the ASR model φ, and construct a pseudo corpus D(Y_u).
• For each unpaired text sequence x ∈ X_u, we synthesize the corresponding speech of multiple speakers using the TTS model θ, and construct a pseudo corpus D(X_u).
• We combine the above pseudo corpora D(Y_u) and D(X_u), as well as the single-speaker high-quality paired data D_h and the multi-speaker low-quality paired data D_l, to train a new ASR model following Equation (2).

Similar to the knowledge distillation for TTS, we also leverage a large amount of unpaired text to synthesize speech. To further improve the ASR accuracy, we use SpecAugment [27] to add noise to the input speech, which acts as data augmentation.

In this section, we introduce the model structure of LRSpeech, as shown in Figure 2.
Transformer Model.
Both the TTS and ASR models adopt the Transformer-based encoder-attention-decoder structure [40]. One difference from the original Transformer model is that we replace the feed-forward network with a one-dimensional convolutional network following [31], in order to better capture the dependencies in a long speech sequence.
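This convolutional feed-forward substitution can be sketched roughly as follows. The NumPy implementation below is a toy illustration with 'same' padding; kernel sizes, shapes, and the single-ReLU layout are our assumptions, not the paper's exact configuration:

```python
import numpy as np

def conv1d_ffn(x, w1, w2):
    """Transformer FFN variant sketched here: two 1-D convolutions over the
    time axis instead of position-wise dense layers, so each position mixes
    information from its temporal neighbors. x: (time, channels);
    w1, w2: (kernel, in_channels, out_channels)."""
    def conv1d(x, w):
        k = w.shape[0]
        pad = k // 2                        # 'same' zero padding
        xp = np.pad(x, ((pad, pad), (0, 0)))
        return np.stack([
            sum(xp[t + i] @ w[i] for i in range(k)) for t in range(x.shape[0])
        ])
    h = np.maximum(conv1d(x, w1), 0.0)      # ReLU nonlinearity
    return conv1d(h, w2)

out = conv1d_ffn(np.ones((5, 3)), np.ones((3, 3, 4)), np.ones((3, 4, 2)))
```

With a kernel size of 1 this reduces to the standard position-wise feed-forward network; a wider kernel is what gives the model local context along long speech sequences.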
Input/Output Module.
To enable the Transformer model to support ASR and TTS, we need different input and output modules for speech and text [31]. For the TTS model: 1) the input module of the encoder is a character/phoneme embedding lookup table, which converts character/phoneme IDs into embeddings; 2) the input module of the decoder is a speech pre-net, which consists of multiple dense layers that transform each speech frame non-linearly; 3) the output module of the decoder consists of a linear layer to convert hidden representations into mel-spectrograms, and a stop linear layer with a sigmoid function to predict whether the current step should stop or not. For the ASR model: 1) the input module of the encoder consists of multiple convolutional layers, which reduce the length of the speech sequence; 2) the input module of the decoder is a character/phoneme embedding lookup table; 3) the output module of the decoder consists of a linear layer and a softmax function, where the linear layer shares the same weights with the character/phoneme embedding lookup table in the decoder input module.
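The weight sharing between the ASR decoder's output projection and its input embedding table (weight tying) can be sketched as follows, with illustrative shapes:

```python
import numpy as np

def decoder_output_probs(hidden, char_embedding):
    """ASR decoder output module sketch: the output projection reuses the
    decoder-input character embedding table as its weight matrix (weight
    tying), followed by a softmax over the vocabulary."""
    logits = hidden @ char_embedding.T                      # (seq_len, vocab)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True)) # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

vocab_size, hidden_dim = 6, 4
emb = np.random.RandomState(0).randn(vocab_size, hidden_dim)
probs = decoder_output_probs(np.random.RandomState(1).randn(3, hidden_dim), emb)
```

Tying the two matrices halves the vocabulary-related parameters, which is particularly helpful when the low-resource language provides little paired data to fit them.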
Speaker Module.
The multi-speaker TTS model relies on a speaker embedding module to differentiate multiple speakers. We add a speaker embedding vector to both the encoder output and the decoder input (after the decoder input module). As shown in Figure 2 (c), we convert the speaker ID into a speaker embedding vector using an embedding lookup table, and then apply a linear transformation with a softsign function x = x/(1 + |x|). We further concatenate the obtained vector with the encoder output or decoder input, and use another linear layer to reduce the hidden dimension back to the original dimension of the encoder output or decoder input.

3 EXPERIMENTS

In this section, we conduct experiments to evaluate LRSpeech for extremely low-resource TTS and ASR. We first describe the experiment settings, then show the results of our method, and finally conduct some analyses of LRSpeech.

Notation   | Quality | Type     | Dataset          | Size
-----------|---------|----------|------------------|------------------
D_h        | High    | Paired   | LJSpeech [17]    | 50 (5 minutes)
D_l        | Low     | Paired   | LibriSpeech [26] | 1000 (3.5 hours)
Y_u_seen   | Low     | Unpaired | LibriSpeech      | 2000 (7 hours)
Y_u_unseen | Low     | Unpaired | LibriSpeech      | 5000 (14 hours)
X_u        | /       | Unpaired | news-crawl       | 20000

Table 2: The data used for the low-resource language (English). D_h represents target-speaker high-quality paired data. D_l represents multi-speaker low-quality paired data (50 speakers). Y_u_seen represents multi-speaker low-quality unpaired speech data (50 speakers) whose speakers are seen in the paired training data. Y_u_unseen represents multi-speaker low-quality unpaired speech data (50 speakers) whose speakers are unseen in the paired training data. X_u represents unpaired text data.

We describe the datasets used in the rich-resource and low-resource languages respectively:

• We select Mandarin Chinese as the rich-resource language. The TTS corpus D_rich_tts contains 10000 paired speech and text samples (12 hours) of a single speaker from Data Baker. The ASR corpus D_rich_asr is from AIShell [5], which contains about 120000 paired speech and text samples (178 hours) from 400 Mandarin Chinese speakers.
• We select English as the low-resource language for experimental development. The details of the data resources used are shown in Table 2. More information about these datasets is given in Section A.1 and Table 6.
We use a 6-layer encoder and a 6-layer decoder for both the TTS and ASR models. The hidden size, character embedding size, and speaker embedding size are all set to 384, and the number of attention heads is set to 4. During dual transformation, we up-sample the paired data to make its size roughly the same as that of the unpaired data. During knowledge distillation, we filter out the synthesized speech with WCR less than 0.7 or ADR less than 0.7. The width of the diagonal (b) in ADR is 10. More model training details are introduced in Section A.2.

The TTS model uses Parallel WaveGAN [44] as the vocoder to synthesize speech. To train Parallel WaveGAN, we combine the speech data in the Mandarin Chinese TTS corpus D_rich_tts with the speech data in the English target-speaker high-quality corpus D_h. We up-sample the speech data in D_h to make its amount roughly the same as that of the speech data in D_rich_tts.

For evaluation, we use MOS (mean opinion score) and IR (intelligibility rate) for TTS, and WER (word error rate) and CER (character error rate) for ASR. For TTS, we select English text sentences from the news-crawl dataset (http://data.statmt.org/news-crawl) to synthesize speech for evaluation. We randomly select 200 sentences for the IR test and 20 sentences for the MOS test, following the practice in [30, 41]. (The sentences for the IR and MOS tests, audio samples, and test reports can be found at https://speechresearch.github.io/lrspeech.) Each speech sample is listened to by at least 5 testers for the IR test and 20 testers for the MOS test, all of whom are native English speakers. For ASR, we measure the WER and CER scores on the LibriSpeech "test-clean" set. The test sentences and speech for TTS and ASR do not appear in the training corpus.
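For reference, WER and CER are both edit-distance-based metrics; a small self-contained implementation (a standard Levenshtein distance, written by us for illustration, not code from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref_text, hyp_text):
    """Word error rate: word-level edit distance over the reference length."""
    ref = ref_text.split()
    return edit_distance(ref, hyp_text.split()) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate: char-level edit distance over the reference length."""
    return edit_distance(ref_text, hyp_text) / len(ref_text)
```

Since the denominator is the reference length while the hypothesis can be arbitrarily long, both metrics can exceed 100%, which is why the weak baselines later report WER/CER above 100%.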
Setting               | IR (%) (TTS) | MOS (TTS) | WER (%) (ASR) | CER (%) (ASR)
----------------------|--------------|-----------|---------------|--------------
GT (Parallel WaveGAN) | -            | 3.88      | -             | -
GT                    | -            | 4.05      | -             | -
Table 3: The accuracy comparisons for TTS and ASR. PF, DT, and KD are the three components of LRSpeech, where PF represents pre-training and fine-tuning, DT represents dual transformation, and KD represents knowledge distillation. GT is the ground truth, and GT (Parallel WaveGAN) is the audio generated with Parallel WaveGAN from the ground-truth mel-spectrogram.
We compare LRSpeech with baselines that purely leverage the limited paired data for training, including 1) a baseline trained on D_h, and 2) a baseline trained on both D_h and D_l. We have several observations:

• Both baselines cannot synthesize reasonable speech, and the corresponding IR and MOS are marked as "/". The WER and CER for ASR are also larger than 100% (the detailed reasons why WER and CER can exceed 100% can be found in Section A.3), which demonstrates the poor quality when only using the limited paired data D_h and D_l for TTS and ASR training.
• Based on the baselines, adding pre-training and fine-tuning (PF) improves both TTS and ASR.
• However, the paired data in both rich-resource and low-resource languages cannot guarantee high accuracy, and thus we further leverage the unpaired speech corpora Y_u_seen and Y_u_unseen and the unpaired text corpus X_u through dual transformation (DT). DT greatly improves IR to 96.70% and MOS to 3.28 on TTS, as well as WER to 38.94% and CER to 19.99% on ASR. The unpaired text and speech samples cover more words and pronunciations, as well as more speech prosody, which helps the synthesized speech in TTS achieve higher intelligibility (IR) and naturalness (MOS), and also helps ASR achieve better WER and CER.
Figure 3: The TTS attention alignments (where the column and row represent the source text and target speech respectively) ofan example chosen from the test set. The source text is “the paper’s author is alistair evans of monash university in australia”.
Setting                 | WCR  | ADR (%)
------------------------|------|--------
PF                      | 0.65 | 97.85
PF + DT                 | 0.66 | 98.37
PF + DT + KD (LRSpeech) | 0.72 | 98.81
Table 4: The word coverage ratio (WCR) and attention diagonal ratio (ADR) scores of the TTS model under different settings.

• Furthermore, adding knowledge distillation (KD) brings improvements of 1.38% IR, 0.29 MOS, 10.12% WER, and 5.34% CER. We also list the speech quality in terms of MOS for the ground-truth recordings (GT) and for the speech synthesized from the ground-truth mel-spectrograms by the Parallel WaveGAN vocoder (GT (Parallel WaveGAN)) in Table 3 as upper bounds for reference. It can be seen that LRSpeech achieves a MOS score of 3.57, with a gap to the ground-truth recordings of less than 0.5, demonstrating the high quality of the synthesized speech.
• There are also some related works focusing on low-resource TTS and ASR, such as Speech Chain [39], Almost Unsup [31], and SeqRQ-AE [23]. However, these methods require much more data to build systems and thus cannot achieve reasonable accuracy in the extremely low-resource setting. For example, [31] requires a pronunciation lexicon to convert the character sequence into a phoneme sequence, and dozens of hours of single-speaker high-quality unpaired speech data to improve accuracy, which are costly and not available in the extremely low-resource setting. As a result, [31] cannot synthesize reasonable speech for TTS and achieves high WER according to our preliminary experiments.

In summary, LRSpeech achieves an IR score of 98.08% and a MOS score of 3.57 for TTS with extremely low data cost, which meets the online requirements for deploying the TTS system. Besides, it also achieves a WER score of 28.82% and a CER score of 14.65%, which is highly competitive considering the data resources used, and shows great potential for further online deployment.
Since the quality of the attention alignments between the encoder (text) and decoder (speech) is a good indicator of TTS model performance, we analyze the word coverage ratio (WCR) and attention diagonal ratio (ADR) described in Section 2.5.1 and show their changes across settings in Table 4. We also show the attention alignments of a sample case from each setting in Figure 3. We have several observations:
• As can be seen from Figure 3 (a) and (b), both baseline settings fail to learn reasonable attention alignments, and directly adding D_l on top of D_h cannot help the TTS model but makes it worse, due to the poor alignment quality of the baseline.
• After adding pre-training and fine-tuning (PF), the attention alignments in Figure 3 (c) become diagonal, which demonstrates that TTS model pre-training on rich-resource languages helps build reasonable alignments between text and speech in low-resource languages. Although the synthesized speech can be roughly understood by humans, it still has many issues such as word skipping and repeating. For example, the word "in" in the red box of Figure 3 (c) has low attention weight (WCR), and thus the speech skips the word "in".
• Further adding dual transformation (DT) improves WCR and ADR, and also alleviates the word skipping and repeating issues. Accordingly, the attention alignments in Figure 3 (d) are better.
• Since some word skipping and repeating issues still exist after DT, we filter the synthesized speech according to WCR and ADR during knowledge distillation (KD). The final WCR is further improved to 0.72 and ADR to 98.81%, as shown in Table 4, and the attention alignments in Figure 3 (e) are much clearer.
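The WCR and ADR metrics above, and the KD filtering step that uses them, can be computed from a decoder-to-encoder attention matrix. The exact definitions live in Section 2.5.1 and are not reproduced in this appendix, so the function bodies, the diagonal band width, and the filter thresholds below are all assumptions, shown as a minimal sketch:

```python
import numpy as np

def attention_diagonal_ratio(attn, band=4):
    """Share of attention mass within `band` encoder positions of the
    ideal diagonal. attn: [T_dec, T_enc] matrix whose rows sum to 1."""
    t_dec, t_enc = attn.shape
    mass = 0.0
    for t in range(t_dec):
        center = t * t_enc / t_dec              # ideal diagonal position
        lo = max(0, int(center) - band)
        hi = min(t_enc, int(center) + band + 1)
        mass += attn[t, lo:hi].sum()
    return mass / attn.sum()

def word_coverage_ratio(attn, word_spans):
    """For each word (a [start, end) span of encoder positions), take the
    largest attention weight any decoder step assigns to it; return the
    minimum over words, so a skipped word drags the score down."""
    return min(attn[:, s:e].sum(axis=1).max() for s, e in word_spans)

def keep_for_distillation(attn, word_spans, wcr_thresh=0.5, adr_thresh=0.95):
    """Filter a synthesized utterance by WCR and ADR before using it as a
    knowledge-distillation target (thresholds are illustrative)."""
    return (word_coverage_ratio(attn, word_spans) >= wcr_thresh
            and attention_diagonal_ratio(attn) >= adr_thresh)
```

For a clean diagonal alignment both scores approach 1; an utterance with a skipped word (a span no decoder step attends to strongly) gets a low WCR and is dropped before distillation.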
There are some questions to further investigate in LRSpeech:
• Low-quality speech data may bring noise to the TTS model. How does the accuracy change when using different scales of low-quality paired data D_l?
• As described in Section 2.4, supporting LRSpeech training with unpaired speech data from seen and especially unseen speakers (Y_u^seen and Y_u^unseen) is critical for a robust and scalable system. Can the accuracy be improved by using Y_u^seen and Y_u^unseen?
• How does the accuracy change when using different scales of unpaired text data X_u to synthesize speech during knowledge distillation?
Figure 4: Analyses of LRSpeech with different training data. (a) Varying the data scale of D_l. (b) Results using Y_u^seen and Y_u^unseen. (c) Varying the data X_u for TTS knowledge distillation. (d) Varying the data X_u for ASR knowledge distillation.
We conduct experimental analyses to answer these questions. For the first two questions, we analyze LRSpeech without knowledge distillation, and for the third question, we analyze the knowledge distillation stage. The results are shown in Figure 4. We have several observations:
• As shown in Figure 4 (a), we vary the size of D_l below and above the default setting (1000 paired data, 3.5 hours) used in LRSpeech, up to 5× of it, and find that more low-quality paired data results in better accuracy for TTS.
• As shown in Figure 4 (b), we add Y_u^seen and Y_u^unseen respectively, and find that both of them can boost the accuracy of TTS and ASR, which demonstrates the ability of LRSpeech to utilize unpaired speech from seen and especially unseen speakers.
• As shown in Figure 4 (c), we vary the number of synthesized speech utterances for TTS during knowledge distillation at several fractions of the default setting (20000 synthesized speech data), and find that more synthesized speech data results in better accuracy.
• During knowledge distillation for ASR, we use two kinds of data: 1) the realistic speech data (8050 data in total), which contains D_h, D_l and the pseudo paired data distilled from Y_u^seen and Y_u^unseen by the ASR model; 2) the synthesized speech data, i.e., the pseudo paired data distilled from X_u by the TTS model. In Figure 4 (d) we vary the amount of synthesized speech data from X_u (the second type) from 0× up to several times the amount of realistic speech data (the first type). It can be seen that increasing the ratio of synthesized speech data achieves better results.
All the observations above demonstrate the effectiveness and scalability of LRSpeech when leveraging more low-cost data resources.
Data Setting.
The data setting in Lithuanian is similar to that in English. We select a subset of the Liepa corpus [20] and only use characters as the raw texts. D_h contains 50 paired text and speech data (3.7 minutes), D_l contains 1000 paired text and speech data (1.29 hours), Y_u^seen contains 4000 unpaired speech data (5.1 hours), Y_u^unseen contains 5000 unpaired speech data (6.7 hours), and X_u contains 20000 unpaired texts. We select Lithuanian text sentences from the news-crawl dataset as the test set for TTS. We randomly select 200 sentences for the IR test and 20 sentences for the MOS test, following the same test configuration as in English. Each audio is listened to by at least 5 testers for the IR test and 20 testers for the MOS test, who are all native Lithuanian speakers. For ASR evaluation, we randomly select 1000 speech data (1.3 hours) from 197 speakers in the Liepa corpus to measure the WER and CER scores. The test sentences and speech for TTS and ASR do not appear in the training corpus. (The audio samples and complete experimental results on IR and MOS for TTS, and WER and CER for ASR, can be found at https://speechresearch.github.io/lrspeech.)
Results.
As shown in Table 5, the TTS model on Lithuanian achieves an IR score of 98.60% and a MOS score of 3.65, with a MOS gap to the ground-truth recordings of less than 0.5, which also meets the online deployment requirement. The ASR model achieves a CER score of 10.30% and a WER score of 17.04%, which shows great potential under this low-resource setting.

Setting                  IR (%)   MOS    WER (%)   CER (%)
Lithuanian               98.60    3.65   17.04     10.30
GT (Parallel WaveGAN)    -        3.89   -         -
GT                       -        4.01   -         -

Table 5: The results of LRSpeech on TTS and ASR for Lithuanian.
In this paper, we developed LRSpeech, a speech synthesis and recognition system under the extremely low-resource setting, which supports rare languages with low data costs. We proposed pre-training and fine-tuning, dual transformation, and knowledge distillation in LRSpeech to leverage few paired speech and text data and slightly more multi-speaker low-quality unpaired speech data to improve the accuracy of the TTS and ASR models. Experiments on English and Lithuanian show that LRSpeech can meet the requirements of online deployment for TTS and achieve very promising results for ASR under the extremely low-resource setting, demonstrating the effectiveness of LRSpeech for rare languages.
Currently we are deploying LRSpeech to a large commercialized cloud TTS service. In the future, we will further improve the accuracy of ASR in LRSpeech and also deploy it to this commercialized cloud service.
Jin Xu and Jian Li are supported in part by the National Natural Science Foundation of China Grants 61822203, 61772297, 61632016 and 61761146003, and by the Zhongguancun Haihua Institute for Frontier Information Technology, the Turing AI Institute of Nanjing, and the Xi'an Institute for Interdisciplinary Information Core Technology.
REFERENCES
[1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856 (2019).
[2] Alexei Baevski, Michael Auli, and Abdelrahman Mohamed. 2019. Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912 (2019).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[4] Antoine Bruguier, Anton Bakhtin, and Dravyansh Sharma. 2018. Dictionary Augmented Sequence-to-Sequence Neural Network for Grapheme to Phoneme Prediction. Proc. Interspeech 2018 (2018), 3733–3737.
[5] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. IEEE, 1–5.
[6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 4960–4964.
[7] Tian-Yi Chen, Lan Zhang, Shi-Cong Zhang, Zi-Long Li, and Bai-Chuan Huang. 2019. Extensible cross-modal hashing. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2109–2115.
[8] Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, and Hung-yi Lee. 2018. Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only. arXiv preprint arXiv:1803.10952 (2018).
[9] Yuan-Jui Chen, Tao Tu, Cheng-chieh Yeh, and Hung-Yi Lee. 2019. End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning. Proc. Interspeech 2019 (2019), 2075–2079.
[10] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. IEEE, 4774–4778.
[11] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. End-to-end continuous speech recognition using attention-based recurrent nn: First results. In NIPS 2014 Workshop on Deep Learning, December 2014.
[12] Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and RJ Skerry-Ryan. 2019. Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6940–6944.
[13] Erica Cooper, Emily Li, and Julia Hirschberg. 2018. Characteristics of Text-to-Speech and Other Corpora. Proceedings of Speech Prosody 2018 (2018).
[14] Erica Lindsay Cooper. 2019. Text-to-speech synthesis using found data for low-resource languages. Ph.D. Dissertation. Columbia University.
[15] Joel Harband. 2010. Text-to-Speech Costs – Licensing and Pricing. http://elearningtech.blogspot.com/2010/11/text-to-speech-costs-licensing-and.html
[16] Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, and Jonathan Le Roux. 2019. Cycle-consistency training for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6271–6275.
[17] Keith Ito. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.
[18] Yoon Kim and Alexander M Rush. 2016. Sequence-Level Knowledge Distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1317–1327.
[19] Patricia K Kuhl, Barbara T Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson. 2008. Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B: Biological Sciences.
[20] Informatica 29, 3 (2018), 487–498.
[21] M Paul Lewis, Gary F Simons, and Charles D Fennig (eds.). 2015. Ethnologue: Languages of the World. Dallas, Texas: SIL International.
[22] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and M Zhou. 2019. Neural Speech Synthesis with Transformer Network. AAAI.
[23] Alexander H Liu, Tao Tu, Hung-yi Lee, and Lin-shan Lee. 2019. Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning. arXiv preprint arXiv:1910.12729 (2019).
[24] Da-Rong Liu, Kuan-Yu Chen, Hung-yi Lee, and Lin-shan Lee. 2018. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings. Proc. Interspeech 2018 (2018), 3748–3752.
[25] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1412–1421.
[26] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. IEEE, 5206–5210.
[27] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proc. Interspeech 2019 (2019), 2613–2617.
[28] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. 2018. Deep Voice 3: 2000-Speaker Neural Text-to-Speech. In International Conference on Learning Representations.
[29] Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 157–163.
[30] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems. 3165–3174.
[31] Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Almost Unsupervised Text to Speech and Automatic Speech Recognition. In International Conference on Machine Learning. 5410–5419.
[32] Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. 2019. Speech Recognition with Augmented Synthesized Speech. arXiv preprint arXiv:1909.11699 (2019).
[33] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. Proc. Interspeech 2019 (2019), 3465–3469.
[34] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 86–96.
[35] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. IEEE, 4779–4783.
[36] Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion. Proc. Interspeech 2019 (2019), 2115–2119.
[37] Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual Neural Machine Translation with Knowledge Distillation. In International Conference on Learning Representations. https://openreview.net/forum?id=S1gUsoR9YX
[38] Ye Kyaw Thu, Win Pa Pa, Yoshinori Sagisaka, and Naoto Iwahashi. 2016. Comparison of grapheme-to-phoneme conversion methods on a myanmar pronunciation dictionary. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016). 11–22.
[39] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Listening while speaking: Speech chain by deep learning. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 301–308.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[41] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards End-to-End Speech Synthesis. Proc. Interspeech 2017 (2017), 4006–4010.
[42] Jan Wind. 1989. The evolutionary history of the human speech organs. Studies in language origins.
[43] IEEE Transactions on Audio, Speech, and Language Processing 18, 5 (2010), 984–1004.
[44] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2019. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. arXiv preprint arXiv:1910.11480 (2019).
[45] Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, and Dong Yu. 2019. Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching. ICLR (2019).
[46] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.
A REPRODUCIBILITY
A.1 Datasets
We list the detailed information of all the datasets used in this paper in Table 6. Next, we first describe the details of the data preprocessing for speech and text data, and then describe what the "high-quality" and "low-quality" speech mentioned in this paper refer to.
Data Preprocessing.
For the speech data, we re-sample it to 16 kHz and convert the raw waveform into mel-spectrograms following Shen et al. [35], with a 50ms frame size and a 12.5ms hop size. For the text, we use text normalization rules to convert irregular words into normalized forms that are easier to pronounce, e.g., "Sep 7th" is converted into "September seventh".
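The "Sep 7th" example suggests a small rule table. The paper does not publish its normalization rules, so the mappings below are a minimal illustrative sketch, not the actual rule set:

```python
# Illustrative normalization tables; the real system would cover far more
# cases (numbers, currencies, abbreviations, etc.).
MONTHS = {"jan": "january", "feb": "february", "sep": "september",
          "oct": "october", "nov": "november", "dec": "december"}
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third",
            "7th": "seventh", "21st": "twenty first"}

def normalize(text):
    """Lowercase, strip trailing punctuation, and expand known
    abbreviations and ordinals word by word."""
    words = []
    for w in text.lower().split():
        w = w.rstrip(".,")
        words.append(MONTHS.get(w, ORDINALS.get(w, w)))
    return " ".join(words)
```

With these tables, `normalize("Sep 7th")` yields "september seventh", matching the example in the text.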
High-Quality Speech.
We use high-quality speech to refer to the speech data from TTS corpora (e.g., LJSpeech and Data Baker, as shown in Table 6), which are usually recorded in a professional recording studio with consistent characteristics such as speaking rate. Collecting high-quality speech data for TTS is typically costly [13].
Low-Quality Speech.
We use low-quality speech to refer to the speech data from ASR corpora (e.g., LibriSpeech, AIShell, and Liepa, as shown in Table 6). Compared to high-quality speech, low-quality speech usually contains noise due to the recording devices (e.g., smartphones, laptops) or the recording environment (e.g., room reverberation, traffic noise). However, low-quality speech cannot be too noisy for model training; we simply use the term "low-quality" to distinguish it from high-quality speech.
A.2 Model Configurations and Training
Both the TTS and ASR models use a 6-layer encoder and a 6-layer decoder. For both models, the hidden size and speaker ID embedding size are 384 and the number of attention heads is 4. The kernel sizes of the 1D convolutions in the 2-layer convolution network are set to 9 and 1 respectively, with input/output sizes of 384/1536 for the first layer and 1536/384 for the second layer. For the TTS model, the input module of the decoder consists of 3 fully-connected layers. The first two fully-connected layers have 64 neurons each and the third one has 384 neurons. The ReLU non-linearity is applied to the output of every fully-connected layer. We also insert 2 dropout layers between the 3 fully-connected layers, with dropout probability 0.5. The output module of the decoder is a fully-connected layer with 80 neurons. For the ASR model, the encoder contains 3 convolution layers, the first two of which are 3×3 convolutions. We use the Adam optimizer and follow the same optimizer hyperparameters and learning rate schedule as Vaswani et al. [40] (https://github.com/tensorflow/tensor2tensor). We train both the TTS and ASR models in LRSpeech on 4 NVIDIA V100 GPUs. Each batch contains 20,000 speech frames in total. The pre-training and fine-tuning, dual transformation, and knowledge distillation stages take nearly 1, 7, and 1 days respectively. We measure the TTS inference speed on a server with 12 Intel Xeon CPUs, 256GB memory, and 1 NVIDIA V100 GPU. The TTS model takes about 0.21s to generate 1.0s of speech, which satisfies the online deployment requirements for inference speed. (We show some high-quality speech (target speaker) and low-quality speech (other speakers) from the training set on the demo page: https://speechresearch.github.io/lrspeech.)
A.3 Evaluation Details
Mean Opinion Score (MOS).
The MOS test is a speech quality test for naturalness where listeners (testers) are asked to rate the speech quality on a five-point scale: 5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad. We randomly select 20 sentences to synthesize speech for the MOS test and each audio is listened to by 20 testers, who are all native speakers. We present a part of the MOS test results in Figure 5; the complete test report can be downloaded from the demo page (https://speechresearch.github.io/lrspeech).
Figure 5: A part of the English MOS test report.
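Aggregating the ratings is a simple mean over all 1-5 scores; a sketch follows. The confidence half-width is a common convention when reporting MOS, not something the paper specifies, so treat it as an added assumption:

```python
import math

def mos(scores, z=1.96):
    """Mean opinion score over individual 1-5 ratings, plus a
    normal-approximation ~95% confidence half-width."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, half
```

For the 20 sentences x 20 testers setup above, `scores` would hold 400 ratings per system.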
Intelligibility Rate (IR).
The IR test is a speech quality test for intelligibility. During the test, the listeners (testers) are requested to mark every unintelligible word in the text sentence. IR is calculated as the proportion of intelligible words over the total test words. We randomly select 200 sentences to synthesize speech for the IR test and each audio is listened to by 5 testers, who are all native speakers. A part of the IR test results is shown in Figure 6. More test reports can be found via the demo link.
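Under that definition, IR aggregation is a ratio of word counts; a minimal sketch (the pair-per-utterance input format is an assumption):

```python
def intelligibility_rate(marks):
    """marks: one (total_words, unintelligible_words) pair per rated
    utterance, aggregated over all testers; returns IR in percent."""
    total = sum(t for t, _ in marks)
    unintelligible = sum(u for _, u in marks)
    return 100.0 * (total - unintelligible) / total
```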
WER and CER.
Given the reference text and the predicted text, the WER computes the edit distance between them and normalizes the distance by the number of words in the reference sentence.
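This word-level edit distance can be sketched directly (a minimal implementation, not the paper's evaluation code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions,
    deletions, and insertions each cost 1) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))        # row for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]                           # deleting all i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution/match
        prev = cur
    return prev[-1] / len(ref)
```

For the reference "an apple" and prediction "what is history" this returns 1.5, i.e. a WER of 150%; CER is the same computation over character sequences instead of words.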
Formally, WER = (S + D + I) / N, where N is the number of words in the reference sentence, S is the number of substitutions, D is the number of deletions, and I is the number of insertions. The WER can be larger than 100%. For example, given the reference text "an apple" and the predicted text "what is history", the predicted text needs two substitution operations and one insertion operation, so the WER is (2 + 1) / 2 = 150%. The CER is defined similarly at the character level.
A.4 Some Explorations in Experiments
We briefly describe some other explorations in training LRSpeech in this paper:

Dataset      Type                  Speakers  Language            Open Source  Usage
Data Baker   High-quality speech   Single    Mandarin Chinese    Yes          Pre-training
AIShell      Low-quality speech    Multiple  Mandarin Chinese    Yes          Pre-training
LJSpeech     High-quality speech   Single    English             Yes          Training
LibriSpeech  Low-quality speech    Multiple  English             Yes          Training / English ASR test
Liepa        Low-quality speech    Multiple  Lithuanian          Yes          Training / Lithuanian ASR test
news-crawl   Text                  /         English/Lithuanian  Yes          English/Lithuanian training and TTS test

Table 6: The datasets used in this paper.
Setting     Result
Reference   some mysterious force seemed to have brought about a convulsion of the elements
Baseline

Table 7: A case analysis of the ASR model under different settings.
Figure 6: A part of the English IR test report.
• Pre-training and Fine-tuning
We also try different methods, such as unifying the character spaces between rich-resource and low-resource languages, or learning the mapping between the character embeddings of rich- and low-resource languages as used in [9]. However, we find these methods result in similar accuracy for both TTS and ASR.
• Speaker Module
To design the speaker module, we explore several alternatives, including replacing softsign with ReLU. Experimental results show that the design in Figure 2 (c) helps the model reduce repeated and missing words.
• Knowledge Distillation for TTS
We try adding the paired target-speaker data for training. However, the result is slightly worse than using only synthesized speech.
• Knowledge Distillation for ASR
Since the synthesized speech can improve the performance, we try removing the real speech and training with plenty of synthesized speech only. However, the ASR model then cannot work well on real speech, and the WER is above 47%.
• Vocoder Training