Exploration of End-to-end Synthesisers for Zero Resource Speech Challenge 2020

Karthik Pandia D S, Anusha Prakash, Mano Ranjith Kumar M, Hema A Murthy

Indian Institute of Technology Madras, India
[email protected], [email protected], [email protected], [email protected]
Abstract
A spoken dialogue system for an unseen language is referred to as a zero resource speech system. Such systems are especially beneficial for developing applications for languages with low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. TTS models are then trained on the discovered acoustic unit sequences.

The main goal of this work is to improve the synthesis quality of the zero resource TTS system. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by mapping of the unit sequence to a spectrogram, and finally spectrogram-to-speech inversion. Modifications are proposed to the spectrogram mapping stage. These modifications include training the mapping on voice data, using x-vectors to improve the mapping, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the Zerospeech 2020 challenge shows that quite good quality synthesis can be achieved.
Index Terms: Text-to-speech synthesis, Acoustic unit discovery, TTS without T, end-to-end speech synthesis
1. Introduction
An infant learns the acoustic units of a language and reproduces them by babbling, even before s/he starts to recognise the sounds. The goal of the Zerospeech task is to design a system that closely mimics this process. The Zerospeech challenge ultimately aims to build an autonomous spoken dialogue system. Previous challenges have focused on tasks such as spoken term detection, spoken term discovery (STD), and zero resource speech synthesis. The Zerospeech 2020 challenge consolidates the STD and synthesis tasks, run as task 1 and task 2, respectively. Acoustic unit discovery (AUD) is a common step in both tasks; STD and speech synthesis are applications of the discovered acoustic units (AUs).

Several AUD approaches have been proposed in the literature for various tasks [1]. The baseline provided by the organisers uses Dirichlet process Gaussian mixture models (DPGMM) [2] for AUD and Ossian [3] for speech synthesis. Some of the classic AUD approaches are the non-parametric Bayesian approach by Lee et al. [4], AUD based on weak top-down constraints [5], and autoencoder-based approaches [6, 7]. Most systems in the Zerospeech 2019 challenge [8] used autoencoder-based approaches for AUD [9, 10, 11]. The objective function in most approaches is frame-based, wherein sequence information is not explicitly modelled. Hidden Markov model (HMM) based generative modelling of AUs is one of the top systems [1] in terms of synthesis quality and low-bitrate encoding. Even the baseline is a non-parametric HMM-based approach [2]. In [1], AUs are explicitly modelled as transient and steady-state regions. A modified version of this approach is used in this work. In the past two years, there has been a surge in the development of training approaches for TTS [12, 13, 14, 15]. The current work uses the AUD technique from [1] and focuses on improving the synthesis quality in an end-to-end framework.

The TTS framework used in this work has two stages. The first stage maps the AU symbol sequence to the corresponding spectrogram. The second stage inverts the spectrogram back to speech. Since the task is to synthesise speech in a target speaker's voice, the inversion stage is fixed. Different methods are proposed to learn the mapping between the AU sequence and the spectrogram efficiently. The proposed approaches explore an end-to-end framework that includes speaker embedding, hierarchical training, and gender-dependent training to learn the mapping.

End-to-end TTSes are trained based on the Tacotron2 architecture [13]. x-vectors are used as speaker embeddings to produce speech in the target speaker's voice [16]. In hierarchical learning, similar to the AUD approach, the mapping is first confined to smaller units. The obtained model is then used to bootstrap learning on full utterances. Since the training data has both male and female speakers, gender-dependent TTSes are developed. During synthesis, the appropriate TTS is employed depending on the gender of the target speaker. Subjective measures indicate that there is an improvement in the overall quality of the synthesised speech output compared to our systems in the Zerospeech 2019 challenge, with a slight degradation in the speaker similarity measure.

The rest of the paper is organised as follows. The proposed approaches are presented in Section 2. Section 3 details the experiments carried out. Section 4 discusses the results and their analysis. The work is concluded in Section 5.
2. Proposed systems
The AUD approach used in this work is similar to the approach in [1]. A modification to the syllable-like segmentation algorithm is proposed, which makes the AUD totally unsupervised. The AUD approach and the four TTS systems incorporated in the Zerospeech pipeline are explained in this section. These systems are illustrated in the block diagram given in Figure 2.
2.1. Acoustic unit discovery

An overview of the AUD approach used in [1] is briefly given here. The approach models speech as transient and steady-state regions. The transient regions correspond to rising and falling transients, and the steady-state regions predominantly correspond to vowels. The block diagram of the proposed AUD approach is shown in Figure 1.

Figure 1: Steps for acoustic unit discovery (AUD): syllable-like segmentation, similarity matrix computation, KNN graph clustering, and two stages of self-training to obtain the AU model from the unit data.

First, speech is segmented into syllable-like units. A similarity matrix is obtained by computing the DTW score between all pairs of syllable-like segments. Homogeneous syllable-like units are clustered using a K-nearest neighbour (KNN) graph clustering approach. The syllable-like units in each cluster are modelled as a sequence of three AUs corresponding to a rising transient, a steady state, and a falling transient. HMMs are used to model the AUs. Using the trained models, the syllable-like units are transcribed, and the obtained transcriptions are used to retrain the models. The training and transcription processes are repeated until convergence. This process of repeated training and transcription is termed self-training. The initial models thus obtained are trained only on the syllable-like segments present in the clusters. Using the initial models, the full set of syllable-like segments is transcribed, and self-training is performed on this set to obtain better models. The self-training process on the syllable-like segments is referred to as stage 1 training. Once the models are trained on the entire set of syllable-like segments, they are used to transcribe continuous speech. Stage 2 self-training is performed on continuous speech until convergence. The models thus obtained are the final models used to generate the AU sequence.

In [1], a vowel posterior function was used to extract syllable-like units, which is not completely unsupervised. In the current work, a signal processing based approach is used to segment the speech into syllable-like regions, making it completely unsupervised. Specifically, the unsupervised segmentation approach uses short-time energy (STE) post-processed using a group-delay function [17]. The STE function is treated as the magnitude function of a signal. It has been shown that the poles and zeros of a magnitude function can be better resolved using group-delay processing, by deriving the minimum phase function corresponding to the original signal [18]. Accordingly, such a function is derived, and its group delay is computed to resolve the peaks and valleys of the STE reliably.
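The pairwise similarity matrix and KNN graph construction can be sketched as follows. This is a minimal illustration using librosa's DTW on MFCC features; the segment boundaries, feature set, distance normalisation, and similarity kernel are illustrative assumptions rather than the exact configuration used in this work.

```python
import numpy as np
import librosa

def segment_features(y, sr, boundaries, n_mfcc=13):
    """Extract an MFCC sequence for each (start, end) syllable-like segment (seconds)."""
    feats = []
    for start, end in boundaries:
        seg = y[int(start * sr):int(end * sr)]
        feats.append(librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc))
    return feats

def dtw_similarity_matrix(feats):
    """Pairwise DTW cost between segments, converted to a similarity matrix."""
    n = len(feats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D, wp = librosa.sequence.dtw(X=feats[i], Y=feats[j], metric='euclidean')
            cost = D[-1, -1] / len(wp)            # path-length normalised DTW cost
            dist[i, j] = dist[j, i] = cost
    return np.exp(-dist / (dist.std() + 1e-8))    # simple Gaussian kernel (illustrative)

def knn_graph(sim, k=10):
    """Adjacency list keeping the k most similar segments for each segment."""
    graph = {}
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])
        graph[i] = [j for j in order if j != i][:k]
    return graph

# Example usage with hypothetical boundaries from the unsupervised segmentation step:
# y, sr = librosa.load('unit_utterance.wav', sr=16000)
# boundaries = [(0.10, 0.32), (0.32, 0.55), (0.55, 0.81)]
# sim = dtw_similarity_matrix(segment_features(y, sr, boundaries))
# neighbours = knn_graph(sim, k=10)
```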
2.2. System 1: End-to-end synthesis

For synthesising speech, the end-to-end paradigm is used. The end-to-end speech synthesis framework is an attractive platform to use, as training TTSes is easy. It alleviates the need for separate modules for feature engineering and language-specific tasks. Synthesisers can be trained given only speech waveforms and the corresponding text transcriptions (sequences of acoustic units in this case).

The end-to-end framework used in this work is based on the Tacotron2 architecture [13]. It takes care of the conversion of a sequence of AUs to mel-spectrograms. Tacotron2 consists of an encoder and a decoder with attention. The encoder extracts sequential information from the character embeddings, and the attention module predicts a fixed-length context vector. The decoder predicts frame-level mel-spectrograms at each step.

Figure 2: Block diagram of the proposed systems (blocks: syllable-like segmentation, AU model, transcription, spectrogram mapping stages 1 and 2, gender identification, x-vector extraction). Numbers next to the arrows refer to the flow of the corresponding system type (common to all systems, x-vector based system, hierarchical training, gender-dependent TTS). The flow of System type 1 is common to all other systems.

For the speech waveform inversion, the WaveGlow vocoder is used [14]. WaveGlow takes a mel-spectrogram as input and generates the speech output. WaveGlow is a neural vocoder that combines ideas from WaveNet [12] and Glow [19]. It uses a single network with a single likelihood cost function. WaveGlow is a generative model in which samples are generated from a zero-mean spherical Gaussian whose dimension is the same as that of the output. It uses a series of non-linear layers to transform the Gaussian distribution to the desired distribution, which comes from audio samples conditioned on mel-spectrograms. The flow of this system can be seen in Figure 2. It is to be noted that this flow is common to all the other proposed systems.
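The inversion stage can be illustrated with a short sketch, assuming a pre-trained WaveGlow checkpoint in the format used by the NVIDIA reference implementation (https://github.com/NVIDIA/waveglow); the checkpoint file name and the mel tensor shape are illustrative assumptions, not values from this paper.

```python
import numpy as np
import torch

def invert_mel(mel: np.ndarray,
               checkpoint_path: str = "waveglow_v001.pt",   # hypothetical checkpoint name
               sigma: float = 0.6) -> np.ndarray:
    """Invert a (n_mels, frames) mel-spectrogram to a waveform with WaveGlow."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # NVIDIA checkpoints store the model object under the 'model' key; depending on
    # the checkpoint, waveglow.remove_weightnorm(waveglow) may also be required.
    waveglow = torch.load(checkpoint_path, map_location=device)["model"]
    waveglow.eval()

    mel_t = torch.from_numpy(mel).float().unsqueeze(0).to(device)  # (1, n_mels, frames)
    with torch.no_grad():
        audio = waveglow.infer(mel_t, sigma=sigma)                 # (1, samples)
    return audio.squeeze(0).cpu().numpy()
```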
2.3. System 2: x-vector based TTS

One of the objectives of the challenge is to produce synthesised speech in the target voice. The conventional Tacotron2 framework does not incorporate any speaker-specific information and may not be suited to a multi-speaker setting. Incorporating speaker embeddings in the Tacotron2 framework provides better flexibility in terms of speaker selection. For this purpose, x-vectors are used. x-vectors have been conventionally used for speaker recognition and verification tasks [20, 21], and have now been applied to TTS tasks too [16].

x-vectors are fixed-length speaker embeddings computed from variable-length utterances. The model that computes x-vectors is trained to discriminate between speakers using a time-delay neural network (TDNN) architecture [22]. x-vectors are extracted from the audio files and then appended to each encoder state of the sequence-to-sequence model. The TTS synthesiser is then trained [16].

During testing of the TTS, per-utterance x-vectors are not available, as only transcriptions are provided. Hence, the mean of the x-vectors corresponding to the speaker in the training data is considered as the speaker x-vector. The same speaker x-vector is appended to all the encoder states for synthesis.
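A minimal sketch of this conditioning step is given below: the mean speaker x-vector is concatenated to every encoder frame before attention and decoding. The tensor names and dimensions are illustrative and do not reproduce the exact ESPnet implementation.

```python
import torch

def condition_encoder_states(encoder_states: torch.Tensor,
                             xvectors: torch.Tensor) -> torch.Tensor:
    """
    encoder_states: (batch, time, enc_dim) outputs of the Tacotron2 encoder.
    xvectors:       (num_train_utts, 512) x-vectors pooled over the target speaker.
    Returns encoder states with the mean x-vector appended to every frame,
    i.e. shape (batch, time, enc_dim + 512).
    """
    mean_xvec = xvectors.mean(dim=0)                            # (512,)
    batch, time, _ = encoder_states.shape
    expanded = mean_xvec.view(1, 1, -1).expand(batch, time, -1)
    return torch.cat([encoder_states, expanded], dim=-1)

# Illustrative shapes only: 2 utterances, 120 encoder frames, 512-dim encoder states.
enc = torch.randn(2, 120, 512)
spk_xvecs = torch.randn(25, 512)    # x-vectors from the target speaker's training data
print(condition_encoder_states(enc, spk_xvecs).shape)   # torch.Size([2, 120, 1024])
```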
2.4. System 3: Hierarchical training

In a conventional Tacotron TTS system, the mapping between the symbols and the spectrogram is learnt at the utterance level. It has been shown in the literature that the model learns the mapping better when training is performed on short utterances. This is primarily because the confusion in the mapping between the input symbol sequence and the spectrogram decreases. Unlike a phoneme or character, which is predefined for a language, there is an inherent confusion between the discovered AUs. This is because, unlike phonemes, AUs are not uniquely represented for different sounds. This confusion adds to the existing problem of training with long utterances. To alleviate these issues, the mapping between the AU sequence and the spectrogram regions is performed at the syllable level. This is similar to the unit discovery training used in this work, wherein the initial self-training first starts at the syllable level. As in AUD, learning of the mapping is constrained within smaller syllable-like segments, leading to robust initial models. The initial network parameters thus obtained are used to bootstrap the training process in the next stage, where fine-tuning is performed using the utterance-level data.
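A schematic sketch of this two-stage schedule follows. The dataset and model classes, epoch counts, and learning rates are hypothetical stand-ins for the actual recipe; only the warm-starting pattern (syllable-level first, then utterance-level fine-tuning from the same parameters) reflects the description above.

```python
import torch
from torch.utils.data import DataLoader

def train_stage(model, dataset, epochs, lr):
    """Generic training loop: AU symbol sequences -> mel-spectrogram targets."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for au_seq, mel_target in loader:
            opt.zero_grad()
            mel_pred = model(au_seq)                       # sequence-to-sequence forward pass
            loss = torch.nn.functional.l1_loss(mel_pred, mel_target)
            loss.backward()
            opt.step()
    return model

# Hypothetical dataset/model objects:
# syllable_set  = AUMelDataset(split="syllable")    # (AU sequence, mel) pairs per syllable
# utterance_set = AUMelDataset(split="utterance")   # (AU sequence, mel) pairs per utterance
# tts = Tacotron2LikeModel()

# Stage 1: constrain the AU-to-spectrogram mapping to syllable-like segments.
# tts = train_stage(tts, syllable_set, epochs=100, lr=1e-3)

# Stage 2: bootstrap utterance-level training from the stage-1 parameters.
# tts = train_stage(tts, utterance_set, epochs=200, lr=1e-4)
```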
2.5. System 4: Gender-dependent TTS

One of the objectives of the Zerospeech TTS task is to synthesise speech in a target speaker's voice. The characteristics of the target speaker have to be preserved while training. There are two components in the system: symbol-to-spectrogram mapping and spectrogram-to-audio inversion. Similar to phonemes, the AUs should be agnostic to the speaker's characteristics. Therefore, the target speaker's data can be used to train both the mapping and the inversion task. Surprisingly, it is observed that the synthesised speech has the source speaker's characteristics. This means that the source speaker's characteristics, in terms of timbre, are embedded in the spectrogram through the symbol sequence.

Experiments are conducted to evaluate the difference in the quality of the synthesised speech when (a) unit data or (b) voice data is used to train the spectrogram mapping. When the data provided for unit discovery was used to learn the spectrogram mapping, there was an improvement in the intelligibility of the synthesised speech. This improvement in intelligibility can be attributed to the larger amount of data used to learn the mapping. Although the intelligibility improved, the speaker similarity measure degraded. This was more pronounced when a male speaker's audio was synthesised in a female speaker's voice and vice versa. Hence, gender-dependent systems are built, which address both data insufficiency and the problem of mismatched conditions.

Gender identification is performed using GMMs. Mel-frequency cepstral coefficients are extracted using a longer frame size. A GMM is trained using both the male and female target speakers' data. The male and female GMMs are then obtained using maximum a posteriori (MAP) adaptation. During classification, likelihood ratio (LR) testing is performed. Each file in the unit set is classified based on the LR score, and a voting rule is applied to these decisions to arrive at the final gender classification; the gender with the higher number of votes is taken as the identified gender. Once the gender of the unit files is identified, spectrogram mapping is performed separately for each gender. During testing, the appropriate spectrogram mapping model is used to estimate the spectrogram, which is then fed to the WaveGlow model for synthesis.
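The gender classifier can be sketched as follows: frame-level GMM scoring with a log-likelihood ratio decision and a majority vote over the frames of a file. The sketch fits one scikit-learn GaussianMixture per gender directly on MFCCs, which is a simplification of the MAP-adaptation scheme described above; the frame length and number of mixture components are illustrative.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13, frame_length=0.1):
    """MFCCs with a longer-than-usual frame size (100 ms here, illustrative)."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(frame_length * sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=n_fft // 2).T   # (frames, n_mfcc)

# Train one GMM per gender on the target speakers' voice data (direct ML fit here;
# the paper describes MAP adaptation of gender GMMs from a common GMM):
# male_gmm   = GaussianMixture(n_components=32).fit(np.vstack([mfcc_frames(f) for f in male_files]))
# female_gmm = GaussianMixture(n_components=32).fit(np.vstack([mfcc_frames(f) for f in female_files]))

def classify_gender(path, male_gmm, female_gmm):
    """Per-frame log-likelihood ratio test, then a majority vote over the file's frames."""
    X = mfcc_frames(path)
    llr = male_gmm.score_samples(X) - female_gmm.score_samples(X)   # per-frame log-LR
    votes_male = int((llr > 0).sum())
    return "male" if votes_male > len(llr) / 2 else "female"
```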
3. Experiments
3.1. Dataset

The datasets used in this work are part of the Zerospeech 2019 challenge data. Datasets are provided for two languages: English, the development language, and Indonesian, the surprise language. The dataset for each language is divided into three sets: a unit set to train the AU models, a voice set to train the TTS models, and a test set to evaluate the system. The English dataset has about 15 hours of unit data (100 speakers), and 2 hours (V001, male) and 2.6 hours (V002, female) of voice data for the target speakers. The test data has about 28 minutes of speech across 34 speakers. The Indonesian (surprise language) dataset has about 15 hours of unit data (112 speakers) and 1.5 hours of voice data for a target female speaker. The test data has about 29 minutes of speech across 15 speakers. A detailed description of the surprise language dataset is given in [23, 24]. The task is to synthesise test sentences uttered by a source speaker with the target speaker's characteristics. This is similar to the voice conversion problem.
3.2. System description

The four systems developed as part of this challenge differ in their synthesiser; the AUD approach is the same across all systems. The Kaldi toolkit [25] is used for AUD. ESPnet, which provides an implementation of Tacotron2, is used for AU sequence to mel-spectrogram conversion [16]. The encoder-decoder network is trained for 200 epochs using location-sensitive attention along with guided attention. Two variations of System 1 are built: System 1 (unit), trained on unit data, and System 1 (voice), trained on voice data.

WaveGlow uses mel-spectrograms extracted with 80-bin librosa [26] mel filters. Training a good WaveGlow model from scratch is time-consuming. Hence, in this work, WaveGlow models are re-trained from the pre-trained LJ Speech model [14] for about 10K iterations. Three WaveGlow models, corresponding to the target speakers, are trained on the voice data: two for English (V001, V002) and one for the surprise language.

For speaker embedding, 512-dimensional x-vectors are extracted from the audio files using a pre-trained x-vector model [20] provided with Kaldi. For this variant of the TTS, the voice training data is also pooled with the multi-speaker source data for training. Test sentences were synthesised using all the systems. Listening tests suggested that System 1 (voice) and System 2, corresponding to the vanilla TTS trained on voice data and the TTS with speaker embedding, respectively, were the best systems. Hence, Systems 1 (voice) and 2 were submitted to the Zerospeech 2020 challenge.
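The 80-bin mel-spectrogram extraction mentioned above can be sketched with librosa as follows. The sampling rate, FFT size, and hop length are typical Tacotron2/WaveGlow values and are assumptions, not parameters stated in this paper.

```python
import numpy as np
import librosa

def mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """80-bin log mel-spectrogram; sr, n_fft and hop_length are illustrative defaults."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # (80, frames)

# mel = mel_spectrogram('voice/V001_0001.wav')   # hypothetical file name
```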
4. Results and analysis
Table 1: MUSHRA test scores of the different systems: System 1 (unit), System 1 (voice), System 2, and System 4.
Table 2: Evaluation measures on the development and test languages. Scores are shown for the baseline system, the topline system, and the two systems submitted to each of the Zerospeech 2019 (ZS19) and Zerospeech 2020 (ZS20) challenges. For the development language (English) and the test language (Indonesian), the measures reported are MOS, CER, speaker similarity, ABX, and bitrate; the rows are Baseline, Topline, System 1 (voice) (ZS20), System 2 (ZS20), System 1 (ZS19), and System 2 (ZS19).
MUSHRA tests [27] are conducted on random utterances from the test set. The target utterance is provided as reference audio, and the ratings for the outputs of the different systems are given on a scale of 0 to 100, 100 being the best. 18 listeners participated in the evaluations. The results are shown in Table 1.

For System 3 (hierarchical), it was observed that some of the synthesised utterances had artefacts and random speech in certain regions. Although the syllable-level mapping was learnt well, this did not scale to the utterance level. The reason for such artefacts needs to be further investigated. Hence, System 3 was not considered for the MUSHRA evaluation.

System 1 (unit) gave the best MUSHRA score, followed by System 2 (x-vector) and System 1 (voice). Although System 2 and System 1 (voice) had better speaker similarity, the overall intelligibility of System 1 (unit) was better; the listeners appear to have discounted the speaker factor while giving the scores. The performance of System 4 was poor, as the synthesised audio had artefacts, though not to the extent present in the synthesised audio of System 3. Based on the MUSHRA test scores for the development language (English) and on inspection of speaker similarity, System 1 (voice) and System 2 were submitted to the challenge.

A detailed analysis of the results of the systems submitted to the Zerospeech 2019 (ZS19) and 2020 (ZS20) challenges is presented here. In our submission to ZS19, the focus of the experiments was on unit discovery. In the current work, the AUD method is fixed and experiments are conducted extensively to improve the synthesis quality. In simple terms, the sole objective is to improve the subjective evaluation measures: mean opinion score (MOS) and character error rate (CER).

Table 2 summarises the results of the baseline and topline systems of ZS20, and the results of our submissions to ZS19 and ZS20. First, we compare the results of the ZS20 systems, System 1 (voice) and System 2. According to Table 1, the overall synthesis quality of System 2 is better than that of System 1 (voice). The MOS scores of the systems on the development data (Table 2) show a similar pattern, and for Indonesian the difference is significant. The improvement in CER is significant when the x-vector model (System 2) is used for the spectrogram mapping, with absolute improvements for both English and Indonesian. Hence, System 2 is better in terms of the subjective evaluation measures, with more or less similar speaker similarity numbers.

In our ZS19 submission, it was observed that the MOS and CER scores were higher when the number of AUs was larger, albeit with increased bitrate. This increase in bitrate was approximately 20 for an increase in the number of units from 40 to 112. For this year's work, the number of AUs is set to 100 so that the bitrate is still not too large. Though the segmentation approach is different in this year's work, the bitrate follows a similar pattern: 126.41 for English and 101.0 for the surprise language.

When compared to last year's results (ZS19), the speaker similarities for all the current submissions are lower. This effect is seen even in the x-vector system (ZS20 System 2), which is supposed to normalise the speaker's characteristics with the help of speaker embeddings. This could mean that there is an inherent problem with the spectrogram mapping, which is trained on unit data. Moreover, the source speaker's characteristics seem to persist even when the spectrogram mapping is trained only on voice data. This might be because of data insufficiency in the voice data, as E2E systems require a large amount of training data.

Comparing the MOS results with the systems submitted to ZS19, there is an absolute increase for both English and Indonesian. The CER also improves for English, whereas it remains almost the same for Indonesian. The effects on the two languages do not follow similar patterns.

Based on the discussion and analysis of the results, we see that the overall synthesis quality has improved compared to our submission to the Zerospeech 2019 challenge, where the baseline TTS (Ossian) was used. To get the full benefit of using E2E approaches, the proposed techniques have to be applied to much larger training data.
5. Conclusion
While AUD is important towards building zero resource speech synthesis systems, the TTS component is equally vital. In this work, the AU sequence, in terms of steady-state and transient regions in speech, is used as the transcription to build TTS systems. Various approaches to improve the synthesis quality in an end-to-end framework are explored. While there is a significant improvement in the overall synthesis quality using the E2E framework, speaker similarity seems to be an issue. Despite using only the target speaker's data to train the TTS model, the source speaker's characteristics are observed in the synthesised speech. This problem could be addressed by increasing the amount of data for the target speaker.
6. References

[1] K. P. D. S. and H. A. Murthy, "Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units," in Proc. Interspeech 2019, 2019, pp. 1113-1117. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2336
[2] L. Ondel, L. Burget, and J. Černocký, "Variational inference for acoustic unit discovery," Procedia Computer Science, vol. 81, pp. 80-86, 2016.
[3] Z. Wu, O. Watts, and S. King, "Merlin: An open source neural network speech synthesis system," in SSW, 2016, pp. 202-207.
[4] C.-y. Lee and J. Glass, "A nonparametric Bayesian approach to acoustic model discovery," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2012, pp. 40-49.
[5] A. Jansen, S. Thomas, and H. Hermansky, "Weak top-down constraints for unsupervised acoustic model training," in ICASSP. IEEE, 2013, pp. 8091-8095.
[6] L. Badino, C. Canevari, L. Fadiga, and G. Metta, "An auto-encoder based approach to unsupervised learning of subword units," in ICASSP. IEEE, 2014, pp. 7634-7638.
[7] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, "A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, "The Zero Resource Speech Challenge 2019: TTS Without T," in Proc. Interspeech 2019, 2019, pp. 1088-1092. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2904
[9] R. Eloff, A. Nortje, B. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper, "Unsupervised Acoustic Unit Discovery for Speech Synthesis Using Discrete Latent-Variable Neural Networks," in Proc. Interspeech 2019, 2019, pp. 1103-1107. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-1518
[10] A. T. Liu, P.-c. Hsu, and H.-Y. Lee, "Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion," in Proc. Interspeech 2019, 2019, pp. 1108-1112. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2048
[11] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, "VQVAE Unsupervised Unit Discovery and Multi-Scale Code2Spec Inverter for Zerospeech Challenge 2019," in Proc. Interspeech 2019, 2019, pp. 1118-1122. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-3232
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
[13] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779-4783.
[14] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3617-3621.
[15] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Close to human quality TTS with transformer," arXiv preprint arXiv:1809.08895, 2018.
[16] T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan, "ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7654-7658.
[17] V. K. Prasad, T. Nagarajan, and H. A. Murthy, "Automatic segmentation of continuous speech using minimum phase group delay functions," Speech Communication, vol. 42, no. 3-4, pp. 429-446, 2004.
[18] J. Sebastian, M. Kumar, and H. A. Murthy, "An analysis of the high resolution property of group delay function with applications to audio signal processing," Speech Communication, vol. 81, pp. 42-53, 2016.
[19] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 10215-10224. [Online]. Available: http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf
[20] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329-5333.
[21] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017. ISCA, 2017, pp. 999-1003.
[22] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[23] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, "Development of HMM-based Indonesian speech synthesis," in Proc. Oriental COCOSDA, vol. 1, 2008.
[24] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, "Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project," in Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), 2008.
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.
[26] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and Music Signal Analysis in Python," in Proceedings of the 14th Python in Science Conference, 2015, pp. 18-25.
[27] ITU-R Recommendation BS.1534, "Method for the subjective assessment of intermediate sound quality (MUSHRA)."