Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages
FENG, Siyuan

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy in Electronic Engineering

The Chinese University of Hong Kong
May 2020

Thesis Assessment Committee
Professor CHING, Pak-Chung (Chair)
Professor LEE, Tan (Thesis Supervisor)
Professor LIU, Xunying (Committee Member)
Professor Thomas HAIN (External Examiner)

To my family.

Abstract
This thesis describes an investigation on unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in the zero-resource scenario, where only untranscribed speech data is assumed to be available. UAM is not only important in addressing the general problem of data scarcity in ASR technology development, but also essential to many non-mainstream applications, for example, language protection, language acquisition and pathological speech assessment. The present study is focused on two research problems. The first problem concerns unsupervised discovery of basic (subword-level) speech units in a given language. Under the zero-resource condition, the speech units can be inferred only from the acoustic signals, without requiring or involving any linguistic direction and/or constraints. The second problem is referred to as unsupervised subword modeling. In essence, a frame-level feature representation needs to be learned from untranscribed speech. The learned feature representation is the basis of subword unit discovery, and is desired to be linguistically discriminative and robust to non-linguistic factors. Extensive use of cross-lingual knowledge in subword unit discovery and modeling is a particular focus of this research.

For unsupervised subword modeling, a two-stage system framework is adopted. The two stages are known as frame labeling and deep neural network bottleneck feature (DNN-BNF) modeling. Various approaches to speaker adaptation are applied to produce robust input features for frame labeling and DNN training. These approaches include feature-space maximum likelihood linear regression (fMLLR) assisted by out-of-domain (OOD) ASR, disentangled speech representation learning and speaker adversarial training. Experimental results on the Zero Resource Speech Challenge (ZeroSpeech) 2017 show that the fMLLR approach achieves the most significant performance improvement over the baseline system, and that further improvement can be attained by combining multiple adaptation approaches.
The quality of frame labels has a significant impact on the DNN-BNF framework. The frame labeling approaches proposed in this thesis include the Dirichlet process Gaussian mixture model-hidden Markov model (DPGMM-HMM) and OOD ASR decoding. A label filtering algorithm is developed to improve the quality of DPGMM-HMM frame labels, and multi-task learning with the DPGMM-HMM labels and OOD ASR labels is investigated. Experimental evaluation on the ZeroSpeech 2017 tasks demonstrates the advantage of DPGMM-HMM labels over DPGMM clustering labels. The best performance, achieved by combining multiple types of labels and BNFs, is comparable to that of the best system submitted to ZeroSpeech 2017.

Unsupervised unit discovery is tackled with the acoustic segment modeling (ASM) approach. Multiple language-mismatched phone recognizers are used to generate the initial segmentation of speech and the phone posteriorgram features of speech segments. A symmetric Kullback-Leibler (KL) divergence based distance metric is employed to analyze the linguistic relevance of the discovered subword units. Experiments on a multilingual speech database show that the proposed methods achieve performance comparable to a previous study, with a simpler implementation. The KL divergence metric is consistent with the conventional measure of purity. The discovered subword units are found to provide good coverage of the linguistically-defined phones. A few exceptions, e.g. /er/ in Mandarin, can be explained by the limited modeling capability of the adopted system. The confusion of a discovered unit between ground-truth phones can be alleviated by increasing the number of clusters.

Acknowledgement

I would like to express my sincere gratitude to my supervisor, Prof. Tan Lee, for his patient guidance, thoughtful ideas and continued support and trust during my Ph.D. research. I thank Prof. P. C. Ching and Prof. Wing-Kin Ma for their advice on my thesis research during group meetings and progress presentations. I wish to thank Haipeng for introducing me to the area of speech processing and continually giving me advice over the past four years, even though we literally had no overlap in time at the DSP Lab.
I also would like to thank David for an excellent tutorial on the HTK toolkit and the fundamentals of automatic speech recognition during my undergraduate internship. Thanks to Raymond for sharing his experience in conducting scientific research in academia and industry.

I would like to thank the numerous DSP Lab colleagues; it was a wonderful time working with them. Special thanks to Michael for sharing his expertise in Linux operation and server machine maintenance. I am proud to have been the one in charge of server machine administration after Michael graduated. Thanks to Yuanyuan and Carol for their research advice as senior colleagues when I joined as a fresh student. Thanks to Ying and Xurong for continually sharing thoughts and giving encouragement during the study period. I also express my thanks to Sammi, Herman, Shuiyang, Matthew, Zhiyuan, Yuzhong, Jiarui, Ryan, Mingjie, Dehua, Guangyan, Yatao, Gary, Wilson, etc. I sincerely appreciate Arthur, our technician, for taking care of equipment, software and schedules.

I am very grateful to have been a member of the DSP coffee and basketball interest groups. It was a pleasant time enjoying coffee with Herman, Sammi, Michael, Gary, etc. They taught me a lot about how to make excellent coffee. I will not forget the basketball games with Prof. Lee, Herman, Shuiyang, Dehua, Mingjie, Yatao, etc.
I express my sincerest gratitude to my family for their love and support.
Chapter 1
Introduction
Speech is an important and the most natural means of human communication. Speech technology has been one of the core areas in artificial intelligence (AI), alongside computer vision and natural language processing. The major components of speech technology include automatic speech recognition (ASR), text-to-speech (TTS), speaker identification (SID) and speaker verification (SV), language identification (LID), voice conversion (VC), etc.

ASR refers to a computational process of identifying the linguistic content of human speech from an acoustic signal. It can be described as a non-linear mapping from continuous speech to discrete linguistic items such as words. ASR is closely related to other topics of speech technology research, e.g. VC, LID and spoken term detection. A typical ASR system is made up of an acoustic model (AM), a language model (LM), a pronunciation lexicon and a decoder. The AM calculates the probability of an observed speech utterance conditioned on a sequence of pre-defined speech units (phonemes). The LM and the lexicon jointly produce the probability of a sequence of words, each comprising a string of phonemes, for the given language. The decoder implements a search algorithm to determine the best-matching word sequence for the input utterance, based on the constraints and probabilities given jointly by the AM, the LM and the lexicon.

AM is the core component in an ASR system. It connects speech and phonemes.
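The decoding process described above is commonly summarized by the standard decision rule of statistical ASR; this is a textbook formulation rather than one specific to this thesis:

$$\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O) = \operatorname*{arg\,max}_{W} \; p(O \mid W)\, P(W),$$

where $O$ is the sequence of acoustic observations, $p(O \mid W)$ is the AM likelihood computed over the phoneme sequences that the lexicon assigns to the word sequence $W$, and $P(W)$ is the LM probability.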
Traditionally, hidden Markov models with state observation probabilities modeled by Gaussian mixtures have been adopted to represent individual phonemes or other subword units in speech [2]. This type of model is commonly referred to as the GMM-HMM. In recent years, the deep neural network (DNN) has been shown to be more powerful than the GMM in modeling acoustic variation in speech, especially when abundant training data are available [3]. More recently, there is an increasing trend toward modeling speech directly at the word level [4]. This approach does not require a pronunciation lexicon or explicit phoneme-level modeling. The AM and LM probabilities are estimated jointly with a single network structure known as connectionist temporal classification (CTC) [5]; here the terminologies of the GMM-HMM/DNN-HMM are borrowed to denote these probabilities, as this framework no longer has separate acoustic and language models. This is one type of end-to-end approach [6].

AM training is usually considered a process of supervised learning. The training involves a large number of speech utterances and their transcriptions. A transcription tells what is being spoken in the respective utterance, and hence facilitates an ordered alignment between speech and the model. Training a high-performance DNN-HMM AM for a large-vocabulary application requires hundreds to thousands of hours of transcribed speech. In the Big Data era, collecting a good amount of speech data for popular languages like English and Mandarin is not difficult. However, acquiring or preparing transcriptions for training is far less straightforward and may involve unaffordable human effort. There exists an excessive amount of untranscribed speech for popular languages, which could be exploited to further improve the AM.

There are about 7,000 spoken languages in the world [7]. For most of them, the amount of transcribed speech data is very limited, or even non-existent. Many of these languages, e.g. ethnic minority languages in China and languages in Africa, may have never been formally studied, and knowledge about their linguistic properties is usually incomplete or unavailable. Transcribing speech of these languages is difficult and impractical. Without transcriptions, conventional supervised acoustic modeling cannot be applied straightforwardly. Therefore the investigation of unsupervised acoustic modeling methods has become a focal point of research with significant application impact.
Low-resource speech modeling has gained increasing research interest in recent years. The term low-resource refers to a range of scenarios in which limited linguistic knowledge and a limited amount of transcribed speech are available. In the related research literature, the low-resource assumption generally refers to one of the following two cases:

1. Limited transcribed data and a large amount of untranscribed data;
2. Untranscribed data only.

In the first scenario, a straightforward idea is to train an initial AM with the transcribed data, and then refine the inferred transcriptions and re-train the AM and/or LM in an iterative manner [8]. The inferred transcriptions are obtained by decoding with the updated models, while model retraining is carried out with the new transcriptions. A key research problem is how to effectively utilize the untranscribed data. Previous studies investigated semi-supervised learning methods to improve the seed AM with untranscribed data [9, 10].

The second scenario, which presents a highly restrictive data condition, is the major focus of this thesis. It is commonly known as the zero-resource condition, as virtually no knowledge about the language concerned is assumed to be available [11]. Languages satisfying the zero-resource assumption are referred to as zero-resource languages.

Building AMs for a zero-resource language is much more difficult than in the first scenario. Since there is no prior knowledge of the phoneme inventory, i.e. its size and constituents (some closely related studies [12, 13] assume that the phoneme inventory size of a zero-resource language is known, whereas this thesis assumes that both the size and the constituents are unknown), the first step is to construct a hypothesized set of fundamental speech units in an unsupervised manner. It is expected that these hypothesized units are closely related to the linguistically-defined phonemes of the target language. Once the speech units are defined and modeled, they can be used to tokenize input speech of the target language, and to generate phoneme-like pseudo transcriptions for downstream tasks, e.g. zero-resource ASR and spoken term discovery.

Speech modeling in the zero-resource scenario has significant impact in the broad area of speech and language research. It is widely acknowledged that infants learn their first language primarily from speech interaction with their parents.
This learning process is considered to be minimally supervised or unsupervised in nature [14]. Research in zero-resource speech modeling could therefore help in understanding the mechanism of infant language acquisition. It could also be meaningful to the protection of endangered languages: with automatically discovered speech units, it becomes possible to systematically document raw speech data of an endangered language, and with properly organized speech data, linguists can analyze and derive the linguistic structure of the language in an objective and evidence-based way.

The application of low-resource speech modeling is not limited to those languages that genuinely lack general linguistic knowledge. One possible extension is toward the modeling of pathological speech, e.g. speech affected by voice, articulation and cognitive disorders. Pathological speech can be regarded as low-resource for several reasons. First, collecting pathological speech data is more difficult than collecting normal speech; in the research area of automatic assessment of pathological speech, shortage of training data has long been a major concern. Second, people with pathological symptoms tend to produce speech sounds that deviate greatly from the "norm", and some of these sounds may not even carry linguistic meaning. Third, there exist numerous types of diseases that may cause speech and language impairments, and these diseases may co-exist. It is difficult, if not impossible, to develop feasible annotation or transcription schemes that properly deal with the specificities of the impairments and support standardized transcription for acoustic model training. From this perspective, low-resource speech modeling approaches can be applied to tackle problems in pathological speech processing.
The research presented in this thesis is focused on acoustic modeling for ASR in the zero-resource scenario, i.e. only untranscribed speech data are available, while linguistic knowledge about the target language is completely absent.

Unsupervised discovery of basic speech units is one of the key problems in this investigation. It is the first step in acoustic modeling when the modeling units are unknown or uncertain. We aim at phone-level acoustic modeling, similar to conventional GMM-HMM and DNN-HMM models in ASR.
Since the granularity and the size of the phoneme inventory of a zero-resource language are unknown, supervised acoustic modeling algorithms cannot be directly applied. Separating linguistic information from linguistically-irrelevant information solely based on raw untranscribed speech is the major difficulty in robust unsupervised unit discovery. Ideally, the automatically discovered speech units would constitute the ground-truth phonemes of the target language and be in good correspondence with them. In practice, a wide variety of linguistically-irrelevant variations co-exist in the speech signal, e.g. speaker, emotion and channel. They are encoded together with the linguistic content in the acoustic signal and are not easily separable. In supervised acoustic modeling, manually annotated transcriptions can be relied on to support discovery of acoustic units that is robust to these irrelevant variations. In the concerned zero-resource scenario, speech units can only be inferred from acoustic signals, which may significantly affect the accuracy of unit discovery. For instance, speech sounds carrying the same phoneme but produced by different speakers might be mistakenly modeled as different speech units, due to the effect of speaker difference [15]. To alleviate this problem, a possible research direction is to learn a feature representation that supports subword or word identification both across and within speakers.

Another focus of the thesis is unsupervised learning of frame-level speech features that discriminate fundamental speech units and are robust to linguistically-irrelevant variations. A good feature representation that differentiates linguistic information from non-linguistic information is essential and crucial in unit discovery. The learned feature representation is expected to be superior to conventional spectral representations such as MFCCs or PLPs for unsupervised discovery of fundamental speech units.

Increasing the amount of data is known to be beneficial to acoustic modeling. For a resource-rich language, training data is by nature from the target language. For a zero-resource language, while in-domain data is scarce, abundant resources exist for out-of-domain languages. This thesis investigates a variety of approaches to exploiting out-of-domain, language-mismatched resources to improve unsupervised unit discovery and feature representation learning.
While each language has its distinctive linguistic properties, the speech sounds that can be produced in different languages may overlap significantly, because the basic mechanism of speech production is largely language-independent. Open-source speech corpora containing hundreds of speakers and hundreds to thousands of hours of transcribed speech are available for languages such as English and Mandarin. The use of cross-lingual speaker and language resources to facilitate unsupervised acoustic modeling for zero-resource languages is studied extensively.

In a strict sense, out-of-domain resources are assumed unavailable in zero-resource scenarios. One may argue that unsupervised acoustic modeling for ASR should assume no access to any out-of-domain speech and language resources; from this perspective, 'unsupervised learning' is different from what is being considered in this thesis. Nevertheless, in practice, certain types of resources from major languages are often available.
The unsupervised acoustic modeling problem tackled in this thesis can therefore be considered an extension of the strict-sense unsupervised task.
Unsupervised subword modeling refers to the problem of learning a frame-level feature representation that is discriminative to fundamental speech units and robust to linguistically-irrelevant variations, under the assumption that only untranscribed speech data is available for training.

The problem was formally defined in the Zero Resource Speech Challenge (ZeroSpeech) 2015, a world-wide challenge encouraging low-resource speech modeling. The follow-up challenge, ZeroSpeech 2017 [14], continued to focus on this problem. This formulation enables direct performance comparison of different subword models, unaffected by the quality of back-end decoders. Previous studies on unsupervised subword modeling were evaluated by different means, such as transcriptions, lattices or posteriorgrams, so that direct comparison was not possible.

It was shown that speaker variation is the major difficulty in robust unsupervised subword modeling.
This thesis considers speaker change as the main component of linguistically-irrelevant variation.
Unsupervised unit discovery is the problem of automatically discovering the basic speech units of a language, assuming that only untranscribed speech is available. The goal is to build AMs that cover the entire phoneme inventory of the target language. The outcome of a unit discovery system is a tokenization of the untranscribed speech data with time alignments, using the discovered units as tokens. This problem has been studied extensively under various names, such as unsupervised acoustic modeling [16], self-organized unit (SOU) modeling [17, 18], acoustic unit discovery (AUD) [19, 20], acoustic model discovery [21] and unsupervised lexicon discovery [22, 23]. The results of unsupervised unit discovery are symbolic indices that do not convey explicit linguistic functions or meanings (e.g. vowels, fricatives), and the units may represent either speech or non-speech patterns, such as phonemes, noises and pauses. Performance measures used in supervised acoustic modeling, such as word error rates (WERs) and phoneme error rates (PERs), are seldom used in unsupervised unit discovery, since there is no knowledge of the definition of words or phonemes. In practice, unsupervised unit discovery is measured in terms of the relevance of the tokenization to a gold-standard time-aligned phoneme transcription [24]. Evaluation metrics such as purity, F-score and normalized mutual information (NMI) are commonly used [25].
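As a concrete illustration of two of these metrics, the minimal sketch below computes frame-level cluster purity (majority-vote definition) and NMI for a toy tokenization. The arrays and the purity definition here are illustrative stand-ins rather than the exact evaluation protocol of [24, 25].

```python
# Toy evaluation of a discovered-unit tokenization against reference phones.
import numpy as np
from collections import Counter
from sklearn.metrics import normalized_mutual_info_score

def cluster_purity(hyp: np.ndarray, ref: np.ndarray) -> float:
    """Fraction of frames whose cluster's majority phone matches their phone."""
    majority_sum = 0
    for unit in np.unique(hyp):
        phones_in_cluster = ref[hyp == unit]
        # Count of the most frequent ground-truth phone within this cluster.
        majority_sum += Counter(phones_in_cluster).most_common(1)[0][1]
    return majority_sum / len(hyp)

hyp = np.array([0, 0, 1, 1, 1, 2, 2])   # discovered-unit index per frame
ref = np.array([3, 3, 5, 5, 7, 7, 7])   # ground-truth phone id per frame
print(cluster_purity(hyp, ref))                  # 6/7 ~ 0.857
print(normalized_mutual_info_score(ref, hyp))    # NMI in [0, 1]
```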
The two problems mentioned above are closely related. Unsupervised subword modeling can be considered a front-end optimization process for robust unsupervised unit discovery, while unsupervised unit discovery is the goal of performing unsupervised subword modeling. Moreover, unsupervised unit discovery can also be relied on to improve unsupervised subword modeling. Specifically, a good feature representation that captures subword-discriminative information and is robust to speaker variation has been shown to be beneficial to unit discovery [26].
On the other hand, a set of discovered units having good consistency with the true phonemes of a language can provide phoneme-like speech transcriptions to assist subword-discriminative feature learning [27].
The two problems tackled in this thesis are closely related to other speech tasks. For instance, unsupervised subword modeling can be applied to query-by-example spoken term detection (QbE-STD) for low-resource languages. QbE-STD aims at detecting the audio documents in an archive that contain a specified spoken query [28]. Typically, a QbE-STD system involves two steps: constructing acoustic feature representations for both the query and the audio documents, followed by computing the likelihood of the query occurring somewhere in the audio archive [29]. Feature representation plays an important role in QbE-STD performance. In the low-resource scenario, linguistic knowledge about the audio archive is assumed unknown. Phonetically-discriminative features learned by unsupervised subword modeling are expected to provide a suitable representation for QbE-STD (a minimal sketch of the matching step is given below).

Unsupervised unit discovery is considered a building block for developing a complete unsupervised ASR system. The basic acoustic units learned by unit discovery serve as the basis for lexical modeling, where repetitive sequences of acoustic units are modeled by hypothesized word-like units. After generating word-like patterns and a lexicon, an ASR system for the target language can be built.
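To make the QbE-STD matching step concrete, the following is a minimal sketch of DTW-based matching over learned frame-level features. A practical system would use subsequence DTW (so the query may start anywhere in the document), but the core alignment recursion is the same; all names and sizes are illustrative assumptions.

```python
# Dynamic time warping between a spoken query and an audio document,
# both represented as per-frame feature matrices (e.g. learned BNFs).
import numpy as np

def dtw_cost(query: np.ndarray, doc: np.ndarray) -> float:
    """Normalized DTW alignment cost between two (frames x dims) matrices."""
    Q, D = len(query), len(doc)
    # Frame-pair distances; cosine distance is a common choice for BNFs.
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    dn = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    dist = 1.0 - qn @ dn.T
    acc = np.full((Q, D), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(D):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (Q + D)   # lower cost = better match

query = np.random.randn(20, 40)    # 20-frame query, 40-dim features
doc = np.random.randn(200, 40)     # 200-frame audio document
print(dtw_cost(query, doc))
```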
Chapter 2 provides a review of the problems of unsupervised subword modeling and unsupervised unit discovery. For unsupervised subword modeling, considerations on model structures, input feature types and frame labeling approaches are discussed. Three modeling frameworks for unsupervised unit discovery are also reviewed.

Chapter 3 describes the DNN-bottleneck feature (BNF) framework for unsupervised subword modeling, which defines the baseline system in this study. The objectives, database and evaluation metric of the ZeroSpeech 2017 Challenge are introduced.
The database and the evaluation metric are used throughout the thesis.

Chapter 4 is focused on applying speaker adaptation to unsupervised subword modeling. Experiments on ZeroSpeech 2017 are described, with comparison to the baseline system of Chapter 3. The combination of these approaches is also studied.

Chapter 5 describes the proposed approaches to improving clustering-based frame labeling in unsupervised subword modeling. Two different types of frame labels are generated. Experimental results on ZeroSpeech 2017 are given.

Chapter 6 deals with the problem of unsupervised unit discovery within the acoustic segment modeling (ASM) framework, and presents a proposed evaluation metric for analyzing the linguistic relevance of discovered subword units. Experiments on the OGI multilingual telephone speech corpus are presented.

Chapter 7 concludes this thesis, summarizes the main contributions and suggests directions for future work.

Chapter 2

Related works
This chapter provides a detailed literature review on the two research problemsconcerned, namely, unsupervised subword modeling and unsupervised unit discovery.
Unsupervised subword modeling is formulated as a problem of feature representation learning. The key issue is how to retain linguistically relevant information and suppress non-linguistic variation of speech signals by learning from a large amount of data. This problem has gained increasing research interest in recent years. Relevant works were mostly conducted on the ZeroSpeech 2015 and 2017 challenge datasets and evaluation metrics, which facilitate informative performance comparison.

Deep neural networks (DNNs) are widely investigated in unsupervised subword modeling. Typically, a DNN model is trained on the given speech data, and the learned features are obtained either from a designated low-dimension hidden layer of the DNN, known as bottleneck features (BNFs) [30], or from the softmax output layer, known as posterior features or posteriorgrams [31]. Various DNN structures have been investigated, as discussed in Section 2.1.1.

Feature learning can also be realized with other machine learning techniques, and this approach has demonstrated performance competitive with DNN-based methods [27, 32]. Clustering of frame-level features is the key step, aiming at generating a number of frame clusters, each of which desirably corresponds to a discovered subword unit.
By representing each cluster with a probability distribution, such as a Gaussian distribution, a cluster posteriorgram can be constructed as the learned representation [27].

Previous studies suggested that two key issues contribute to the performance of unsupervised subword modeling: the discriminability of the input features, and the fitness of the labels. These issues are crucial to both DNN- and non-DNN-based models. Investigations of these two issues in previous studies are reviewed in Sections 2.1.2 and 2.1.3, respectively.
The DNN model structures proposed for feature learning can be divided into three categories based on the training strategy, namely supervised, unsupervised and weakly/pair-wise supervised.

Supervised models, e.g. the multi-layer perceptron (MLP), have been widely used in unsupervised subword modeling [1, 30, 31, 33]. Training these models requires labels of speech frames or segments, typically phone identities. In the resource-rich scenario, transcriptions can be used to generate frame labels, whereas the acquisition of frame labels becomes a challenging problem in the zero-resource scenario. It is highly desirable to derive some kind of initial frame labels that correspond well to ground-truth phone alignments. With these initial labels, supervised model training can be applied, and the trained model can be used to generate updated labels for the input speech. BNFs or posteriorgrams extracted from the trained models can serve as the subword-discriminative representation (a minimal sketch of such a bottleneck network is given at the end of this subsection). There have also been attempts to eliminate the need for frame label acquisition [1, 34]; typically, a DNN AM is trained with transcribed speech from an out-of-domain language and used to generate BNFs or posteriorgrams.

Unsupervised neural network models, e.g. the auto-encoder (AE) [31, 35], denoising AE (dAE) [36] and variational AE (VAE) [37], do not require any target labels for training. These models are trained to learn a compact intermediate-layer representation that is capable of reconstructing the input representation (or a denoised input representation, as in the dAE). The encoded representation is known as an embedding, and retains the linguistic information needed to reconstruct the input speech.
The embedding is also expected to be free of linguistically-irrelevant variation, and hence intrinsically suitable for subword modeling. In [36], the AE and dAE were compared on the unsupervised subword modeling task. In [31], the AE was compared with supervised models. In [37], the vector-quantized VAE (VQ-VAE) was used to disentangle linguistic content and speaker characteristics, and achieved better performance than the AE. Although training of unsupervised models imposes less stringent data requirements, as it does not need transcriptions, their performance is in general not as good as that of supervised models. This is probably because, in the absence of supervision, it is harder to separate subword-discriminative information from non-linguistic information.

Pair-wise supervised models, such as the correspondence AE (cAE) [36] and the siamese network [38], are useful because they leverage knowledge about speech segment pairs that correspond to the same linguistic unit (word or subword) [39–41]; such pair-wise relational information provides top-down constraints to guide linguistic unit discrimination [40]. Compared with the self-supervision in AEs, pair-wise supervision provides additional information about linguistically-irrelevant variations, which is beneficial to robust subword modeling. In [36], the cAE was trained with pairs of segments representing the same linguistic units. In [38], the siamese network was trained with pairs of input features, the objective being to determine whether the two segments correspond to the same linguistic unit. In practice, pair-wise information may not be directly available for low-resource languages; unsupervised term discovery (UTD) [42] was suggested as a feasible approach to obtaining such information.
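To make the supervised BNF idea concrete, below is a minimal PyTorch sketch of an MLP with a designated low-dimensional bottleneck layer: it is trained against frame labels (e.g. cluster indices) and later used to extract BNFs. Layer sizes, label counts and names are illustrative assumptions, not the configurations of the cited studies.

```python
# A bottleneck MLP: train with frame labels, then read out the bottleneck.
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    def __init__(self, in_dim=39, hidden=512, bn_dim=40, n_labels=500):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(hidden, bn_dim)   # low-dim BNF layer
        self.post = nn.Sequential(
            nn.ReLU(), nn.Linear(bn_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_labels),              # logits over frame labels
        )

    def forward(self, x, extract_bnf=False):
        h = self.bottleneck(self.pre(x))
        return h if extract_bnf else self.post(h)

model = BottleneckMLP()
frames = torch.randn(8, 39)                # a batch of MFCC-like frames
logits = model(frames)                     # training: cross-entropy vs labels
bnf = model(frames, extract_bnf=True)      # inference: 40-dim BNFs
```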
Input feature representation plays a critical role in unsupervised subword modeling. Early studies used conventional spectral features such as MFCCs [27], filter banks (FBanks) [35] and PLPs [43]. These features have been widely used in ASR acoustic modeling. Nevertheless, it is widely understood that spectral features are not optimal for unsupervised subword modeling [14], as they contain abundant linguistically-irrelevant variations caused by, e.g., speaker and noise, which may negatively impact the modeling [30, 44].

It was found that linear transforms estimated from spectral features can improve frame clustering for subword modeling to a great extent.
These transforms reduce the linguistically-irrelevant variations encoded in speech while retaining linguistic information. In [44, 45], linear discriminant analysis (LDA) and the maximum likelihood linear transform (MLLT) demonstrated noticeable improvement. LDA and MLLT are widely used in supervised acoustic modeling: LDA minimizes intra-class variability and maximizes inter-class discriminability of the speech features, while MLLT de-correlates the feature components [44]. In [44, 46], feature-space maximum likelihood linear regression (fMLLR) based speaker adaptive training (SAT) was found to achieve further improvement over LDA+MLLT transforms. The estimation of LDA, MLLT and fMLLR requires transcribed speech; in [44–46], the supervision was obtained by two-pass frame clustering (a minimal sketch of this idea is given at the end of this subsection). In [30], speaker-normalized MFCCs with vocal tract length normalization (VTLN) [47] were found to be better than raw MFCCs.

Input feature representation also plays a critical role in DNN-based unsupervised subword modeling. In [48], deep scattering features were shown to outperform FBanks as input features for siamese network training; deep scattering features are stable and can be efficiently exploited by classifiers, while retaining richer information than FBanks [48]. In [1, 34], fMLLR features estimated by an out-of-domain, language-mismatched ASR system were shown to be better than MFCCs as input features. The observations in [1, 34, 48] demonstrate the importance of input feature selection in DNN acoustic modeling. DNN acoustic models are preferred for their high-level representation learning capability, i.e., retaining subword-discriminative information in the input features while suppressing non-linguistic information to a great extent; perhaps for this reason, little attention has been paid to the selection of DNN input features for unsupervised acoustic modeling, except in the aforementioned studies. In this thesis, approaches to learning input features that improve both DNN acoustic modeling and frame clustering are studied extensively.
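The two-pass idea of [44–46] (cluster first, then estimate a supervised transform from the pseudo labels) can be sketched as below with scikit-learn. The MLLT and fMLLR steps of the full recipe are omitted, and all names and sizes are illustrative assumptions.

```python
# Estimate an LDA projection from pseudo labels produced by a first-pass
# clustering, then project the frames for a second clustering pass.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

frames = np.random.randn(5000, 39)                    # MFCC-like frames
pseudo = KMeans(n_clusters=50, n_init=5).fit_predict(frames)  # first pass
# LDA components must be <= min(n_classes - 1, n_features).
lda = LinearDiscriminantAnalysis(n_components=38).fit(frames, pseudo)
frames_lda = lda.transform(frames)                    # input to second pass
```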
Frame labeling is an important step in unsupervised subword modeling. It aims to create a subword-like tokenization of untranscribed speech, with which feature learning models can be trained. Frame labels are mainly used for supervised DNN acoustic model training [30], as well as for posteriorgram generation [32, 49].
Frame labeling approaches can be divided into two categories, namely frame clustering and out-of-domain ASR decoding.

Frame clustering is a process of grouping together speech frames that have similar feature representations. Clustering-based frame labeling approaches assume that speech frames belonging to the same ground-truth subword unit are closer in the frame-level feature space than those belonging to different units. Under this assumption, the cluster indices assigned to speech frames are expected to constitute a subword-level time alignment of the untranscribed speech. Various clustering algorithms have been investigated in the literature. In [27], the Dirichlet process Gaussian mixture model (DPGMM) [50] was applied to frame clustering and demonstrated superior performance in unsupervised subword modeling; the system described in [27] achieved the best performance in ZeroSpeech 2015 [11]. The DPGMM does not require a pre-defined number of clusters, which makes it suitable for frame clustering, as the number of subword units in a low-resource language is usually unknown. The success of the DPGMM motivated subsequent studies on unsupervised subword modeling [30, 33, 44, 45]. Other clustering algorithms have also been studied, e.g. the GMM-universal background model (GMM-UBM) [49] and its variant, the hidden Markov model-UBM (HMM-UBM) [49], in which HMM training is applied after obtaining GMM clustering results; note that the aforementioned DPGMM algorithm differs from the GMM-UBM only in the prior distribution. In addition, k-means was studied for the concerned task [51]. Overall, the reported results show that the DPGMM is by far the most effective algorithm for speech frame clustering (a minimal clustering sketch is given at the end of this subsection).

Exploiting out-of-domain resources for frame labeling can be considered a transfer learning approach and can be realized in different manners. In [1], an ASR system for an out-of-domain, resource-rich language was utilized to decode the target speech in a language-mismatched manner and to generate frame labels from the decoding lattices. In this way, the target untranscribed speech is tokenized by the phoneme inventory of the out-of-domain language. While each language has its distinctive linguistic properties, the speech sounds that can be produced in different languages may overlap significantly, because the basic mechanism of speech production is largely language-independent [52].
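For a hands-on impression of Dirichlet-process-style frame clustering, the sketch below uses scikit-learn's truncated variational DPGMM. Note that the studies cited above (e.g. [27, 50]) use MCMC samplers rather than this variational approximation, and the data here is a random stand-in for MFCC frames.

```python
# Frame clustering with a truncated variational Dirichlet process GMM.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

frames = np.random.randn(5000, 39)        # stand-in for MFCC frames
dpgmm = BayesianGaussianMixture(
    n_components=50,                      # truncation level, not the final count
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=200,
).fit(frames)
labels = dpgmm.predict(frames)            # frame labels, e.g. for DNN supervision
print("active clusters:", np.unique(labels).size)
```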
Unsupervised unit discovery aims at finding acoustically homogeneous basic speech units, desirably equivalent to subword units or phonemes, from untranscribed speech data. An unsupervised unit discovery system typically involves three sub-problems, namely speech segmentation, segment labeling and subword modeling. Despite the variety of system frameworks reported in the literature, this formulation of the sub-problems is widely shared [21, 53, 54].

Three types of models have been applied to unsupervised unit discovery, i.e., acoustic segment models (ASMs), nonparametric inference models and top-down constraint models. All three have been investigated extensively.
The acoustic segment model (ASM) was first proposed by Lee et al. [53]. This work has had a far-reaching impact on subsequent studies of unsupervised unit discovery. In the ASM approach, initial segmentation, segment labeling and iterative acoustic modeling are performed in a sequential manner. During initial segmentation, each speech utterance is divided into a sequence of variable-length segments, whose estimated boundaries are desirably in synchrony with the ground-truth phoneme boundaries. Subsequently, speech segments with similar acoustic characteristics are grouped together and labeled by the same index or symbol. In this way, the segment labels constitute a tokenization of the input speech: a sequence of symbols with time boundaries, which can be regarded as a phone-level time-aligned transcription. With this phone-level transcription, conventional acoustic model training techniques can be applied to obtain hypothesized subword models, and these subword models can in turn be used to decode the training speech into updated transcriptions. In short, subword model training and transcription generation are carried out iteratively. In [18], a similar approach to the ASM was proposed, in which the discovered subword units were referred to as self-organized units (SOUs).

There have been follow-up studies aiming to improve individual stages of the ASM framework, most of them focusing on better initial segmentation and segment labeling.
The task of initial segmentation is to obtain subword boundaries without using any prior knowledge about the speech content of the input utterance. Lee et al. [53] applied a dynamic programming-based maximum-likelihood estimation approach to speech segmentation, first proposed in [55]. Qiao et al. [56] developed a bottom-up hierarchical algorithm, which was later applied in other studies [25]. Pereiro Estevan et al. [57] proposed a maximum-margin clustering algorithm. Scharenborg et al. [58] extended [57] with a two-step method that uses a mix of bottom-up information from the speech signal and top-down information. Torbati et al. [59] suggested using a Bayesian HMM with a DP prior. In [60], segment boundaries were detected by locating peaks in a spectral transition measure. Vetter et al. [61] and our previous study [52] exploited out-of-domain cross-lingual ASR systems to obtain hypothesized phoneme boundaries; these two studies were presented almost at the same time. Michel et al. [62] presented a blind phoneme segmentation method that designates peaks in the error curve of a frame-prediction model as potential boundaries.

Segment labeling is by nature a clustering problem. Speech segments derived from the initial segmentation are first represented by fixed-dimension feature vectors, which are used as inputs to clustering, and the resulting cluster indices are regarded as segment labels. In [53], the fixed-dimension representation of variable-length speech segments was obtained by vector quantization (VQ). Siu et al. [18] investigated a segmental GMM approach to speech segment clustering; segmental GMMs use a polynomial function to approximate the trajectory of speech features within a segment. Wang et al. [63] applied a GMM to cluster and label speech segments, labeling each segment with the index of the Gaussian component that gives the highest likelihood for that segment. This approach was further extended to more sophisticated clustering algorithms, namely Gaussian component clustering (GCC) and direct segment clustering [64]. In [16], variants of GCC were investigated and discussed, comparing different objective functions, constraint formulations and similarity measures. Wang et al. [25] investigated spectral clustering and its combination with GCC in segment labeling, and reported improved performance compared to using GCC only.
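A minimal sketch of the segment-labeling stage follows: each variable-length segment is mapped to a fixed-dimension vector (here simply the mean posteriorgram over its frames), and the vectors are clustered. k-means is used as an accessible stand-in for the GMM/GCC/spectral-clustering methods cited above; all data is synthetic.

```python
# Segment labeling: fixed-dimension segment vectors + clustering.
import numpy as np
from sklearn.cluster import KMeans

def segment_vector(posteriorgram: np.ndarray) -> np.ndarray:
    """Average a (frames x units) posteriorgram over time."""
    return posteriorgram.mean(axis=0)

# Toy segments: variable-length posteriorgrams over 30 phone-like units.
segments = [np.random.dirichlet(np.ones(30), size=np.random.randint(5, 20))
            for _ in range(500)]
X = np.stack([segment_vector(s) for s in segments])
labels = KMeans(n_clusters=50, n_init=5).fit_predict(X)  # one label per segment
```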
At the stage of iterative acoustic modeling, GMM-HMMs [18, 53] and DNN-HMMs [65] have been adopted. With these models, Viterbi decoding can be performed to update the segment labels and time alignments. The iterative process generally results in better tokenization and better subword models [25].
In addition to the ASM-based approaches, nonparametric inference has been investigated extensively for unsupervised unit discovery [19–21, 66–68]. In [21], a Bayesian model was applied to perform speech segmentation, segment clustering and subword modeling as a joint optimization process; these three tasks are arranged sequentially in the ASM. The nonparametric Bayesian model infers subword units, segments frames and clusters segments in an integrated manner. Gibbs sampling (GS) was used for parameter inference.

The Bayesian model of [21] has been refined in a broad range of investigations. For instance, in the work by Ondel et al. [20], variational Bayes (VB) was used to replace GS, so as to enable inference on a large-scale speech dataset; VB was shown to be better than GS in terms of both speed and accuracy. A follow-up work by Ondel et al. [66] extended [20] by applying a nonparametric Bayesian language model to better utilize contextual information in speech. Another follow-up work [67] investigated the use of an informative prior in the VB model [20], obtained from an out-of-domain, resource-rich language. Ebbers et al. [68] extended the works in [18, 20, 21] by replacing the GMM with a VAE, in order to leverage the capability of the VAE in modeling sophisticated emission distributions; the proposed model, named HMM-VAE, was applied in a series of subsequent works [19, 67]. In [19], the HMM-VAE model is embedded in a Bayesian framework with a DP prior over the distribution of acoustic units, so that the number of subword units to be discovered need not be pre-defined, as the DP automatically determines the optimal number. In a more recent work by Ondel et al. [54], a Bayesian subspace GMM (SGMM) model was proposed to restrict the unit discovery system to modeling the phonetic content of speech while ignoring linguistically-irrelevant variations such as speaker and noise. To estimate the phonetic space, the models were trained with transcribed speech of resource-rich languages, under the assumption that the phonetic subspaces of the target low-resource language and a resource-rich language could be close, given that the two languages have phones in common.
The ASM and the nonparametric inference models described above are related in that both aim at discovering subword units and both start modeling at the frame level. They can be regarded as bottom-up approaches [15], which have clear drawbacks. In the review article [15], it was concluded that bottom-up methods tend to over-cluster the realization variants of the same phonetic identity; such fine-grained clusters may reflect variation caused by speaker differences, changes of speaking style, environment, etc. As a result, it is hard for bottom-up modeling approaches to discriminate whether two speech realizations belong to the same phone. On the other hand, unsupervised discovery of word-sized units is considered less ambiguous than that of subword-like ones [69], as the similarity between realizations of a word spoken by different speakers is much more prominent than that of a phoneme [15]. It is expected that word-level information, if available, could be exploited as top-down supervision or constraints for subword unit discovery [40, 69]. This reflects the intrinsic structure of speech, which embraces and encodes multi-scale information, ranging from low-level subwords (phonemes) to higher-level syllables, words, etc.

There have been a few attempts at incorporating word-level information as top-down constraints to assist unsupervised unit discovery [69–72]. Under the zero-resource assumption, word identity information is unavailable. One possible approach is to assume that same-different word-pair information is available or at least obtainable [69, 73]. Jansen et al. [69] tackled subword unit discovery by first generating same-different word-pair information via a spoken term discovery system [42], and then applying the word-pair supervision to guide the partitioning of a pre-estimated UBM into different GMMs, each corresponding to a learned subword unit. In the UBM partitioning process, the word-pair information was used to construct a similarity matrix over UBM components, on which spectral clustering was applied to group the components. The results in [69] demonstrated the usefulness of word-pair information as top-down constraints in subword unit discovery.
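The sketch below illustrates the flavor of this UBM partitioning: a component-level affinity matrix is accumulated from frame pairs aligned within same-word pairs, and spectral clustering groups the UBM components into hypothesized subword units. The affinity construction here is a simplified assumption, not the exact recipe of [69], and the aligned pairs are random stand-in data.

```python
# Partition UBM Gaussians into subword units using word-pair constraints.
import numpy as np
from sklearn.cluster import SpectralClustering

n_comp = 256                                  # number of UBM components
cooc = np.zeros((n_comp, n_comp))
# `aligned_pairs` would hold (component_a, component_b) index tuples taken
# from DTW-aligned frames of same-word pairs; random stand-in here.
aligned_pairs = np.random.randint(0, n_comp, size=(10000, 2))
for a, b in aligned_pairs:
    cooc[a, b] += 1
    cooc[b, a] += 1
sim = cooc / (cooc.max() + 1e-8)              # normalized symmetric affinity
units = SpectralClustering(n_clusters=50, affinity="precomputed")\
    .fit_predict(sim + 1e-6)                  # each cluster = a subword unit
```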
Another approach to exploiting top-down constraints in unsupervised unit discovery is to model subword- and word-like units simultaneously. Chung et al. [71] presented a two-level acoustic pattern discovery model that jointly discovers subword and word units. The system comprises three stages, corresponding to the optimization of the acoustic model, the language model and the pronunciation lexicon, respectively. Word-like acoustic patterns are discovered by identifying subword-like patterns that frequently appear together, and top-down constraints are constructed by learning a lexicon of word-like acoustic patterns, such that each word unit consists of a sequence of subword units. In this model, the top-down constraints and the subword unit discovery are updated iteratively. The experimental results revealed that word-like unit constraints are useful in subword unit discovery. The approach in [71] was later applied by Chung et al. [72] as a speech tokenizer to tackle both unsupervised unit discovery and feature representation learning.

Chapter 3

Unsupervised subword modeling: a DNN-BNF framework
This chapter introduces the DNN-BNF modeling framework for tackling unsupervised subword modeling, and defines the task of frame-level feature representation learning. The DNN-BNF framework serves as the baseline in this thesis; the approaches to robust unsupervised subword modeling proposed in Chapters 4 and 5 are all based on it.

This chapter also describes the database and evaluation metric of ZeroSpeech 2017 Track 1: unsupervised subword modeling [14], which are used in system performance evaluation throughout this thesis.

This chapter is organized as follows. Section 3.1 provides an overview of the DNN-BNF modeling framework. Frame labeling and supervised DNN-BNF modeling are presented in Sections 3.2 and 3.3. Section 3.4 gives an overview of the ZeroSpeech 2017 database, and Section 3.5 introduces the evaluation metric of ZeroSpeech 2017. Experiments and discussion on the baseline system are presented in Section 3.6.
Figure 3.1: DNN-BNF framework for unsupervised subword modeling (target untranscribed speech: MFCCs → frame labeling → labels → DNN-BNF modeling → multilingual BNFs → evaluation).
The DNN-BNF modeling framework consists of two stages, namely frame labeling and supervised DNN-BNF modeling. The frame labels obtained in the first stage provide supervision for DNN training in the second stage, and the trained DNN is used to extract BNFs as the subword-discriminative representation of the target zero-resource speech. The general framework is illustrated in Figure 3.1. The DNN-BNF modeling framework was investigated in many previous studies [27, 30], and achieved very good performance in ZeroSpeech 2015 [11] and 2017 [14].
Frame labeling is an essential step that prepares the target untranscribed speech for supervised DNN-based subword modeling. Frame labels with good correspondence to ground-truth phoneme time alignments are highly desired. One way to perform frame labeling is by clustering. DPGMM clustering [50] has been widely applied for frame labeling [27, 30, 45], and it is adopted in our system.
The DPGMM is a non-parametric Bayesian extension of the GMM, in which a Dirichlet process prior is placed over the mixture components. The DPGMM does not require a pre-defined cluster number. This makes the DPGMM intrinsically suitable for unsupervised subword modeling, as the number of subword units contained in a zero-resource language is usually unknown.
Figure 3.2: Graphical illustration of the DPGMM.
The graphical illustration of the DPGMM is shown in Figure 3.2. Given a set of observations $\{x_1, x_2, \ldots, x_N\}$, the DPGMM assumes that they are generated by the following random process:

1. Mixture weights $\pi = \{\pi_k\}_{k=1}^{\infty}$ are generated according to a stick-breaking process with concentration parameter $\alpha$ [74].
2. Gaussian component parameters $\theta = \{\theta_k\}_{k=1}^{\infty}$ are generated according to a Normal-inverse-Wishart (NIW) distribution with hyperparameters $\theta_0$ [75].
3. A discrete label $z_i$ is sampled from the Gaussian components $\{k\}_{k=1}^{\infty}$ according to the distribution $\pi$.
4. $x_i$ is drawn from the $z_i$-th Gaussian component $\theta_{z_i}$.

Here, $\theta_k = \{\mu_k, \Sigma_k\}$ consists of the mean and covariance parameters of the $k$-th Gaussian component, and $\theta_0 = \{m_0, S_0, \kappa_0, \nu_0\}$ consists of the NIW hyperparameters, where $m_0$ and $S_0$ are the prior means for $\mu_k$ and $\Sigma_k$ respectively, and $\kappa_0$ and $\nu_0$ are the belief strengths in $m_0$ and $S_0$ respectively.
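The generative process above can be simulated directly. The sketch below draws from a truncated stick-breaking prior and then samples labels and observations; for readability it replaces the full NIW draw of step 2 with a simple Gaussian prior on the component means, so it illustrates the process rather than the exact model. All hyperparameter values are illustrative.

```python
# Simulate a (truncated) DPGMM generative process.
import numpy as np

rng = np.random.default_rng(0)
alpha, trunc, dim = 1.0, 50, 2

# 1. Stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha).
v = rng.beta(1.0, alpha, size=trunc)
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
pi /= pi.sum()                               # renormalize after truncation

# 2. Component parameters (simplified: Gaussian prior on means, unit covariance,
#    standing in for the NIW draw).
mu = rng.normal(0.0, 5.0, size=(trunc, dim))

# 3.-4. Sample labels z_i from pi, then x_i from component z_i.
N = 1000
z = rng.choice(trunc, size=N, p=pi)
x = mu[z] + rng.normal(0.0, 1.0, size=(N, dim))
```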
The parallelizable split-and-merge sampler conducts inference by alternating between restricted DPGMM Gibbs sampling and split/merge sampling.

Restricted DPGMM Gibbs sampling
Restricted DPGMM Gibbs sampling assumes a discrete label $z_i$ is sampled from a finite number ($K$) of Gaussian components. Let us denote the set of discrete labels as $Z = \{z_1, z_2, \ldots, z_N\}$. The posterior sampling of $\pi$ is denoted as,

$$\pi = (\pi_1, \pi_2, \ldots, \pi_K, \pi'_{K+1}) \sim \mathrm{Dir}(N_1, N_2, \ldots, N_K, \alpha), \quad (3.1)$$
$$\pi'_{K+1} = 1 - \sum_{k=1}^{K} \pi_k, \quad (3.2)$$
$$N_k = \sum_{i=1}^{N} \delta(z_i = k), \quad (3.3)$$

where $\alpha$ can be understood as the relative probability of assigning an observation to a new Gaussian component, as opposed to an existing one. The sampling of $\{\theta_1, \theta_2, \ldots, \theta_K\}$ is denoted as,

$$\theta_k = \{\mu_k, \Sigma_k\} \overset{\propto}{\sim} \mathrm{NIW}(m_k, S_k, \kappa_k, \nu_k), \quad k \in \{1, 2, \ldots, K\}, \quad (3.4)$$

where $x \overset{\propto}{\sim} y$ represents sampling $x$ from a distribution proportional to $y$. The parameters of the NIW are computed as,

$$\kappa_k = \kappa_0 + N_k, \quad (3.5)$$
$$\nu_k = \nu_0 + N_k, \quad (3.6)$$
$$m_k = \frac{\kappa_0 m_0 + N_k \bar{x}_k}{\kappa_k}, \quad (3.7)$$
$$S_k = S_0 + \sum_{i=1}^{N} \delta(z_i = k)\, x_i x_i^{\top} + \kappa_0 m_0 m_0^{\top} - \kappa_k m_k m_k^{\top}, \quad (3.8)$$

where $\bar{x}_k$ is computed as,

$$\bar{x}_k = \frac{\sum_{i=1}^{N} \delta(z_i = k)\, x_i}{\sum_{i=1}^{N} \delta(z_i = k)}, \quad (3.9)$$

and can be interpreted as the mean of the observations assigned label $k$.
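As an illustration of the conjugate updates in Equations (3.5)-(3.9), the following sketch computes the posterior NIW parameters of one component from the current label assignment. The variable names mirror the prior parameters in the text; this is a didactic illustration, not the parallel sampler of [50].

```python
# Posterior NIW parameter updates, Eqs. (3.5)-(3.9) (illustrative sketch).
import numpy as np

def niw_posterior(X, z, k, m0, S0, kappa0, nu0):
    # X: (N, D) observations; z: (N,) current component labels
    mask = (z == k)                       # delta(z_i = k)
    Nk = int(mask.sum())                  # Eq. (3.3)
    kappa_k = kappa0 + Nk                 # Eq. (3.5)
    nu_k = nu0 + Nk                       # Eq. (3.6)
    xbar = X[mask].mean(axis=0) if Nk > 0 else np.zeros_like(m0)  # Eq. (3.9)
    m_k = (kappa0 * m0 + Nk * xbar) / kappa_k                     # Eq. (3.7)
    scatter = X[mask].T @ X[mask]         # sum of x_i x_i^T over the cluster
    S_k = (S0 + scatter
           + kappa0 * np.outer(m0, m0)
           - kappa_k * np.outer(m_k, m_k))                        # Eq. (3.8)
    return m_k, S_k, kappa_k, nu_k
```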
Note that Equations (3.5) to (3.8) were applied in [81] to perform maximum a posteriori (MAP) estimation of GMM parameters, for model adaptation in GMM-HMM based ASR systems. The sampling of $z_i$ is denoted as,

$$z_i \overset{\propto}{\sim} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, \mathbb{1}(z_i = k), \quad (3.10)$$

where $\mathbb{1}(z_i = k)$ is a $K$-dimension one-hot vector whose $z_i$-th dimension is 1.

Split/merge sampling
Split/merge sampling consists of two steps, namely, splitting each Gaussian component into sub-components, and Metropolis-Hastings split/merge.

In the first step, each Gaussian component $k$ is split into two sub-components with mixture weights $\tilde{\pi}_k = \{\tilde{\pi}_{kl}, \tilde{\pi}_{kr}\}$ and parameters $\tilde{\theta}_k = \{\tilde{\theta}_{kl}, \tilde{\theta}_{kr}\}$. Each observation $x_i$ is assigned a sub-component label $\tilde{z}_i \in \{l, r\}$. The following steps are used to sample the sub-components:

$$\{\tilde{\pi}_{kl}, \tilde{\pi}_{kr}\} \sim \mathrm{Dir}(N_{kl} + \alpha, N_{kr} + \alpha), \quad (3.11)$$
$$\tilde{\theta}_{kl} \overset{\propto}{\sim} \mathcal{N}(x_{kl} \mid \tilde{\theta}_{kl})\, \mathrm{NIW}(\tilde{\theta}_{kl} \mid \theta_0), \quad (3.12)$$
$$\tilde{\theta}_{kr} \overset{\propto}{\sim} \mathcal{N}(x_{kr} \mid \tilde{\theta}_{kr})\, \mathrm{NIW}(\tilde{\theta}_{kr} \mid \theta_0), \quad (3.13)$$
$$\tilde{z}_i \overset{\propto}{\sim} \sum_{s \in \{l, r\}} \delta(\tilde{z}_i = s)\, \tilde{\pi}_{z_i, s}\, \mathcal{N}(x_i \mid \tilde{\theta}_{z_i, s}), \quad (3.14)$$

where $x_{kl}$ and $x_{kr}$ denote the observations assigned to sub-components $l$ and $r$ of component $k$, and $N_{kl}$ and $N_{kr}$ are defined as,

$$N_{kl} = \sum_{i=1}^{N} \delta(z_i = k)\, \delta(\tilde{z}_i = l), \quad (3.15)$$
$$N_{kr} = \sum_{i=1}^{N} \delta(z_i = k)\, \delta(\tilde{z}_i = r), \quad (3.16)$$

and can be interpreted as the numbers of observations belonging to component $k$ and assigned sub-component labels $l$ and $r$ respectively.

In the second step, split or merge moves are proposed in a Metropolis-Hastings (MH) fashion [27]. In the following description, the notation $\hat{A}$ denotes the proposal for
the variable $A$. Let $Q \in \{Q^{\mathrm{split}}_c, Q^{\mathrm{merge}}_{m,n}\}$ denote a proposal move selected randomly from split and merge. $Q^{\mathrm{split}}_c$ denotes splitting Gaussian component $c$ into $m$ and $n$; $Q^{\mathrm{merge}}_{m,n}$ denotes merging components $m$ and $n$ into $c$. Conditioned on $Q^{\mathrm{split}}_c$, variables are sampled as,

$$(\hat{Z}_m, \hat{Z}_n) = \mathrm{split}_c(Z, \tilde{Z}), \quad (3.17)$$
$$(\hat{\pi}_m, \hat{\pi}_n) = \pi_c\, \pi_{\mathrm{sub}}, \quad \pi_{\mathrm{sub}} = (\pi_m, \pi_n) \sim \mathrm{Dir}(\hat{N}_m, \hat{N}_n), \quad (3.18)$$
$$(\hat{\theta}_m, \hat{\theta}_n) \sim q(\hat{\theta}_m, \hat{\theta}_n \mid \mathcal{X}, \hat{Z}, \hat{\tilde{Z}}), \quad (3.19)$$
$$(\hat{\tilde{v}}_m, \hat{\tilde{v}}_n) \sim p(\hat{\tilde{v}}_m, \hat{\tilde{v}}_n \mid \mathcal{X}, \hat{Z}). \quad (3.20)$$

Conditioned on $Q^{\mathrm{merge}}_{m,n}$, variables are sampled as,

$$\hat{Z}_c = \mathrm{merge}_{m,n}(Z), \quad (3.21)$$
$$\hat{\pi}_c = \hat{\pi}_m + \hat{\pi}_n, \quad (3.22)$$
$$\hat{\theta}_c \sim q(\hat{\theta}_c \mid \mathcal{X}, \hat{Z}, \hat{\tilde{Z}}), \quad (3.23)$$
$$\hat{\tilde{v}}_c \sim p(\hat{\tilde{v}}_c \mid \mathcal{X}, \hat{Z}). \quad (3.24)$$

The function $\mathrm{split}_c(\cdot)$ splits the labels of Gaussian component $c$ according to the assignment of sub-components, and $\mathrm{merge}_{m,n}(\cdot)$ merges the labels of components $m$ and $n$. With the Hastings ratio $H$ computed as suggested in [50], the proposed split/merge moves are accepted with probability $\min\{1, H\}$ in a Metropolis-Hastings MCMC framework.

DPGMM is applied to perform frame labeling on untranscribed speech data. Let us consider $M$ zero-resource languages. For the $i$-th language, the frame-level MFCC features are denoted as $\{o_{i1}, o_{i2}, \ldots, o_{iT}\}$, where $T$ is the total number of frames. By applying DPGMM clustering to the $T$ frames, $K$ Gaussian components $\theta$ together with their mixture weights $\pi$ are obtained to represent $K$ clusters of frame-level features. The frame-level labels $\{l_{i1}, l_{i2}, \ldots, l_{iT}\}$ are obtained by

$$l_{it} = \arg\max_{1 \le k \le K} P(k \mid o_{it}), \quad (3.25)$$
where $P(k \mid o_{it})$ denotes the posterior probability of $o_{it}$ with respect to the $k$-th Gaussian component, computed as,

$$P(k \mid o_{it}) = \frac{\pi_k\, \mathcal{N}(o_{it} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(o_{it} \mid \mu_j, \Sigma_j)}. \quad (3.26)$$

These frame labels are referred to as DPGMM labels in this thesis.
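A minimal sketch of Equations (3.25)-(3.26) follows: computing the GMM posteriorgram of each frame and taking the argmax as its DPGMM label. It assumes the weights, means and covariances produced by a converged sampler; the function name is illustrative.

```python
# Frame labeling from a trained DPGMM: posteriorgram (3.26) + argmax (3.25).
import numpy as np
from scipy.stats import multivariate_normal

def dpgmm_labels(frames, weights, means, covs):
    # frames: (T, D); weights: (K,); means: (K, D); covs: (K, D, D)
    K = len(weights)
    lik = np.stack(
        [weights[k] * multivariate_normal.pdf(frames, means[k], covs[k])
         for k in range(K)], axis=1)                    # (T, K)
    post = lik / lik.sum(axis=1, keepdims=True)         # Eq. (3.26)
    return post.argmax(axis=1), post                    # Eq. (3.25)
```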
BNF representation refers to the output of a designated low-dimension DNN hidden layer, usually referred to as the BN layer [82]. BNFs have been shown to provide a compact and phonetically-discriminative representation of speech, and to suppress linguistically-irrelevant variation, e.g., speaker change [82]. They have been widely applied in conventional acoustic modeling tasks [83–85]. Recently, BNFs were also studied in the zero-resource scenario [30, 31, 33]. In our study, the BNF representation is adopted as the frame-level feature representation for unsupervised subword modeling.

The BNF representation is learned in the second stage of our DNN-BNF modeling framework. In this stage, a supervised DNN model is trained with speech features to predict their corresponding DPGMM labels. The training procedure is similar to that of a conventional DNN acoustic model [3]. The only difference is that DPGMM labels, instead of HMM forced alignments, are used as the supervision for DNN training. During the extraction of BNFs, speech features are fed into the DNN and forward-propagated up to the BN layer. The BNF representation is generated as a subword-discriminative feature representation for the target zero-resource languages.

In principle, the choice of hidden layer structure is flexible: feed-forward, time-delay, convolutional, LSTM, etc. In practice, the feed-forward structure is most commonly used in the concerned task [30, 31], probably because MLP models are easier to train than CNNs and LSTMs with a limited amount of data. In this thesis, an MLP is adopted as the DNN architecture in the baseline system.
Figure 3.3: MTL-DNN model used to extract multilingual BNFs.
If untranscribed speech from multiple zero-resource languages is to be modeled, the DNN model can be trained in a multilingual manner [86]. Past works have shown that multilingual BNFs outperform monolingual ones in representing speech [85]. This can be partially explained by the fact that multilingual DNN-BNF modeling leverages a wider range of phonetic diversity, and that the amount of data for DNN training is increased.

Multilingual BNFs can be generated through multi-task learning DNN (MTL-DNN) modeling. MTL [87] is a commonly adopted strategy for learning multilingual acoustic models [86] and BNFs [30]. The structure of the MTL-DNN adopted in the baseline system is illustrated in Figure 3.3. There are in total $M$ tasks in the MTL-DNN, each corresponding to a target zero-resource language. The hidden layers, including the BN layer, are shared across all tasks, while the output layers are task (language)-specific. The input features for the $M$ languages are merged before training the MTL-DNN. The loss function of the MTL-DNN is the weighted cross-entropy, defined as,

$$\mathcal{L} = \sum_{i=1}^{M} \omega_i\, \mathcal{L}_i(o_{i, 1 \to T}, l_{i, 1 \to T}, \theta), \quad (3.27)$$

where $\omega_i$ and $\mathcal{L}_i$ are the weight and the cross-entropy loss of the $i$-th task, $\theta$ denotes the parameters of the MTL-DNN model, and $o_{i, 1 \to T}$ and $l_{i, 1 \to T}$ are the input speech features and corresponding labels of the $i$-th task. $\mathcal{L}_i$ of an arbitrary $T$-frame speech segment $\{o_{i1}, o_{i2}, \ldots, o_{iT}\}$ is computed as,

$$\mathcal{L}_i = -\sum_{t=1}^{T} \sum_{k=1}^{K} \delta(l_{it} = k) \log P_{\mathrm{DNN}}(k \mid o_{it}), \quad (3.28)$$

where $\delta(\cdot)$ is the indicator function and $P_{\mathrm{DNN}}(k \mid o_{it})$ is the $k$-th element of the MTL-DNN softmax output corresponding to the $i$-th task. The probability distribution $P_{\mathrm{DNN}}(\cdot \mid \cdot)$ is parameterized by the MTL-DNN.

During training, given a pair of input feature $o_{it}$ and its DPGMM label $l_{it}$, the shared hidden layers and the output layer corresponding to the $i$-th language are updated, while the other output layers are unchanged. After training, the layers of the MTL-DNN inside the dashed box in Figure 3.3 are discarded, and the remaining structure is used as the feature extractor to generate multilingual BNFs.
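The sketch below illustrates the MTL-DNN and the weighted cross-entropy of Equations (3.27)-(3.28) in PyTorch. The layer sizes follow the baseline configuration given in Section 3.6.1; everything else (class and function names, batching) is illustrative rather than the exact recipe used in the thesis.

```python
# MTL-DNN with a shared trunk, a linear BN layer and per-language softmax
# heads, plus the weighted cross-entropy loss of Eqs. (3.27)-(3.28).
import torch
import torch.nn as nn

class MTLDNN(nn.Module):
    def __init__(self, in_dim, n_clusters_per_lang, bn_dim=40, hid=1024):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hid), nn.Sigmoid(),
            nn.Linear(hid, hid), nn.Sigmoid(),
            nn.Linear(hid, hid), nn.Sigmoid(),
            nn.Linear(hid, hid), nn.Sigmoid(),
            nn.Linear(hid, bn_dim),            # linear BN layer (the BNF)
            nn.Linear(bn_dim, hid), nn.Sigmoid(),
        )
        # one output head per language; its size is the DPGMM cluster count
        self.heads = nn.ModuleList(nn.Linear(hid, k) for k in n_clusters_per_lang)

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

    def extract_bnf(self, x):
        # forward only up to (and including) the linear BN layer
        return self.shared[:9](x)

def mtl_loss(model, batches, task_weights):
    # batches: list of (features, dpgmm_labels) tuples, one per language
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(model(x, i), y)          # Eqs. (3.27)-(3.28)
               for i, ((x, y), w) in enumerate(zip(batches, task_weights)))
```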
The database of ZeroSpeech 2017 Track 1 [14] consists of development data and surprise data. The development data are provided with open-source evaluation software and ground-truth phoneme alignment information, which enables immediate performance evaluation. The surprise data are not provided with such software and information publicly. In this thesis, experiments are conducted on the development data.

The development data of ZeroSpeech 2017 cover three target languages, namely English, French and Mandarin. Although the ultimate goal of our research is to develop systems that perform well on real-world zero-resource languages, it is nevertheless reasonable to adopt resource-rich languages for experimental purposes, provided that transcriptions and linguistic knowledge for these languages are assumed unavailable during the training phase.

For each of the three languages in the ZeroSpeech 2017 development data, there are separate training and test sets of untranscribed speech. Speaker identity information is provided for the training sets but not for the test sets. The test data are organized into subsets of different utterance lengths: 1 second, 10 seconds and 120 seconds. Detailed information about the dataset is given in Table 3.1.
Table 3.1: Development data of ZeroSpeech 2017 Track 1.

Language    Training                                        Test
            Duration    No. speakers-L    No. speakers-R    Duration
English     45 hrs      60                9                 27 hrs
French      24 hrs      18                10                18 hrs
Mandarin    2.5 hrs     –                 –                 25 hrs

(speakers-L/-R denotes speakers with a rich/limited amount of speech data)

Note that the amounts of training data for the three target languages differ widely. This is intended to measure the degree to which the proposed systems are sensitive to the amount of training data.

The evaluation metric adopted for the ZeroSpeech 2017 Track 1 task is ABX subword discriminability. Inspired by the match-to-sample task in human psychophysics, it is a simple method to measure the discriminability between two categories of speech units [11]. The basic ABX task is to decide whether $X$ belongs to $x$ or $y$, given that $A$ belongs to $x$ and $B$ belongs to $y$, where $A$, $B$ and $X$ are three data samples and $x$ and $y$ are the two pattern categories concerned. The performance evaluation in ZeroSpeech 2017 is carried out on the triphone minimal-pair task. A triphone minimal pair comprises two triphone sequences that have different center phones and identical context phones, for example, "beg"-"bag" and "api"-"ati". Discriminating triphone minimal pairs is a non-trivial task. The performance of a feature representation on the triphone minimal-pair ABX task is considered a good indicator of its efficacy in speech modeling [48].

Let $x$ and $y$ denote a pair of triphone categories. Consider three speech segments $A$, $B$ and $X$, where $A$ and $X$ belong to category $x$ and $B$ belongs to $y$. The ABX discriminability of $x$ from $y$ is measured in terms of the ABX error rate $\epsilon(x, y)$, which is defined as the probability that the distance of $A$ from $X$ is greater than that of
$B$ from $X$, i.e.,

$$\epsilon(x, y) = \frac{1}{|S(x)|\,(|S(x)|-1)\,|S(y)|} \sum_{A \in S(x)} \sum_{B \in S(y)} \sum_{X \in S(x) \setminus \{A\}} \left( \mathbb{1}_{d(A,X) > d(B,X)} + \tfrac{1}{2}\, \mathbb{1}_{d(A,X) = d(B,X)} \right), \quad (3.29)$$

where $S(x)$ and $S(y)$ denote the sets of features representing triphone categories $x$ and $y$, respectively, and $d(\cdot, \cdot)$ denotes the dissimilarity between two speech segments, which is computed by dynamic time warping (DTW) in our study. The frame-level dissimilarity measure used for DTW scoring is the cosine distance. The function $\mathbb{1}_{d(A,X) > d(B,X)}$ has the value 1 if $d(A, X) > d(B, X)$ is satisfied, and 0 otherwise. Note that $\epsilon(x, y)$ is asymmetric in $x$ and $y$; a symmetric form can be defined by averaging $\epsilon(x, y)$ and $\epsilon(y, x)$. The overall ABX error rate is obtained by averaging over all triphone categories and speakers in the test set. A high ABX error rate means that the feature representation is not discriminative, and vice versa. Intuitively, the error rate should be no larger than 50%, as the expected ABX error rate of random decisions is 50%.

There are two evaluation conditions defined in ZeroSpeech 2017, namely within-speaker and across-speaker. In both conditions, the segments $A$ and $B$ to be evaluated are generated by the same speaker. In the within-speaker condition, segment $X$ is generated by the same speaker as $A$ and $B$; in the across-speaker condition, $X$ is generated by a speaker different from $A$ and $B$.
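The sketch below illustrates how the ABX error rate of Equation (3.29) can be computed for one pair of triphone categories. The DTW dissimilarity with frame-level cosine distance is simplified here to a path-length-normalized accumulated cost; the official ZeroSpeech evaluation software is more elaborate, and all function names are illustrative.

```python
# ABX error rate for one category pair, Eq. (3.29), with a simple DTW
# dissimilarity based on frame-level cosine distance.
import numpy as np
from scipy.spatial.distance import cdist

def dtw_cosine(a, b):
    """Normalized DTW cost between segments a (Ta, D) and b (Tb, D)."""
    C = cdist(a, b, metric="cosine")
    Ta, Tb = C.shape
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                            D[i - 1, j - 1])
    return D[Ta, Tb] / (Ta + Tb)      # crude length normalization

def abx_error(S_x, S_y, d=dtw_cosine):
    """S_x, S_y: lists of (T, D) feature arrays for categories x and y."""
    err, cnt = 0.0, 0
    for i, A in enumerate(S_x):
        for B in S_y:
            for j, X in enumerate(S_x):
                if j == i:            # X is drawn from S(x) \ {A}
                    continue
                dA, dB = d(A, X), d(B, X)
                err += float(dA > dB) + 0.5 * float(dA == dB)
                cnt += 1
    return err / cnt                  # cnt = |S(x)| (|S(x)|-1) |S(y)|
```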
Frame labeling by DPGMM clustering is implemented with an open-source tool developed by Chang et al. [50]. For each of the three target languages, MFCC features with cepstral mean normalization (CMN) are first extracted and augmented with $\Delta$ and $\Delta\Delta$ features. The window length and step size for computing MFCCs are 25 ms and 10 ms respectively. The frame-level MFCC features are clustered by the DPGMM algorithm separately for each language, with the number of clustering iterations fixed per language; the resulting numbers of DPGMM clusters differ across English, French and Mandarin. Each speech frame is assigned a DPGMM label.

The MTL-DNN model is trained on MFCCs+CMN spliced with contextual frames. There are three training tasks in MTL, each corresponding to a target language. The shared-hidden-layer (SHL) neural network structure is {1024, 1024, 1024, 1024, 40, 1024}. The dimensions of the language-specific output layers are determined by the outcome of DPGMM clustering. All layers are feed-forward. The activation function for the SHLs is Sigmoid, except for the 40-dimension linear BN layer. The weighted cross-entropy criterion is chosen as the objective function and optimized by stochastic gradient descent (SGD). The three tasks are assigned equal weights in training. A subset of the training data is randomly chosen for cross-validation, and the learning rate is halved whenever no improvement is observed on the cross-validation data. After training, the MTL-DNN is used to extract multilingual BNFs for the test sets of the target languages. The BNFs are then evaluated on the ABX subword discriminability task. The training of the MTL-DNN and the extraction of BNFs are implemented with Kaldi [88].

The experimental results of ABX subword discriminability on multilingual BNFs are shown in Tables 3.2 (across-speaker) and 3.3 (within-speaker). In addition, the ABX performance of raw MFCC features, of a supervised phone posteriorgram (provided by the Challenge organizers [14]), and of the multilingual BNFs proposed in [30] is also listed in the tables for reference.

It can be observed from Tables 3.2 and 3.3 that the multilingual BNF representation obtained with our baseline system achieves significant improvements in both the across- and within-speaker ABX discriminability tasks, as compared to the raw MFCC representation. It is also observed that the multilingual BNFs perform consistently better on the 10s and 120s test lengths than on 1s.
Table 3.2: Across-speaker ABX error rates (%) on the MFCC features, supervised phone posteriorgram and multilingual BNFs of ZeroSpeech 2017.

                                               English           French            Mandarin          Avg.
                                               1s   10s  120s   1s   10s  120s   1s   10s  120s
MFCC [14]                                      –    –    –      –    –    –      –    –    –       –
Phone posteriorgram (supervised topline) [14]  8.6  6.9  6.7    10.6 9.1  8.9    12.0 5.7  5.1     8.2
Multilingual BNF [30]                          13.7 12.1 12.0   17.6 15.6 14.8   12.3 10.8 10.7    13.3
Multilingual BNF                               13.5 12.4 12.4   17.8 16.4 16.1   12.6 11.9 12.0    13.9
Table 3.3: Within-speaker ABX error rates (%) on the MFCC features, supervised phone posteriorgram and multilingual BNFs of ZeroSpeech 2017.

                                               English           French            Mandarin          Avg.
                                               1s   10s  120s   1s   10s  120s   1s   10s  120s
MFCC [14]                                      –    –    –      –    –    –      –    –    –       –
Phone posteriorgram (supervised topline) [14]  6.5  5.3  5.1    8.0  6.8  6.8    9.5  4.2  4.0     6.2
Multilingual BNF [30]                          8.5  7.3  7.2    11.1 9.5  9.4    10.5 8.5  8.4     8.9
Multilingual BNF                               8.0  7.3  7.3    10.3 9.4  9.3    10.1 8.8  8.9     8.8

By contrast, MFCC features perform the same across the different lengths. This can be explained by the fact that the MTL-DNN input features are MFCCs with CMN, and CMN is known to be ineffective for short utterances.

The multilingual BNF representation performs worse than the phone posteriorgram over all test utterance lengths and languages. This is reasonable, as the topline representation assumes transcribed training speech is available during system development. The proposed multilingual BNFs are slightly better than those in [30] under the within-speaker condition, and slightly worse under the across-speaker condition. The system framework proposed in [30] is similar to our baseline. However, since the implementation details are not publicly available, we are not able to reproduce its results.

The baseline system with the multilingual BNF representation shows a significant advance over conventional spectral features, without requiring any in-domain or out-of-domain supervision information, e.g., transcriptions or a phoneme inventory. Nevertheless, the present system design has a few shortcomings. For instance, the MFCC feature as input to DPGMM and the MTL-DNN is not optimal, as it is known to
depend significantly on non-linguistic factors such as speaker, emotion and channel. To address this issue, various approaches can be adopted, including speaker adaptation. Indeed, MFCC features are affected significantly by speaker variation, as can be seen from the performance gap between the across- and within-speaker ABX error rates in Tables 3.2 and 3.3. Speaker adaptation is an important and challenging problem, particularly for acoustic modeling in the zero-resource scenario: while in the supervised scenario reliable transcriptions can be used to ensure the robustness of the learned subword units to speaker change, in the unsupervised scenario subword units can only be inferred from the speech features.

On the other hand, the DPGMM frame labels for DNN-BNF modeling in the baseline system could be improved in several respects. DPGMM clustering assumes that neighboring speech frames are independent of each other, which is not in accordance with the nature of speech. To address this limitation, modeling of temporal dependency could be incorporated into the process of frame label acquisition. Improvement of the frame labels could also be achieved by exploiting out-of-domain language-mismatched ASR systems.

Chapter 4
Speaker adaptation for unsupervised subword modeling
Speaker adaptation is important in unsupervised subword modeling. Speech data usually contain a diverse range of speaker variation, and removing speaker variation is beneficial to subword modeling. This chapter presents various speaker adaptation approaches that are applied to learning speaker-invariant feature representations for unsupervised subword modeling. These approaches include:

(a) fMLLR estimation by an out-of-domain ASR;
(b) disentangled speech representation learning, separating linguistic content and speaker characteristics into disjoint parts of the speech representation;
(c) speaker adversarial training.

They are illustrated by the system flow diagrams in Figures 4.1a, 4.1b and 4.1c. These approaches can be directly applied to the baseline system described in Chapter 3. Generally speaking, all of them can be considered as learning to transform speech features from an original unadapted space to a speaker-adapted space. In terms of the type of transform, the approaches can be classified into two categories, namely, linear transforms and nonlinear transforms.
Figure 4.1: Three speaker adaptation approaches applied to improve subword modeling: (a) fMLLR estimation by out-of-domain ASR; (b) disentangled speech representation learning; (c) speaker adversarial training. In the three sub-figures, the shaded components are the same as in the baseline framework discussed in Chapter 3.
FMLLR is a linear transform-based approach, while disentangled representation learning and speaker adversarial training are both based on DNNs, which realize nonlinear transforms of speech features. As shown in Figure 4.1, fMLLR and disentangled representation learning operate at the front-end level, i.e., on MFCCs or other types of unadapted features, while speaker adversarial training is applied to the back-end DNN-BNF modeling. From the perspective of resource utilization, fMLLR estimation exploits additional out-of-domain resources, i.e., transcribed speech data from resource-rich languages. In contrast, disentangled representation learning and speaker adversarial training do not rely on any out-of-domain resources. A summary of the properties of the three approaches is given in Table 4.1.

Section 4.1 describes fMLLR-based speaker adaptation based on an out-of-domain ASR system. Section 4.2 introduces disentangled speech representation learning. Section 4.3 discusses speaker adversarial training. In Section 4.4, we investigate the combined use of speaker adaptation approaches. Section 4.5 describes the experiments to validate and evaluate the effectiveness of the different approaches, as well as their combination. Section 4.6 summarizes the experimental results and compares the three approaches.
Table 4.1: Summary of properties of the three speaker adaptation approaches. 'OOD' stands for out-of-domain.

Approach ID    Transform type    Front/back-end    Require OOD resource
(a)            Linear            Front             Yes
(b)            Nonlinear         Front             No
(c)            Nonlinear         Back              No

We propose to apply fMLLR-based speaker adaptation to the concerned task. This is motivated by the study achieving the best performance in ZeroSpeech 2017 [32]. Estimation of fMLLR features requires transcriptions. In [32], clustering-based pseudo transcriptions were generated beforehand to enable fMLLR estimation. In this thesis, an out-of-domain language-mismatched ASR system is exploited to estimate fMLLR features for in-domain low-resource speech. For major languages such as English and Chinese, large-scale speech corpora that include hundreds of speakers are available for training high-performance ASR systems [89, 90]. The richness of speaker diversity in these out-of-domain corpora can be leveraged to learn robust feature representations of low-resource speech data.

FMLLR [91, 92] is a feature-based speaker adaptation method. It is effective in improving the speaker invariance of speech features, and has been successfully applied in conventional acoustic modeling for large-vocabulary ASR systems [93–95]. The general idea of fMLLR is to estimate speaker-specific linear transforms that reduce inter-speaker variability. By applying the transforms to MFCCs, the speaker-dependent features are mapped to a speaker-independent feature space. The transformed feature representation is referred to as fMLLR.
Figure 4.2: Process flow of utilizing an out-of-domain language-mismatched ASR to estimate in-domain fMLLR features.

Figure 4.2 depicts the process flow of utilizing an out-of-domain language-mismatched ASR to estimate in-domain fMLLR features. Given the out-of-domain training speech and transcriptions, speaker-independent GMM-HMM (SI-GMM-HMM) AMs are estimated with MFCCs as input features. The AMs are used to force-align the out-of-domain training data, providing the supervision required for applying vocal tract length normalization (VTLN) [47], linear discriminant analysis (LDA) [96], maximum likelihood linear transforms (MLLT) [97] and fMLLR in a sequential manner. Subsequently, the fMLLR features are used to train speaker-adapted GMM-HMM AMs (SA-GMM-HMM). An LM is trained on the transcriptions. The SA-GMM-HMM AMs and the LM constitute the out-of-domain ASR system. This ASR system decodes the in-domain utterances and finds the best path for each utterance. Each path comprises a sequence of phone labels with time boundary information. The best paths are regarded as alignments and are used to estimate fMLLR features for the in-domain data.
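For illustration, the sketch below applies an already-estimated fMLLR transform to unadapted features, assuming the common affine parameterization $W = [A\;\; b]$ used in Kaldi-style pipelines, so that a frame $o$ is mapped to $Ao + b$. The estimation of $W$ from the decoded alignments is not shown, and all names are illustrative.

```python
# Applying a per-speaker fMLLR transform to unadapted frames (sketch).
import numpy as np

def apply_fmllr(features, W):
    # features: (T, D) unadapted frames of one speaker; W: (D, D+1) = [A b]
    A, b = W[:, :-1], W[:, -1]
    return features @ A.T + b

# hypothetical usage, with `speaker_transforms` mapping speaker id -> W:
# adapted = apply_fmllr(mfcc_frames, speaker_transforms[spk_id])
```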
Speaker-invariant features can be generated via disentangled representation learning and transformation. The main idea is to separate linguistically irrelevant factors, e.g., speaker and emotion, from linguistic information, e.g., phonemes, which are simultaneously encoded in the speech signal. Subsequently, a frame-level representation is constructed that mainly embeds the linguistic part.

Separating linguistic content and speaker characteristics in speech is a non-trivial task in the zero-resource scenario [15]. In this study, we adopt the following assumption as proposed in [98]:

• Speaker characteristics tend to have a smaller amount of variation than linguistic content within a speech utterance, and linguistic content tends to have similar amounts of variation within and across utterances.

Intuitively and simplistically, this can be understood as the contrast that speaker identity affects, for example, the fundamental frequency (F0) at the utterance level, while linguistic content affects spectral characteristics at the segment level. Based on this assumption, the factorized hierarchical VAE (FHVAE) [99] is adopted for disentangling speaker and linguistic information in speech.

The FHVAE model was first proposed by Hsu et al. [99]. It is an unsupervised generative model extended from the VAE [100]. The FHVAE learns to factorize sequence-level and segment-level attributes of sequential data (e.g., speech) into different latent variables, whereas the VAE learns a single latent representation. In the present study, the sequence- and segment-level attributes refer to speaker characteristics and linguistic content respectively. By discarding or unifying the sequence-level latent representation, speaker-invariant features can be learned with the FHVAE. This is a straightforward approach to learning speaker-invariant features, and its effectiveness has been demonstrated in domain adaptation tasks such as noise-robust ASR [98], distant conversational ASR [101] and dialect identification [102]. This motivates us to apply the approach to learning speaker-invariant features in the unsupervised scenario.
In FHVAE, the generation process of sequential data is formulated by imposing sequence-dependent priors and sequence-independent priors on different sets of variables. The overall structure of FHVAE is illustrated in Figure 4.3a. Following the notations and terminologies in [99], let $z_1$ and $z_2$ denote the latent segment variable and the latent sequence variable respectively, and let $\mu_2$ be the sequence-dependent prior, known as the s-vector. $\theta$ and $\phi$ denote the parameters of the generative and inference models of FHVAEs. Let $D = \{X_i\}_{i=1}^{M}$ denote a speech dataset containing $M$ sequences. A sequence $X_i$ contains $N_i$ speech segments $\{x^{(i,n)}\}_{n=1}^{N_i}$, each with a fixed number of frames. The FHVAE model generates a sequence $X$ from the following random process (the sequence index $i$ of $X_i$ is omitted hereafter when no confusion arises):

(1) $\mu_2$ is drawn from a prior distribution $p_\theta(\mu_2)$ defined as
$$p_\theta(\mu_2) = \mathcal{N}(0, \sigma^2_{\mu_2} I); \quad (4.1)$$

(2) $z_1^n$ and $z_2^n$ are drawn from
$$p_\theta(z_1^n) = \mathcal{N}(0, \sigma^2_{z_1} I), \quad (4.2)$$
$$p_\theta(z_2^n \mid \mu_2) = \mathcal{N}(\mu_2, \sigma^2_{z_2} I); \quad (4.3)$$

(3) The segment $x^n$ is drawn from
$$p_\theta(x^n \mid z_1^n, z_2^n) = \mathcal{N}\big(f_{\mu_x}(z_1^n, z_2^n), \mathrm{diag}(f_{\sigma^2_x}(z_1^n, z_2^n))\big). \quad (4.4)$$

Here $\mathcal{N}$ denotes the normal distribution, and $f_{\mu_x}(\cdot, \cdot)$ and $f_{\sigma^2_x}(\cdot, \cdot)$ are parameterized by DNN models. A graphical illustration of the FHVAE generative model $\theta$ is given in Figure 4.3b. The joint probability for the generation of sequence $X$ is formulated as,

$$p_\theta(\mu_2) \prod_{n=1}^{N} p_\theta(z_1^n)\, p_\theta(z_2^n \mid \mu_2)\, p_\theta(x^n \mid z_1^n, z_2^n). \quad (4.5)$$
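The toy sketch below mirrors the generative process of Equations (4.1)-(4.4). The decoder is a stand-in linear map and all dimensions and variances are arbitrary; it is meant only to show how the sequence-level s-vector $\mu_2$ ties the per-segment $z_2^n$ together while $z_1^n$ varies freely.

```python
# Toy FHVAE generative process, Eqs. (4.1)-(4.4) (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
D_Z, D_X, N_SEG = 32, 20, 5
W = rng.standard_normal((D_X, 2 * D_Z)) * 0.1   # stand-in decoder weights

mu2 = rng.normal(0.0, 1.0, D_Z)                 # Eq. (4.1): s-vector of one sequence
segments = []
for n in range(N_SEG):
    z1 = rng.normal(0.0, 1.0, D_Z)              # Eq. (4.2): latent segment variable
    z2 = rng.normal(mu2, 0.25, D_Z)             # Eq. (4.3): latent sequence variable
    mean_x = W @ np.concatenate([z1, z2])       # Eq. (4.4): stand-in decoder mean
    segments.append(rng.normal(mean_x, 0.1))    # draw the observed segment
```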
Figure 4.3: Overall structure and graphical illustration of the FHVAE generative and inference models: (a) overall structure; (b) generative model; (c) inference model.
In FHVAE, exact posterior inference is intractable. Therefore an inference model $q_\phi(\cdot \mid X)$ is introduced to approximate the true posterior $p_\theta(\cdot \mid X)$ as follows,

$$q_\phi(\mu_2) \prod_{n=1}^{N} q_\phi(z_2^n \mid x^n)\, q_\phi(z_1^n \mid x^n, z_2^n), \quad (4.6)$$

where $q_\phi(\mu_2)$, $q_\phi(z_2^n \mid x^n)$ and $q_\phi(z_1^n \mid x^n, z_2^n)$ are defined as,

$$q_\phi(\mu_2) = \mathcal{N}\big(g_{\mu_2}(i), \sigma^2_{\tilde{\mu}_2} I\big), \quad (4.7)$$
$$q_\phi(z_2^n \mid x^n) = \mathcal{N}\big(g_{\mu_{z_2}}(x^n), \mathrm{diag}(g_{\sigma^2_{z_2}}(x^n))\big), \quad (4.8)$$
$$q_\phi(z_1^n \mid x^n, z_2^n) = \mathcal{N}\big(g_{\mu_{z_1}}(x^n, z_2^n), \mathrm{diag}(g_{\sigma^2_{z_1}}(x^n, z_2^n))\big). \quad (4.9)$$

$g_{\mu_{z_2}}(\cdot)$, $g_{\sigma^2_{z_2}}(\cdot)$, $g_{\mu_{z_1}}(\cdot, \cdot)$ and $g_{\sigma^2_{z_1}}(\cdot, \cdot)$ are parameterized by two DNNs. For $q_\phi(\mu_2)$, a trainable lookup table containing the posterior mean of $\mu_2$ for each sequence is updated during FHVAE training. For unseen test sequences, maximum a posteriori (MAP) estimation is used to infer $\mu_2$. Details of $\mu_2$ estimation can be found in [99]. A graphical illustration of the FHVAE inference model $\phi$ is given in Figure 4.3c.

The training of the FHVAE is done by maximizing the discriminative segmental
variational lower bound $\mathcal{L}(\theta, \phi; x^{(i,n)})$ [99], which is defined as,

$$\mathbb{E}_{q_\phi(z_1^{(i,n)}, z_2^{(i,n)} \mid x^{(i,n)})}\big[\log p_\theta(x^{(i,n)} \mid z_1^{(i,n)}, z_2^{(i,n)})\big] - \mathbb{E}_{q_\phi(z_2^{(i,n)} \mid x^{(i,n)})}\big[\mathrm{KL}\big(q_\phi(z_1^{(i,n)} \mid x^{(i,n)}, z_2^{(i,n)}) \,\|\, p_\theta(z_1^{(i,n)})\big)\big] - \mathrm{KL}\big(q_\phi(z_2^{(i,n)} \mid x^{(i,n)}) \,\|\, p_\theta(z_2^{(i,n)} \mid \tilde{\mu}_2^i)\big) + \frac{1}{N_i} \log p_\theta(\tilde{\mu}_2^i) + \alpha \log p(i \mid z_2^{(i,n)}),$$

where $\tilde{\mu}_2^i$ denotes the posterior mean of $\mu_2$ for the $i$-th sequence, and $\alpha$ denotes the discriminative weight. The discriminative objective $\log p(i \mid z_2^{(i,n)})$ is formulated as,

$$\log p(i \mid z_2^{(i,n)}) := \log p_\theta(z_2^{(i,n)} \mid \tilde{\mu}_2^i) - \log \sum_{j=1}^{M} p_\theta(z_2^{(i,n)} \mid \tilde{\mu}_2^j). \quad (4.10)$$

After FHVAE training, $z_2$ is expected to encode factors that are relatively consistent within a sequence. The discriminative objective encourages $z_2$ to capture sequence-dependent information, instead of collapsing to a trivial value [99]. $z_1$ encodes the residual factors, which are sequence-independent.

In this work, the FHVAE is applied to learn speaker-invariant features in an unsupervised manner. The input data used to train the FHVAE are frame-level spectral features, e.g., MFCCs. Prior to training, the feature sequences of all utterances from the same speaker are concatenated into a single sequence. With this arrangement, $z_2$ is expected to encode largely speaker identity information and carry little phonetic information, while $z_1$ is expected to encode the residual information, primarily related to linguistic content.

Two different methods of FHVAE-based speaker-invariant feature learning can be performed. In the first method, the latent segment variables $\{z_1^{(i,n)}\}$ inferred from Equation (4.9) are treated as the learned representation.

In the second method, the feature representation is reconstructed based on a unified s-vector. After FHVAE training, a representative speaker is selected from the dataset, whose s-vector is denoted as $\mu_2^*$. Then, for an arbitrary speaker $i$, the latent sequence variable $z_2^{(i,n)}$ inferred from Equation (4.8)
is modified to $\hat{z}_2^{(i,n)}$ by applying a linear transformation as shown below,

$$\hat{z}_2^{(i,n)} = z_2^{(i,n)} - \mu_2^i + \mu_2^*, \quad (4.11)$$

where $\mu_2^i$ is the s-vector of speaker $i$. Finally, the FHVAE decoder (generative model) reconstructs the speech segment $\hat{x}^{(i,n)}$ based on $p_\theta(\hat{x}^{(i,n)} \mid z_1^{(i,n)}, \hat{z}_2^{(i,n)})$ (Equation (4.4)). The reconstructed features $\{\hat{x}^{(i,n)}\}$ are used as the learned feature representation. This method is named s-vector unification. Compared to the original features, the reconstructed features are expected to retain the linguistic content while carrying the representative speaker's characteristics. In other words, speech synthesized from $\{\hat{x}^{(i,n)}\}$ would sound as if it were all spoken by the representative speaker.

In essence, Equation (4.11) adds a universal bias $\mu_2^*$ to $z_2^{(i,n)}$ for all speech segments of all speakers. One could also choose to set the universal bias to $0$, or to any constant vector. However, as will be discussed in Section 4.5.2, the choice of $\mu_2^*$ affects the quality of the reconstructed MFCC features.
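A minimal sketch of s-vector unification (Equation (4.11)) follows, assuming a trained FHVAE whose decoder implements Equation (4.4); `decoder` and the other names are placeholders, not the tool used in this thesis.

```python
# s-vector unification, Eq. (4.11): shift each speaker's latent sequence
# variables to the representative speaker's s-vector, then decode.
import numpy as np

def unify_and_reconstruct(z1, z2, mu_i, mu_star, decoder):
    # z1, z2: (N, D) latent variables of one speaker's segments
    # mu_i: this speaker's s-vector; mu_star: representative speaker's s-vector
    z2_hat = z2 - mu_i + mu_star          # Eq. (4.11)
    return decoder(z1, z2_hat)            # reconstructed segments x_hat
```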
Speaker adversarial training [103] is realized with the adversarial multi-task learning (AMTL) architecture proposed in [104]. The main idea is to place a speaker classification task on top of the hidden layers and reversely back-propagate the classification error, such that the lower layers are guided to extract representations irrelevant to the speaker.

Figure 4.4 shows the architecture of a speaker adversarial training model. The model comprises a subword classification network ($M_p$), a speaker classification network ($M_s$) and a shared-hidden-layer feature extractor ($M_h$). This architecture is similar to the MTL-DNN used in multilingual DNN ASR [86]. The difference between AMTL and MTL lies in how the learning error is propagated from $M_s$ to $M_h$. In AMTL, the speaker classification error is back-propagated after multiplication with a negative coefficient named the adversarial weight. By doing so, the output layer of $M_h$ is forced to learn speaker-invariant features so as to confuse $M_s$, while $M_s$ aims to correctly classify the outputs of $M_h$ into the corresponding speakers. Meanwhile, $M_p$ learns to predict subword-like labels, i.e., the DPGMM frame labels, and back-propagates the
Figure 4.4: Speaker adversarial training based on AMTL.
subword classification error to $M_h$ in the usual way. As a result, the output of $M_h$ gives a speaker-invariant and subword-discriminative representation of the input speech. The output dimension of $M_h$ is usually set to a small value to enable BNF extraction from $M_h$.

Let $\theta_p$, $\theta_s$ and $\theta_h$ denote the network parameters of $M_p$, $M_s$ and $M_h$, respectively. Using the stochastic gradient descent (SGD) algorithm, these parameters are updated as,

$$\theta_p \leftarrow \theta_p - \delta \frac{\partial \mathcal{L}_p}{\partial \theta_p}, \quad (4.12)$$
$$\theta_s \leftarrow \theta_s - \delta \frac{\partial \mathcal{L}_s}{\partial \theta_s}, \quad (4.13)$$
$$\theta_h \leftarrow \theta_h - \delta \left[ \frac{\partial \mathcal{L}_p}{\partial \theta_h} - \lambda \frac{\partial \mathcal{L}_s}{\partial \theta_h} \right], \quad (4.14)$$

where $\delta$ is the learning rate, and $\mathcal{L}_p$ and $\mathcal{L}_s$ are the loss values of the subword and speaker classification tasks respectively, for both of which cross-entropy is adopted. A gradient reversal layer (GRL) [104] is placed between $M_h$ and $M_s$. The GRL acts as an identity transform during forward propagation and reverses the sign of the gradient during back-propagation.

There are two main reasons why speaker adversarial training is adopted in this work. First, the idea of adversarial training is straightforward yet effective in various speech-related tasks, such as domain-invariant ASR [103, 105] and language recognition [106], and it only requires domain labels as supervision, i.e., speaker labels in the case of speaker adversarial training. Second, speaker adversarial training is potentially complementary to the two adaptation methods previously presented in this chapter: fMLLR and disentangled representation learning are applied to the DNN-BNF framework at the front-end, while speaker adversarial training is applied at the back-end. It is therefore interesting to study the effectiveness of combining these methods.
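The GRL can be written in a few lines; the PyTorch sketch below is an illustration under the assumption of the update rule in Equation (4.14), not the exact implementation used in this work.

```python
# Gradient reversal layer (GRL): identity in the forward pass, multiplies the
# gradient by -lambda in the backward pass, realizing Eq. (4.14).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # reverse and scale the gradient flowing back into M_h
        return -ctx.lam * grad_output, None

def grl(x, lam=0.5):
    return GradReverse.apply(x, lam)

# hypothetical usage: speaker_logits = M_s(grl(M_h(features), lam))
```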
Figure 4.5: An illustration of combining speaker adaptation approaches.
In this section, combinations of the speaker adaptation approaches described in the previous sections are investigated. Specifically, speaker adversarial training is jointly used with either out-of-domain fMLLR estimation or disentangled speech representation learning. An illustration of combining these approaches is given in Figure 4.5: the input features to DPGMM and speaker AMTL are replaced by fMLLRs or reconstructed MFCCs. It is expected that, with this replacement, speaker adversarial training can be further improved.
Experiments are carried out to validate the effectiveness of the speaker adaptation approaches in improving unsupervised subword modeling. The experiments are conducted on the ZeroSpeech 2017 development database, using the ABX discriminability task.
Out-of-domain ASR system
A Cantonese ASR system is used as the out-of-domain language-mismatched ASR. Cantonese is a major Chinese dialect widely spoken in Hong Kong, Macau, the southern part of mainland China and many overseas Chinese communities [107]. There is good reason to select Cantonese as the out-of-domain resource-rich language, despite the fact that Mandarin, one of the target languages in ZeroSpeech 2017, and Cantonese are both Chinese languages: the two are considered largely different in terms of acoustic-phonetic properties, and they are mutually unintelligible.

The Cantonese ASR system is trained on CUSENT, a read-speech corpus developed by the Chinese University of Hong Kong [90]. CUSENT contains training utterances from both male and female speakers. The ASR system comprises a context-dependent GMM-HMM (CD-GMM-HMM) AM with speaker adaptive training (CD-GMM-HMM-SAT), and a syllable trigram LM. In the AM, each phoneme (including a silence phoneme) is modeled by a 3-state HMM. The input features to the AM are fMLLRs derived by sequentially applying VTLN, LDA, MLLT and fMLLR transforms to MFCCs+$\Delta$+$\Delta\Delta$ with CMN. The AM is trained using Kaldi [88]. The LM is trained on the CUSENT transcriptions using SRILM [108].

FMLLR estimation of target speech
The Cantonese ASR is used to perform fMLLR-based speaker adaptation of the target zero-resource speech on the MFCC features, in a two-pass procedure. In the first pass, input speech utterances are decoded by the ASR in a speaker-independent manner using unadapted features, from which initial fMLLR transforms are estimated. In the second pass, the input speech is decoded with the initial fMLLRs in a speaker-adaptive manner. After this decoding, the final fMLLR transforms for the target speech utterances are estimated. The ASR decoding result depends on the relative weighting of the AM and LM. In our experiments, the LM carries a very small weight, such that the decoding result mainly reflects the acoustic properties of the target speech.
Frame labeling and MTL-DNN training
After obtaining fMLLR features for the three target languages, DPGMM clustering is applied to generate frame-level labels. Clustering is run separately on the English, French and Mandarin corpora, and the resulting numbers of DPGMM clusters determine the label inventories. Each frame is assigned a DPGMM label.

The MTL-DNN model is trained on the fMLLR features spliced with contextual frames. There are three tasks involved in MTL, each corresponding to a target language. The DNN structure is the same as that of the baseline system described in Section 3.6.1, i.e., an MLP with 1024-dimension feed-forward layers and a 40-dimension linear BN layer located at the second topmost layer. The dimensions of the language-specific output layers are determined by the DPGMM cluster counts. The objective function is the weighted cross-entropy, optimized by SGD. After MTL-DNN training, the model is used to extract multilingual BNFs for the test sets of the target languages.

Results and analyses
Experimental results of applying fMLLR features estimated by the Cantonese ASR are summarized in Tables 4.2 (across-speaker) and 4.3 (within-speaker). Each table contains two system groups, marked as A and B. Group A comprises systems without MTL-DNN training, which are thus developed without using the ZeroSpeech 2017 training data. The second row in A denotes fMLLR features obtained by our proposed method. The third row in A was reported by Shibata et al. [1], who utilized a Japanese ASR for fMLLR estimation. Group B contains two systems generating multilingual BNFs, trained with different input feature representations. For MUBNF0, the inputs to DPGMM frame labeling and MTL-DNN modeling are raw MFCCs; this system was reported as the baseline in Section 3.6.2. For MUBNF, the inputs to DPGMM and the MTL-DNN are the fMLLRs estimated by the Cantonese ASR. From Tables 4.2 and 4.3, several observations can be made.
Table 4.2: Across-speaker ABX error rates (%) on raw MFCCs, fMLLRs by OOD ASRs and multilingual BNFs.

                               English           French            Mandarin          Avg.
                               1s   10s  120s   1s   10s  120s   1s   10s  120s
A  MFCC [14]                   –    –    –      –    –    –      –    –    –       –
   FMLLR by OOD ASR            –    –    –      –    –    –      –    –    –       –
   FMLLR by OOD ASR [1]        –    –    –      –    –    –      –    –    –       –
B  MUBNF0 (Section 3.6.2)      13.5 12.4 12.4   17.8 16.4 16.1   12.6 11.9 12.0    13.9
   MUBNF                       –    –    –      –    –    –      –    –    –       –

Table 4.3: Within-speaker ABX error rates (%) on raw MFCCs, fMLLRs by OOD ASRs and multilingual BNFs.

                               English           French            Mandarin          Avg.
                               1s   10s  120s   1s   10s  120s   1s   10s  120s
A  MFCC [14]                   –    –    –      –    –    –      –    –    –       –
   FMLLR by OOD ASR            –    –    –      –    –    –      –    –    –       –
   FMLLR by OOD ASR [1]        –    –    –      –    –    –      –    –    –       –
B  MUBNF0 (Section 3.6.2)      8.0  7.3  7.3    10.3 9.4  9.3    10.1 8.8  8.9     8.8
   MUBNF                       –    –    –      –    –    –      –    –    –       –
Figure 4.6: Comparison of fMLLRs obtained by our method and by Shibata et al. [1]. The performance is computed by averaging over all utterance lengths within a target language.

(1) The fMLLR features consistently outperform the MFCC features on all target languages, with substantial relative improvements in both the across-speaker and within-speaker conditions. The results demonstrate that speaker adaptation based on an out-of-domain ASR system is both effective and efficient for unsupervised subword modeling.
Figure 4.7: Comparison of MUBNF0 and MUBNF. '∆MUBNF' denotes the difference in ABX error rate between MUBNF0 and MUBNF. 'EN, FR, MA' are abbreviations of English, French and Mandarin.

Unlike MFCCs, which perform equally across the different length conditions, fMLLRs achieve better performance on longer test utterances. This is reasonable, as fMLLR-based speaker adaptation is widely known to work less well on very short speech.

(2) Our fMLLR features perform slightly better than the fMLLRs in [1]. Note that [1] used a Japanese transcribed dataset to train the out-of-domain ASR, while our system used a Cantonese dataset. It is interesting to see from Figure 4.6 that the two fMLLRs achieve similar ABX error rates on English and French, while our fMLLRs perform much better on Mandarin.
This can be partially explained by the fact that Cantonese and Mandarin are both Chinese languages, which may provide an additional benefit in modeling Mandarin speech, as compared to exploiting Japanese as the out-of-domain language resource.

(3) MUBNF outperforms MUBNF0 to a large extent, especially in the across-speaker test condition, with substantial relative ABX error rate reductions in both conditions. This suggests that speaker adaptation at the input feature level is a critical step in obtaining speaker-invariant and subword-discriminative BNF representations. It is clearly seen from Figure 4.7 that MUBNF consistently outperforms MUBNF0 over all target languages and utterance lengths (except Mandarin 10s). Moreover, Figure 4.7 also reveals different behaviors of MUBNF and MUBNF0 under the 10s/120s test conditions: MUBNF0 performs almost the same on 10s and 120s, whilst MUBNF performs significantly better on 120s. This indicates that fMLLR benefits from increasing the utterance length from 10s to 120s, while CMN does not benefit from increases beyond 10s.

FHVAE setup and parameter tuning
FHVAE model parameters are determined by reference to [98]. The encoder and decoder networks of the FHVAE are both LSTMs, and $z_1$ and $z_2$ have the same dimension. Training data of the three target languages are merged to train the FHVAE. The input features are fixed-length speech segments randomly chosen from the training utterances. The segment length $l$ is determined through a parameter tuning process, discussed in the next paragraph. Each frame is represented by MFCCs with CMN applied at the speaker level. During the inference of the latent segment variable $z_1$ and of the reconstructed MFCC features, the input segments are shifted by one frame; to match the length of the extracted features with the original MFCCs, the first and last frames are padded. Adam [109] is used to train the FHVAE. A subset of the training data is randomly selected for cross-validation, and the rest is used for training. The training process is terminated if the lower bound on the cross-validation set does not improve for a number of consecutive epochs. The open-source tool of [99] is used to train FHVAEs.
In our preliminary experiments, the ABX performance of $z_1$ was found to be sensitive to the input segment length $l$. This can be explained as follows: too large an $l$ reduces the capability of $z_1$ in modeling linguistic content at the subword level, while too small an $l$ restricts the FHVAE from capturing sufficient temporal dependencies, which are essential in modeling speech. ABX error rates on $z_1$ for different values of $l$ are shown in Figure 4.8. As a reference, the ABX error rates on raw MFCCs given by the Challenge organizers [14] are also shown as dash-dotted lines. An intermediate segment length gives the best performance; this value of $l$ is fixed for the remaining experiments.

Figure 4.8: ABX error rates on $z_1$ with different segment lengths, and on raw MFCCs. The performance is computed by averaging over all languages.

Selecting the representative speaker for reconstructed MFCCs
As mentioned in Section 4.2.2, the extraction of reconstructed MFCCs $\{\hat{x}\}$ using s-vector unification assumes a pre-defined representative speaker. In order to validate the generalization ability of the proposed s-vector unification method and to evaluate its sensitivity to the gender of the representative speaker, English speakers {s0107, s3020, s4018, s0019, s1724, s2544}, French speakers {M02R, M03R, F01R, F02R} and Mandarin speakers {A08, C04}, i.e., 12 speakers in total, are randomly chosen from the 'speaker-R' sets (defined in Table 3.1) of the ZeroSpeech 2017 training data. These 12 speakers constitute the candidates for the representative speaker in our experiments. The first half of the speakers in each language set are male and the second half female. (Gender information is not released with the ZeroSpeech database; we acquired it via listening.)
Figure 4.9: ABX error rates (%) on reconstructed MFCCs $\hat{x}$ using s-vector unification, and on the latent segment variable $z_1$. Each bar corresponds to a representative speaker. The performance is computed by averaging over all languages.

Each time reconstructed MFCCs $\{\hat{x}\}$ are extracted, a representative speaker is chosen from the 12 candidates, and the s-vector corresponding to this speaker is taken as $\mu_2^*$. During the extraction of $\{\hat{x}\}$, the s-vectors $\{\mu_2^i\}$ of all three target languages' utterances are modified to $\mu_2^*$. After extraction, there are in total 12 groups of $\{\hat{x}\}$, which are evaluated on the ABX discriminability task. The results are shown in Figure 4.9.

It can be observed from Figure 4.9 that in the across-speaker condition, $\{\hat{x}\}$ outperform $\{z_1\}$ regardless of which of the 12 speakers is chosen as the representative. In the within-speaker condition, $\{\hat{x}\}$ perform slightly better than $\{z_1\}$ in most cases if the representative speaker is male, and worse than $\{z_1\}$ in all female cases. It can be concluded that selecting a male representative speaker is generally better for extracting speaker-invariant reconstructed MFCC features. The male speaker s4018 achieves the best overall performance. Further studies are required to explore why male speakers are more suitable than female speakers for s-vector unification.

Frame labeling and MTL-DNN training
Figure 4.10 summarizes the frame labeling and MTL-DNN training procedure in our experiments. Frame labels are generated by DPGMM clustering on the reconstructed MFCC features $\{\hat{x}\}$ appended with $\Delta$ and $\Delta\Delta$ features.
Figure 4.10: An illustration of applying speaker-invariant features extracted from an FHVAE in DPGMM and MTL-DNN.
The representative speaker for extracting $\{\hat{x}\}$ is selected from the 12 candidates. DPGMM clustering is then run separately for English, French and Mandarin.

The MTL-DNN model is trained with three equally-weighted tasks corresponding to the three target languages. The input features to the MTL-DNN are either the latent segment variables $\{z_1\}$ or reconstructed MFCCs $\{\tilde{x}\}$. Note that $\{\tilde{x}\}$ is slightly different from $\{\hat{x}\}$: during the inference of $\{\tilde{x}\}$ for the training sets, s-vector unification is not applied, while during the inference for the test sets, s-vector unification is applied within every test subset with a subset-specific $\mu_2^*$. The reason is that the MTL-DNN trained with $\{\tilde{x}\}$ was found to outperform that trained with $\{\hat{x}\}$. The layer-wise structure, objective function and training strategies of the MTL-DNN are the same as those of the baseline system described in Section 3.6.1. After MTL-DNN training, the model is used to extract 40-dimension multilingual BNFs of the test sets for ABX evaluation.

Results and analyses
Experimental results on multilingual BNFs trained with and without FHVAE-based speaker-invariant features are summarized in Tables 4.4 (across-speaker) and 4.5 (within-speaker). In these tables, each row represents a system, whose input features to DPGMM clustering and to MTL-DNN modeling are listed in the second and third columns respectively. The baseline system using raw MFCC features as inputs to both DPGMM and the MTL-DNN, and the system exploiting the Cantonese ASR for fMLLR estimation, are shown in the first two rows. A similar work by Chen et al. [30], which utilized VTLN-processed MFCCs as inputs to DPGMM, is also listed for reference. The systems adopting FHVAE-based speaker-invariant features are indexed as ①, ② and ③. '$\hat{x}$-s0107' and '$\hat{x}$-s4018' denote reconstructed MFCCs with speaker s0107 and speaker s4018 as the representative speaker respectively. Here, $\hat{x}$-s4018 represents the ideal scenario, as s4018 performs the best among all representative candidates (shown in Figure 4.9), while $\hat{x}$-s0107 represents the general scenario, as s0107 performs moderately among the male candidates.
Table 4.4: Across-speaker ABX error rates (%) on multilingual BNFs trained with/without FHVAE-based speaker-invariant features.

Ref.        Input feature (DPGMM / DNN)             English           French            Mandarin          Avg.
                                                    1s   10s  120s   1s   10s  120s   1s   10s  120s
Sec. 3.6.2  MFCC / MFCC                             13.5 12.4 12.4   17.8 16.4 16.1   12.6 11.9 12.0    13.9
Sec. 4.5.1  fMLLR / fMLLR                           –    –    –      –    –    –      –    –    –       –
[30]        MFCC / FB+F0                            –    –    –      –    –    –      –    –    –       –
            +VTLN / FB+F0                           –    –    –      –    –    –      –    –    –       –
①           MFCC / $z_1$                            –    –    –      –    –    –      –    –    –       –
            MFCC / $\tilde{x}$                      –    –    –      –    –    –      –    –    –       –
②           $\hat{x}$-s0107 / $z_1$                 –    –    –      –    –    –      –    –    –       –
            $\hat{x}$-s0107 / $\tilde{x}$           –    –    –      –    –    –      –    –    –       –
③           $\hat{x}$-s4018 / $z_1$                 –    –    –      –    –    –      –    –    –       –
            $\hat{x}$-s4018 / $\tilde{x}$           –    –    –      –    –    –      –    –    –       –

Table 4.5: Within-speaker ABX error rates (%) on multilingual BNFs trained with/without FHVAE-based speaker-invariant features.

Ref.        Input feature (DPGMM / DNN)             English           French            Mandarin          Avg.
                                                    1s   10s  120s   1s   10s  120s   1s   10s  120s
Sec. 3.6.2  MFCC / MFCC                             8.0  7.3  7.3    10.3 9.4  9.3    10.1 8.8  8.9     8.8
Sec. 4.5.1  fMLLR / fMLLR                           –    –    –      –    –    –      –    –    –       –
[30]        MFCC / FB+F0                            8.5  7.3  7.2    11.1 9.5  9.4    10.5 8.5  8.4     8.9
            +VTLN / FB+F0                           8.5  7.3  7.2    11.2 9.4  9.4    10.5 8.7  8.5     9.0
①           MFCC / $z_1$                            –    –    –      –    –    –      –    –    –       –
            MFCC / $\tilde{x}$                      –    –    –      –    –    –      –    –    –       –
②           $\hat{x}$-s0107 / $z_1$                 –    –    –      –    –    –      –    –    –       –
            $\hat{x}$-s0107 / $\tilde{x}$           –    –    –      –    –    –      –    –    –       –
③           $\hat{x}$-s4018 / $z_1$                 –    –    –      –    –    –      –    –    –       –
            $\hat{x}$-s4018 / $\tilde{x}$           –    –    –      –    –    –      –    –    –       –

From Tables 4.4 and 4.5, several observations can be made.

(1) Comparing Sec. 3.6.2 with ①, it is shown that without improving the input feature representation to DPGMM clustering, the MTL-DNN trained with $z_1$ or $\tilde{x}$ outperforms that trained with raw MFCCs in the across-speaker test
condition. The relative ABX error rate reductions are … for the first line in ① and … for the second line in ①.

(2). Reconstructed MFCC features x̂ significantly outperform the original MFCC features in DPGMM clustering. In the ideal scenario, where s4018 is selected as the representative speaker, comparing the first lines in ③ and ① gives across- and within-speaker relative ABX error rate reductions of … and …, respectively. Even in the general scenario, where s0107 is selected as the representative speaker, comparing the first lines in ② and ① gives across- and within-speaker relative ABX error rate reductions of … and …. These results clearly demonstrate that improving the speaker invariance of the input features to DPGMM clustering is crucial for achieving better BNFs for unsupervised subword modeling, especially in the across-speaker scenario.

(3). The improvement in ABX task performance contributed by improving the DPGMM inputs is more prominent than that from improving the DNN inputs. This observation holds for both the across- and within-speaker conditions. In fact, Figure 4.5 shows that replacing the DNN input features from original MFCCs with either z_1 or x̃ (system Sec. 3.6.2 → ①) does not achieve within-speaker performance improvement if the DPGMM labels are kept unchanged.

(4). Our best system (first line in ③) achieves … and … relative ABX error rate reductions compared to the baseline system reported in Section 3.6.2. This improvement is achieved without exploiting any out-of-domain resources; it is attributed to the better speaker-invariant features learned by FHVAE-based disentangled representation learning. Compared to system Sec. 4.5.1, which utilized Cantonese transcribed speech, the best system is slightly better in the within-speaker condition, while slightly worse in the across-speaker condition.

(5). Speaker-invariant features generated by FHVAEs are better than VTLN in improving unsupervised subword modeling. While our baseline system is inferior to the baseline MFCC system of [30] in the across-speaker condition, our best system consistently outperforms MFCC+VTLN across all target languages and utterance lengths. In the within-speaker condition, our proposed methods also achieve better performance.
Visualization of speaker-invariant features
Apart from evaluating the effectiveness of the proposed FHVAE-based speaker-invariant features by ABX discriminability, we also show two-dimensional visualizations of the learned features to demonstrate their robustness towards speaker variation. To this end, two Mandarin speakers from the ZeroSpeech 2017 training data are randomly selected. For each speaker, frame-level speech features from … seconds of speech utterances (i.e., … frames) are used for t-SNE based two-dimensional visualization, using the open-source tools developed by [110].

The visualization results are shown in Figure 4.11. This figure contains five sub-figures corresponding to different feature representations, namely, the original MFCCs, the original MFCCs with CMN, z_1, z_2 and the reconstructed MFCCs x̂ with representative speaker s4018. Each data point in a sub-figure denotes a speech frame; the two speakers are marked by different colors. Figure 4.11e clearly shows the disentanglement of the speech features towards speaker variation, which is in agreement with our expectation. Figures 4.11a and 4.11b indicate that CMN alleviates the speaker variation encoded in MFCC features, while this variation is still perceptible. In comparison, the reconstructed MFCCs, shown in Figure 4.11d, demonstrate much higher robustness towards speaker variation.

Figure 4.11: T-SNE visualization of frame-level feature representations from two speakers in the ZeroSpeech 2017 Mandarin subset. (a) Original MFCC; (b) Original MFCC with CMN; (c) Latent segment variable z_1; (d) Reconstructed MFCC x̂-s4018; (e) Latent sequence variable z_2.
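For illustration, such a plot can be produced along the following lines with scikit-learn's t-SNE implementation standing in for the tool of [110]; the file paths, array layouts and t-SNE parameters below are placeholder assumptions, not the exact settings used in this thesis.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical loader: each array holds (num_frames, feat_dim) features of
# one feature type (e.g. MFCC, z_1, z_2 or reconstructed MFCC) per speaker.
frames_spk1 = np.load("spk1_feats.npy")  # placeholder path
frames_spk2 = np.load("spk2_feats.npy")  # placeholder path

feats = np.vstack([frames_spk1, frames_spk2])
labels = np.array([0] * len(frames_spk1) + [1] * len(frames_spk2))

# Project the frames to two dimensions with t-SNE.
proj = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(feats)

for spk, color in [(0, "tab:blue"), (1, "tab:orange")]:
    sel = labels == spk
    plt.scatter(proj[sel, 0], proj[sel, 1], s=2, c=color,
                label=f"Speaker {spk + 1}")
plt.legend()
plt.show()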
Experimental setup

The AMTL-DNN architecture [104] is adopted to train multilingual BNFs, as shown in Figure 4.4. There are two types of labels involved in training, namely DPGMM labels and speaker labels. The DPGMM labels are obtained following the baseline experimental settings in Section 3.6.1; the speaker labels are released with the database. The DPGMM labels support the training of M_p, while the speaker labels support M_s. The input features to the AMTL-DNN are …-dimension MFCCs with ∆ and ∆∆, with context size ±…. The layer-wise structure of M_h is …. M_s and M_p both have three sub-networks, each corresponding to a target language. Each sub-network contains a …-dimension feed-forward layer followed by a softmax layer. During AMTL-DNN training, the learning rate decays exponentially from … to …, and the number of epochs is …. The speaker adversarial weight λ ranges from 0 to … with an interval of 0.02. After training, the multilingual BNFs extracted from the output of M_h are evaluated by ABX discriminability.
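Speaker adversarial training of this kind is commonly realized with a gradient reversal layer between the shared layers and the speaker classifier. The following PyTorch sketch shows that mechanism under assumed layer sizes and with a single task head per branch (the thesis uses one sub-network per target language); it is an illustration of the idea, not the exact AMTL-DNN implementation.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse and scale the gradient flowing back into M_h.
        return -ctx.lam * grad_out, None

class AMTLDNN(nn.Module):
    def __init__(self, in_dim, bn_dim, n_units, n_speakers, lam):
        super().__init__()
        self.lam = lam
        self.shared = nn.Sequential(            # M_h: shared hidden layers
            nn.Linear(in_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim))            # bottleneck (BNF) layer
        self.unit_head = nn.Linear(bn_dim, n_units)      # M_p
        self.spk_head = nn.Linear(bn_dim, n_speakers)    # M_s

    def forward(self, x):
        bnf = self.shared(x)
        unit_logits = self.unit_head(bnf)
        # The speaker classifier sees gradient-reversed BNFs, so minimizing
        # its loss pushes the shared layers toward speaker invariance.
        spk_logits = self.spk_head(GradReverse.apply(bnf, self.lam))
        return unit_logits, spk_logits, bnf

Setting lam = 0 disables the reversed branch's influence, which matches the observation below that the λ = 0 system collapses to an MTL-DNN.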
Table 4.6:
Across-speaker ABX error rates (%) on multilingual BNFs extracted from the speaker AMTL-DNN with different adversarial weights λ.

λ       English             French              Mandarin            Avg.
        1s    10s   120s    1s    10s   120s    1s    10s   120s
0       13.1  12.0  12.0    17.9  15.7  15.6    12.2  11.5  11.5    13.50
0.02    13.0  11.9  12.0    17.5  15.6  15.4    12.3  11.3  11.3    13.37
0.04    12.8  11.8  11.8    17.3  15.4  15.3    12.1  11.3  11.2    13.22

It must be noted that the experiments applying the speaker AMTL-DNN are conducted using a different Kaldi implementation from those applying the MTL-DNN in Sections 4.5.1 and 4.5.2.

Results and analyses
Experimental results on multilingual BNFs extracted from the speaker AMTL-DNN are listed in Tables 4.6 (across-speaker) and 4.7 (within-speaker). The tables also contain two reference systems: 'R0' denotes our baseline system introduced in Section 3.6.2, and 'R-F' denotes the best system adopting FHVAE-based disentangled speech representation learning, which is discussed in Section 4.5.2. Note that when λ = 0, speaker adversarial training is not adopted and the AMTL-DNN model collapses to an MTL-DNN.

From the tables, it can be observed that speaker adversarial training could reduce the ABX error rates when λ is set between … and …. The absolute error rate reduction is … in the across-speaker condition and … in the within-speaker condition. The amount of performance improvement is in accordance with the findings of a relevant study [34], even though [34] exploited transcription resources for in-domain English speech utterances during AMTL-DNN training.

In principle, the system with λ = 0 is the same as our baseline system R0. The performance difference is caused by the different implementations of MTL-DNN training: the AMTL-DNN is implemented with Kaldi nnet3, while the MTL-DNN in Sections 4.5.1 and 4.5.2 is implemented with Kaldi nnet1.
Table 4.7:
Within-speaker ABX error rates (%) on multilingual BNFs extracted from the speaker AMTL-DNN with different adversarial weights λ.

λ       English             French              Mandarin            Avg.
        1s    10s   120s    1s    10s   120s    1s    10s   120s
0       7.7   6.9   6.9     10.6  9.0   8.9     10.0  8.7   8.6     8.59
0.02    7.5   6.8   6.9     10.5  9.1   8.9     9.9   8.6   8.4     8.51
0.04    7.5   6.7   6.8     10.2  8.9   8.8     9.9   8.6   8.4     8.42
0.06    7.4   6.7   6.8     10.1  8.8   8.8     9.8   8.5   8.3     8.36

By comparing R-F with the best system adopting speaker adversarial training, it can be observed that although its baseline is inferior, R-F achieves better performance in both the across- and within-speaker conditions.
Experimental setup
For the experiments combining out-of-domain ASR based fMLLR and speaker adversarial training, the input features to DPGMM are …-dimension fMLLR features estimated by a Cantonese ASR, and the input features to the AMTL-DNN are either MFCCs or fMLLRs. For the experiments combining disentangled representation learning and speaker adversarial training, the input features to DPGMM are reconstructed MFCCs with s-vector unification, and the input features to the AMTL-DNN are either original or reconstructed MFCCs. The layer-wise structure of the AMTL-DNN model is kept the same as in Section 4.5.3.

Results and analyses
Experimental results on multilingual BNFs obtained by combining speaker adaptation approaches are summarized in Tables 4.8 (across-speaker) and 4.9 (within-speaker). The tables comprise two groups, marked A and B. Group A contains systems using fMLLR features as inputs to DPGMM based frame labeling, while group B contains systems using reconstructed MFCCs as inputs to frame labeling.
Table 4.8:
Across-speaker ABX error rates (%) on systems combining different speaker adaptation approaches.

Ref.  DPGMM input  DNN input  λ    English             French              Mandarin            Avg.
                                   1s    10s   120s    1s    10s   120s    1s    10s   120s
A     fMLLR        MFCC       0    10.3  9.3   9.2     14.5  12.9  12.8    10.3  9.2   9.1     10.84
A     fMLLR        fMLLR      –    –     –     –       –     –     –       –     –     –       –
B     x̂-s4018      MFCC       0    11.4  9.8   9.8     15.6  13.3  13.1    11.4  9.9   9.9     11.58
B     x̂-s4018      x̃          –    –     –     –       –     –     –       –     –     –       –
Heck et al. [32] (1st in ZRSC17)   10.1  8.7   8.5     13.6  11.7  11.3    8.8   7.4   7.3     9.71

The best submitted system to ZeroSpeech 2017, proposed by Heck et al. [32], is also listed in the tables for reference. It can be observed from Tables 4.8 and 4.9 that:

(1). Speaker adversarial training is complementary to out-of-domain ASR based speaker adaptation, especially in the across-speaker condition. The absolute error rate reductions in the two conditions achieved by adopting fMLLRs as inputs to both DPGMM and AMTL-DNN are … and …, respectively.

(2). Speaker adversarial training has little effect on the systems adopting FHVAE-based disentangled representations. In the within-speaker condition, systems using reconstructed MFCCs in DPGMM frame labeling witness a severe degradation.

(3). The best performance is achieved with fMLLR features as inputs to both DPGMM and AMTL-DNN. The across-/within-speaker ABX error rates are …/….
Table 4.9:
Within-speaker ABX error rates (%) on systems combining different speaker adaptation approaches.

Ref.  DPGMM input  DNN input  λ    English             French              Mandarin            Avg.
                                   1s    10s   120s    1s    10s   120s    1s    10s   120s
A     fMLLR        MFCC       0    6.9   6.1   6.1     9.5   8.2   8.1     9.5   8.1   8.1     7.84
A     fMLLR        fMLLR      –    –     –     –       –     –     –       –     –     –       –
B     x̂-s4018      MFCC       0    7.4   6.2   6.2     9.8   8.4   8.1     10.3  8.4   8.4     8.13
B     x̂-s4018      x̃          –    –     –     –       –     –     –       –     –     –       –
Heck et al. [32] (1st in ZRSC17)   6.9   6.2   6.0     9.7   8.7   8.4     8.8   7.9   7.8     7.82

Our best system is competitive with the challenge winner [32] in the within-speaker condition, while slightly worse in the across-speaker condition.
Three different approaches to speaker adaptation have been investigated for improving the robustness of DNN-BNF features for unsupervised subword modeling. They serve different goals and, at the same time, have different requirements on the adaptation data. fMLLR and disentangled representation learning are applied at the front end to make the input features speaker-invariant for DNN-BNF training and DPGMM frame labeling, whilst speaker adversarial training is applied at the back end to suppress the speaker variation of the BNF representation.
The experimental results show that the techniques applied at the front end are more effective than those at the back end.

The use of fMLLR requires transcribed out-of-domain speech data. As out-of-domain data are virtually unlimited in amount and diversity, the benefit of this approach could be further exploited. Indeed, the experimental results show that fMLLR with out-of-domain data gives better performance than disentangled representation learning and speaker adversarial training without out-of-domain data. It must be noted that adversarial training can also be applied with out-of-domain data (either transcribed or untranscribed); its effectiveness is worth further investigation.
In this section, three speaker adaptation approaches that can be directly applied to our baseline unsupervised subword modeling system were introduced and extensively studied. The three approaches, namely fMLLR estimation by an out-of-domain ASR, disentangled speech representation learning and speaker adversarial training, are investigated individually on the task concerned, and their combination is further studied.

The experiments are conducted using the official database and evaluation metrics of ZeroSpeech 2017. Experimental results demonstrate the effectiveness of all three approaches. The out-of-domain ASR based adaptation achieves the most significant performance improvement over our baseline, while speaker adversarial training achieves the least improvement among the three. Combining out-of-domain ASR based adaptation and adversarial training contributes to further improvement, with which our best performance (…/…) is achieved. In contrast, adversarial training is not complementary to disentangled representation learning.

Chapter 5
Frame labeling in unsupervised subword modeling
In the baseline DNN-BNF system, frame labeling is realized by applying DPGMM clustering to speech frames. DPGMM clustering does not require a pre-defined cluster number, making it suitable for zero-resource speech modeling, and previous studies showed its effectiveness in unsupervised subword modeling [27, 32]. Nevertheless, DPGMM-based frame labeling has two major limitations. First, as neighboring frames are assumed to be independent, contextual information in speech is not taken into account in determining the labels. Second, DPGMM is prone to producing over-fragmented speech units [111, 112].

To address both limitations, methods of improving DPGMM frame labels are developed in this chapter. On one hand, a full-fledged GMM-HMM is trained to facilitate better modeling of contextual information. The transcriptions required for the training are initialized via DPGMM clustering. Following the terminologies used in [46], the resulting model is referred to as
DPGMM-HMM. To alleviate the over-fragmentation problem, a new algorithm is proposed to filter out infrequent labels in the DPGMM clustering results, so as to control the number of inferred clusters. Experimental results reveal that these infrequent labels adversely
affect the performance of unsupervised subword modeling, and that the proposed label filtering algorithm is effective.

This chapter also presents our attempt to leverage the benefits of different types of frame labels. Out-of-domain speech and language resources are exploited to enable language-mismatched frame labeling: such frame labels are obtained from the decoding results of one or multiple out-of-domain ASR systems. It is expected that labels generated by out-of-domain ASR decoding and those from DPGMM-HMM acoustic modeling are complementary and can be jointly used in unsupervised subword modeling.

Section 5.1 describes the DPGMM-HMM frame labeling approach and the proposed label filtering algorithm. The out-of-domain ASR based frame labeling approach, as well as the combined use of different frame label types, is introduced in Section 5.2. Experiments are reported in Section 5.3, followed by the chapter summary in Section 5.4.
The proposed DPGMM-HMM frame labeling approach consists of three stages, i.e., DPGMM clustering, frame label filtering and supervised GMM-HMM acoustic modeling. DPGMM clustering is performed to provide initial labels. These initial labels are processed by a label filtering algorithm to discard infrequent frame clusters. The filtered labels are subsequently used in supervised context-dependent GMM-HMM (CD-GMM-HMM) acoustic modeling. Finally, the CD-GMM-HMM is applied to force-align the target zero-resource speech to generate the desired frame labels. DPGMM clustering is implemented in the same way as discussed in Section 3.2. The label filtering algorithm and DPGMM-HMM acoustic modeling are explained in the following sections.
For a specific target language, let us assume that K Gaussian components (clusters) are obtained by DPGMM clustering. The labels are denoted as l_1, l_2, \ldots, l_N for an utterance with N frames. Let c_k be the number of frames labeled as cluster k,
i.e.,

c_k = \sum_{i=1}^{N} \mathbb{1}(l_i = k), \quad k \in \{1, 2, \ldots, K\},   (5.1)

where \mathbb{1}(\cdot) denotes the indicator function. The elements in \{c_1, c_2, \ldots, c_K\} are sorted in descending order,

\{\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_K \mid \hat{c}_1 \ge \hat{c}_2 \ge \ldots \ge \hat{c}_K\}.   (5.2)

m(\cdot) denotes the index mapping function, i.e.,

\hat{c}_k = c_{m(k)}.   (5.3)

Figure 5.1 shows an example of cluster-size sorting.

Figure 5.1: Example of cluster-size sorting.

Let P be the designated percentage of frame labels to be retained. The relevant frames come from the K_{\text{cut}} "dominant" clusters, where

K_{\text{cut}} = \min\Big\{K' : \frac{\sum_{k=1}^{K'} \hat{c}_k}{N} \ge P\Big\}.   (5.4)

Let O denote the collection of all removed frame labels, i.e.,

O = \{l_i : l_i \in F, \; i \in \{1, 2, \ldots, N\}\},   (5.5)

where

F = \{m(K_{\text{cut}}+1), \ldots, m(K)\}.   (5.6)

F contains the indices of the K - K_{\text{cut}} clusters that occur least frequently.
Speech frames assigned to these clusters are considered outliers. In the extreme case when P is set to 1, F and O are empty sets. The smaller the value of P, the larger the proportion of removed labels. The label filtering algorithm is summarized in Algorithm 5.1.

Algorithm 5.1:
DPGMM label filtering algorithm
Input: l_1, l_2, \ldots, l_N, P
Output: O
1. Calculate c_k by Equation (5.1).
2. Sort \{c_1, c_2, \ldots, c_K\} in descending order.
3. Calculate m(k) by Equation (5.3).
4. Calculate K_{\text{cut}} by Equation (5.4), given P.
5. Select a subset of l_1, l_2, \ldots, l_N as O by Equations (5.5) and (5.6). {These are the frame labels to be removed.}

It is worth noting that filtering out small-sized clusters produced by DPGMM is mainly motivated by practical implementation considerations. The DPGMM algorithm automatically generates variable-sized clusters according to the underlying structure of the input data, so small-sized clusters are meaningful from the perspective of the DPGMM algorithm itself. Nevertheless, in the concerned task, filtering out small-sized clusters is a simple yet effective (as will be shown in the experiments) method to discard less useful labels and improve the overall quality of frame labels.

Each of the DPGMM clusters can be regarded as a phone-like speech unit, or pseudo phone. The sequence of frame labels (after label filtering) can be converted into a pseudo phone transcription by collapsing neighboring identical labels. For example, the sequence of frame labels "1,3,3,3,7,10,10" would lead to the transcription "1,3,7,10". Based on the pseudo transcriptions, DPGMM-HMM acoustic modeling follows the standard supervised training pipeline, i.e., proceeding from monophone model training with uniform time alignment to CD-GMM-HMM training. The CD-GMM-HMM AM is further refined by speaker adaptive training, and the resulting AM is referred to as the DPGMM-HMM AM. The DPGMM-HMM AM is used to produce time alignment information for subsequent MTL-DNN modeling. To be distinguishable from DPGMM labels, the
labels obtained from DPGMM-HMM forced alignment are referred to as DPGMM-HMM labels.
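For illustration, Algorithm 5.1 together with the label-collapsing step can be written in a few lines of Python; the function names are ours.

from collections import Counter
from itertools import groupby

def filter_labels(labels, P):
    """Return the set of cluster indices whose frames are discarded."""
    counts = Counter(labels)                       # c_k, Eq. (5.1)
    ranked = counts.most_common()                  # descending, Eq. (5.2)
    retained, total, k_cut = 0, len(labels), 0
    for k_cut, (_, c) in enumerate(ranked, start=1):
        retained += c
        if retained / total >= P:                  # Eq. (5.4)
            break
    return {cluster for cluster, _ in ranked[k_cut:]}  # F, Eq. (5.6)

def to_pseudo_transcription(labels, removed):
    """Drop filtered labels and collapse repeats into pseudo phones."""
    kept = [l for l in labels if l not in removed]
    return [k for k, _ in groupby(kept)]

labels = [1, 3, 3, 3, 7, 10, 10]
print(to_pseudo_transcription(labels, filter_labels(labels, P=1.0)))
# -> [1, 3, 7, 10], matching the example above (P = 1 removes nothing)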
Frame labeling could also be done with an out-of-domain ASR system, which is typically trained with a large amount of transcribed speech in a resource-rich language. The AM in such an ASR system provides a fine-grained speech representation of its original language. Given a speech utterance in a different language, the ASR system can be applied to assign a language-mismatched state-level or phone-level label to each frame of the utterance. The idea can naturally be extended to using multiple ASR systems, which desirably provide a wide coverage of phonetic diversity. The ASR decoding output depends on the relative weighting of the AM and LM. In this study, the LM is assigned a very small weight, such that the acquired frame labels mainly reflect the acoustic properties of the target speech.
Out-of-domain ASR decoding and DPGMM-HMM acoustic modeling provide two different types of frame labels for the target speech utterances. The DPGMM-HMM labels incorporate statistical information on the acoustic properties of the target speech, while the ASR-decoded labels leverage the phonetic information acquired from other languages. It is expected that they contribute complementarily to subword-discriminative feature learning. In this study, their complementarity is investigated within the multi-task learning (MTL) framework.

The proposed MTL-DNN system is depicted in Figure 5.2. The training involves a total of M + N tasks, covering M zero-resource target languages and N out-of-domain ASR systems. Each of the tasks is represented by a task-specific softmax output layer in the DNN.
Figure 5.2:
MTL-DNN for extracting LI-BNF, MUBNF and OSBNF. "OOD" stands for out-of-domain.

For the zero-resource language tasks, state-level or phone-level DPGMM-HMM labels are used as target labels. The decoding output from each of the out-of-domain ASR systems provides one set of labels.

If the MTL-DNN is trained only on the M target language tasks, the extracted BNFs are referred to as multilingual unsupervised BNFs (MUBNFs). When out-of-domain ASR tasks are added, the BNFs are named language-independent BNFs (LI-BNFs). In the case that only the out-of-domain ASR tasks are involved, the extracted BNFs are referred to as out-of-domain supervised BNFs (OSBNFs).

For the shared-hidden-layer structure of the MTL-DNN, the multi-layer perceptron (MLP) has been commonly used [30, 31, 34, 113]. In this study, in addition to MLP, we investigate the use of long short-term memory (LSTM) [114] and bi-directional LSTM (BLSTM) [115], which were shown to perform better than MLP in conventional supervised acoustic modeling.

On the other hand, a BNF representation can also be obtained from a DNN AM pre-trained on a resource-rich language [1]. This is considered a transfer learning approach [116]. The resulting BNF, denoted as transfer learning BNF (TLBNF), is expected to further enrich the feature representation for subword modeling.
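The task layout of Figure 5.2 can be sketched as follows in PyTorch: one softmax head per zero-resource language and one per out-of-domain ASR, all sharing the bottleneck layer. The layer sizes and label counts in the usage example are illustrative assumptions, not the configurations used in this thesis.

import torch.nn as nn

class MultiLabelMTLDNN(nn.Module):
    def __init__(self, in_dim, bn_dim, target_tasks, ood_tasks):
        # target_tasks / ood_tasks: {task_name: num_output_labels}
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim))              # BN layer shared by tasks
        self.heads = nn.ModuleDict(
            {name: nn.Linear(bn_dim, n)
             for name, n in {**target_tasks, **ood_tasks}.items()})

    def forward(self, x):
        bnf = self.shared(x)                      # extracted as the BNF
        return bnf, {name: head(bnf) for name, head in self.heads.items()}

# An LI-BNF2-style configuration: three target-language tasks plus four
# OOD ASR tasks (all dimensions and label counts below are hypothetical).
model = MultiLabelMTLDNN(
    in_dim=440, bn_dim=40,
    target_tasks={"EN": 500, "FR": 500, "MA": 400},
    ood_tasks={"CA": 3000, "CZ": 1500, "HU": 1500, "RU": 1500})

Dropping the ood_tasks argument yields a MUBNF-style model, and dropping target_tasks yields an OSBNF-style model, mirroring the configurations in Table 5.1.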
Figure 5.3:
Complete system framework of multi-label assisted unsupervised subword modeling.
The complete system framework of multi-label assisted unsupervised subword modeling is shown in Figure 5.3. As can be seen, the input features to DPGMM-HMM acoustic modeling and the MTL-DNN are both fMLLR features estimated by an out-of-domain ASR. MTL-DNN training is performed with the multiple types of frame labels described in the previous sections. BNFs are extracted from the trained MTL-DNN and evaluated on the ABX discriminability task. The TLBNF representation is optionally concatenated with the BNFs from the MTL-DNN and evaluated on the same task.
Out-of-domain ASR systems
Four out-of-domain ASR systems are utilized and investigated in our experiments. They cover Cantonese (CA), Czech (CZ), Hungarian (HU) and Russian (RU). The Cantonese ASR system is trained with the settings given in Section 4.5.1. Two sets of AMs, i.e., CD-GMM-HMM-SAT and DNN-HMM models, are investigated. The CD-GMM-HMM-SAT is the same as that in the baseline. The DNN-HMM is trained using Kaldi [88], following the settings used in training the baseline DNN as described in Section 3.6.1. The input features for the DNN-HMM are fMLLRs with contextual splicing {…}, with
the fMLLRs estimated by the CD-GMM-HMM-SAT. The DNN-HMM training labels are acquired from CD-GMM-HMM-SAT time alignment. The DNN-HMM model is a …-layer MLP with the layer configuration (from bottom to top) …. The dimension of the output layer is determined by the number of states in the CD-GMM-HMM-SAT. The layers use sigmoid activations, except for the …-dimension linear BN layer. The network is trained to optimize the cross-entropy criterion.

The CZ, HU and RU phone recognizers were developed by the Brno University of Technology (BUT) [117]. They adopt a three-layer MLP structure, in which the first two are sigmoid layers and the third is a softmax layer. The MLPs were trained on the SpeechDat-E databases [118]. The numbers of modeled phones for CZ, HU and RU are …, … and …, respectively, and the amounts of training data are …, … and … hours, respectively. The cross-entropy criterion was used for MLP training.

DPGMM frame clustering and label filtering
For each target language, the training speech frames are clustered by applying the DPGMM algorithm to …-dimension fMLLR features estimated by the Cantonese ASR (refer to Section 4.5.1). The hyper-parameters of DPGMM clustering are not tuned. The numbers of iterations for English and French are determined by preliminary experiments; more specifically, the number of iterations tested for English is in the range {…} and that for French in the range {…}, and the optimal numbers of iterations were … and …, respectively. For Mandarin, the number of iterations was empirically determined to be …. The resulting numbers of DPGMM clusters for English, French and Mandarin are …, … and …, respectively. After frame clustering, each frame is assigned a cluster label. Figure 5.4 shows the frame labeling results in the form of a cumulative distribution function (CDF) for the three target languages. The clusters are sorted according to their sizes in descending order; the horizontal axis denotes the number of DPGMM clusters and the vertical axis the proportion of frames. Each point (K_i, Q_i) on the CDF curve represents the proportion of frame labels Q_i that the largest K_i clusters cover. For label filtering, we evaluated the value of P in the range from … to …
Figure 5.4:
Clustering results in the form of cumulative distribution functions for the three target languages. Clusters are sorted according to cluster size in descending order.

(step size of …). After filtering, the frame-level label sequences are converted into pseudo transcriptions for the training of the DPGMM-HMM AMs.

DPGMM-HMM acoustic modeling
Supervised training of the DPGMM-HMM AMs is carried out with the pseudo transcriptions. Different from the conventional 3-state HMM topology, a …-state HMM is used to model each pseudo phone during DPGMM-HMM training. This alleviates the problem of unsuccessful forced alignments, as the numbers of pseudo phones for the target languages are significantly larger than the number of linguistically-defined phones in a typical language. The input features for the DPGMM-HMM are …-dimension fMLLRs estimated by the Cantonese ASR. The training proceeds from CI-GMM-HMM to CD-GMM-HMM, followed by VTLN and fMLLR-based SAT (LDA and MLLT are not estimated, as no improvement was found). After training, the numbers of CD-HMM states for English, French and Mandarin are …, … and …, respectively. The training of the DPGMM-HMM AMs is implemented with Kaldi.

MTL-DNN training for BNF generation
The MTL-DNN model is trained with all three target zero-resource languages, from which BNFs are extracted and evaluated on the ABX subword discriminability task.
Table 5.1:
Configurations for the (HMM-)MUBNF, OSBNF and (HMM-)LI-BNF representations. 'EN', 'FR' and 'MA' are abbreviations for English, French and Mandarin.
Task labels from:   DPGMM          DPGMM-HMM      CA   CZ   HU   RU
Train set:          EN   FR   MA   EN   FR   MA   pooled EN, FR and MA
MUBNF               ✓    ✓    ✓
OSBNF1                                            ✓
OSBNF2                                            ✓    ✓    ✓    ✓
LI-BNF1             ✓    ✓    ✓                   ✓
LI-BNF2             ✓    ✓    ✓                   ✓    ✓    ✓    ✓
HMM-MUBNF                          ✓    ✓    ✓
HMM-LI-BNF1                        ✓    ✓    ✓    ✓
HMM-LI-BNF2                        ✓    ✓    ✓    ✓    ✓    ✓    ✓

There are two types of tasks for MTL, namely the DPGMM-HMM alignment prediction tasks and the out-of-domain ASR label prediction tasks. In the first case, three tasks are included, i.e., the frame alignments generated by the DPGMM-HMM AMs, one for each target zero-resource language. In the second case, four tasks corresponding to the Cantonese, Czech, Hungarian and Russian recognizers' senone labels are included. The senone labels are generated by decoding with the LM-to-AM weight ratio set to …. After MTL-DNN training, …-dimension HMM-LI-BNFs are extracted for the ABX task evaluation (the prefix 'HMM-' emphasizes the use of DPGMM-HMM alignments rather than DPGMM cluster labels). Similarly, HMM-MUBNFs, extracted by the MTL-DNN with DPGMM-HMM alignment tasks only, and OSBNFs, extracted by the MTL-DNN with one or more out-of-domain recognizers' senone labels, are also evaluated on the ABX task. The dimensions of both HMM-MUBNFs and OSBNFs are …. As illustrated in Figure 5.2, we define several BNF representations according to the tasks included in MTL-DNN training. The configurations for (HMM-)MUBNF, OSBNF and (HMM-)LI-BNF are listed in Table 5.1.

DNN structures
The MTL-DNN is implemented with three different model structures: MLP, LSTM and BLSTM. The input features are …-dimension fMLLRs spliced with context size ±…. For the MLP, we follow the structure of our baseline system, i.e., the dimensions of the shared hidden layers are …. Sigmoid activation is used
in all hidden layers, except that the neurons in the BN layer use linear activation functions. The learning rate for MLP training is set to … at the beginning and halved when no improvement is observed on a cross-validation set. The mini-batch size is ….

The LSTM model comprises … LSTM layers with …-dimension cell activation vectors and …-dimension outputs. A …-dimension BN layer followed by a …-dimension fully-connected (FC) layer is placed on top of the LSTMs. The BLSTM model has … pairs of forward and backward LSTM layers; each bi-directional layer has …-dimension cell activation vectors and …-dimension outputs. A BN layer followed by an FC layer is placed on top of the BLSTMs, with the same configuration as in the LSTM. The activation function in the (B)LSTMs is tanh. The learning rate is … initially and halved under the same criterion as for the MLP. The truncated back-propagation through time (BPTT) algorithm [119] is used to train the (B)LSTMs, with a fixed time step T_bptt = 20. Note that the model parameters of the LSTM and BLSTM structures were tuned in preliminary studies.

TLBNF generation
The TLBNFs for the target zero-resource languages are generated by applying the Cantonese DNN-HMM AM as a feature extractor. During TLBNF extraction, all parameters of the DNN-HMM are fixed. The fMLLR features of the target languages are fed into the DNN-HMM up to its BN layer to generate the TLBNFs.
Tables 5.2 and 5.3 provide a master summary to facilitate performance comparison among the different BNF representations, in the across- and within-speaker conditions respectively. The feature representations being compared are organized in three groups, marked A, B and C in the tables. Systems in groups A and B all use multilingual BNF representations, which are learned with different combinations of frame labels: the DPGMM labels are used in group A and the DPGMM-HMM labels in group B. As described in Section 5.3.1 and Table 5.1, OSBNF1 and OSBNF2 are trained with out-of-domain ASR senone labels, and LI-BNF1 and LI-BNF2 are trained with both DPGMM labels and out-of-domain ASR senone labels.
Table 5.2:
Across-speaker ABX error rates (%) on BNFs learned by our proposed multiple frame labeling approaches and the state of the art of ZeroSpeech 2017. MLP is adopted as the DNN structure. Label filtering is not applied.

                                    English             French              Mandarin            Avg.
                                    1s    10s   120s    1s    10s   120s    1s    10s   120s
   MUBNF (in Section 4.5.1)         10.9  9.5   8.9     15.2  13.0  12.0    10.5  8.9   8.2     10.8
A  OSBNF1                           10.0  9.7   8.6     13.9  13.4  11.6    9.0   8.4   7.5     10.2
   OSBNF2                           9.5   9.2   7.9     13.1  13.0  11.3    9.4   8.7   7.9     10.0
   LI-BNF1                          10.0  8.9   8.2     14.3  12.9  11.5    9.5   8.5   7.7     10.2
   LI-BNF2                          9.4   8.7   7.8     13.4  12.7  11.0    9.3   8.6   7.7     9.8
B  HMM(S)-MUBNF                     –     –     –       –     –     –       –     –     –       –
   HMM(P)-MUBNF                     –     –     –       –     –     –       –     –     –       –
   HMM(P)-LI-BNF1                   –     –     –       –     –     –       –     –     –       –
   HMM(P)-LI-BNF2                   –     –     –       –     –     –       –     –     –       –
C  TLBNF                            10.6  9.6   8.7     14.2  13.2  11.5    8.5   7.6   6.7     10.1
   TLBNF+LI-BNF1                    10.3  9.3   8.4     13.9  12.9  11.4    8.5   7.6   6.7     9.9
   TLBNF+LI-BNF2                    10.4  9.4   8.5     14.0  13.0  11.3    8.5   7.6   6.6     9.9
   TLBNF+HMM(P)-LI-BNF1             10.3  9.4   8.4     13.9  12.9  11.3    8.5   7.6   6.6     9.9
   TLBNF+MUBNF+OSBNF1               9.9   9.0   8.2     13.6  12.6  11.1    8.4   7.7   6.7     9.7
   TLBNF+HMM(P)-MUBNF+OSBNF1        10.0  9.0   8.2     13.6  12.6  11.1    8.4   7.6   6.7     9.7
   TLBNF+HMM(P)-MUBNF+OSBNF2        10.0  9.0   8.2     13.6  12.6  11.1    8.4   7.6   6.7     9.7
Heck et al. [32] (1st in ZRSC17)    10.1  8.7   8.5     13.6  11.7  11.3    8.8   7.4   7.3     9.7
Chorowski et al. [37]               9.3   9.3   9.3     11.9  11.4  11.6    8.6   8.5   8.5     9.8
Supervised topline [14]             8.6   6.9   6.7     10.6  9.1   8.9     12.0  5.7   5.1     8.2

In group B, 'HMM(S)' and 'HMM(P)' denote the use of state-level and phone-level HMM alignments, respectively, for label generation. Systems in group C are built on different combinations of BNF features; the '+' sign represents the concatenation of two frame-level feature representations. The experimental results shown in Tables 5.2 and 5.3 are obtained using the MLP structure in the MTL-DNN, and label filtering is not applied at this stage. In addition, two representative systems that achieved very good performance in ZeroSpeech 2017 [32, 37] are also listed in the tables.

Effectiveness of multilingual BNFs
The following observations can be made on the performance of the learned multilingual BNF representations:

(1). The effectiveness of MUBNF, reported in Section 4.5.1, can be improved by training the MTL-DNN with additional out-of-domain ASR senone labels. With the Cantonese ASR's senone labels included as one of the training tasks, the LI-BNF1 representation reduces the within-/across-speaker ABX error rates by absolute 0.2/0.6 as compared to MUBNF.
Table 5.3:
Within-speaker ABX error rates (%) on BNFs learned by our proposed multiple frame labeling approaches and the state of the art of ZeroSpeech 2017. MLP is adopted as the DNN structure. Label filtering is not applied.

                                    English             French              Mandarin            Avg.
                                    1s    10s   120s    1s    10s   120s    1s    10s   120s
   MUBNF (in Section 4.5.1)         7.4   6.9   6.3     9.6   9.0   8.1     9.8   8.8   8.1     8.2
A  OSBNF1                           7.2   7.1   6.3     10.2  9.7   8.7     9.1   8.6   7.6     8.3
   OSBNF2                           6.8   6.7   5.9     9.5   9.2   8.3     9.7   8.9   8.0     8.1
   LI-BNF1                          6.9   6.6   6.1     9.5   9.2   8.4     9.2   8.5   7.9     8.0
   LI-BNF2                          6.6   6.4   5.7     9.1   9.3   8.2     9.5   8.7   8.1     8.0
B  HMM(S)-MUBNF                     7.2   6.7   6.3     9.7   9.2   8.3     10.4  9.2   8.5     8.4
   HMM(P)-MUBNF                     7.1   6.6   6.2     9.4   9.1   7.8     9.9   8.8   8.2     8.1
   HMM(P)-LI-BNF1                   6.8   6.3   5.8     9.1   8.7   7.8     9.1   8.5   7.6     7.7
   HMM(P)-LI-BNF2                   6.6   6.4   5.7     9.2   8.8   8.1     9.2   8.6   7.9     7.8
C  TLBNF                            7.2   6.8   6.1     9.6   9.0   8.0     8.7   7.6   6.8     7.8
   TLBNF+LI-BNF1                    7.0   6.6   6.0     9.3   8.8   7.9     8.6   7.5   6.7     7.6
   TLBNF+LI-BNF2                    7.1   6.6   6.0     9.4   8.9   7.8     8.7   7.5   6.8     7.6
   TLBNF+HMM(P)-LI-BNF1             7.0   6.6   6.0     9.4   8.8   7.8     8.6   7.5   6.7     7.6
   TLBNF+MUBNF+OSBNF1               6.8   6.4   5.8     9.0   8.8   7.8     8.5   7.7   6.8     7.5
   TLBNF+HMM(P)-MUBNF+OSBNF1        6.8   6.4   5.7     8.8   8.7   7.5     8.4   7.5   6.8     7.4
   TLBNF+HMM(P)-MUBNF+OSBNF2        6.7   6.4   5.8     9.0   8.8   7.5     8.3   7.5   6.8     7.4
Heck et al. [32] (1st in ZRSC17)    6.9   6.2   6.0     9.7   8.7   8.4     8.8   7.9   7.8     7.8
Chorowski et al. [37]               5.8   5.7   5.8     7.1   7.0   6.9     7.4   7.2   7.1     6.7
Supervised topline [14]             6.5   5.3   5.1     8.0   6.8   6.8     9.5   4.2   4.0     6.2

When the senone labels of Czech, Hungarian and Russian are added, the resulting LI-BNF2 representation shows a further improvement of absolute 0.4 under the across-speaker condition. This shows that out-of-domain acoustic-phonetic knowledge provides complementary information to the in-domain clustering labels for feature learning. The performance gain of OSBNF2 over OSBNF1, as well as that of LI-BNF2 over LI-BNF1, confirms the benefit of exploiting a wider coverage of language resources. The performance of OSBNF2 is inferior to OSBNF1 on the Mandarin test set, but not on English and French. It is noted that OSBNF1 is learned using the Cantonese ASR senone labels, while OSBNF2 additionally involves the three European languages. Cantonese, being a Chinese dialect, is apparently closer to Mandarin than Czech, Hungarian and Russian in terms of acoustic-phonetic properties. The experimental results imply that frame labels generated by involving highly-mismatched out-of-domain languages may be of low quality and not suitable for feature learning.
(2). As discussed in Sections 5.1.2 and 3.2, DPGMM-HMM labels are obtained by modeling the temporal dependency of speech, while DPGMM labels are determined under the assumption that neighboring speech frames are independent. Comparing the corresponding systems in groups A and B of Tables 5.2 and 5.3, it is noted that DPGMM-HMM labels perform slightly better than DPGMM labels. The ABX error rates attained with HMM(P)-MUBNF, HMM(P)-LI-BNF1 and HMM(P)-LI-BNF2 are about absolute …-… lower than those with MUBNF, LI-BNF1 and LI-BNF2, respectively, except for HMM(P)-LI-BNF2 under the across-speaker condition. This demonstrates that capturing temporal dependency in speech is beneficial to feature learning for subword modeling, as was found in [46]. It is also noted that phone-level HMM alignments are better than state-level ones.

(3). Combining different types of BNF representations leads to further performance improvement. Specifically, by concatenating HMM(P)-MUBNF, OSBNF1 and TLBNF, the best ABX error rates under both the within-speaker and across-speaker conditions are achieved (7.4 and 9.7). It is found that BNFs learned from in-domain unsupervised data (HMM(P)-MUBNF, OSBNF1) and via transfer learning (TLBNF) can be jointly used to compose an optimal feature representation that is better than any individual BNF.

The best performance attained in this study is competitive with the best submitted system of the ZeroSpeech 2017 challenge, which is based on the combination of multiple DPGMM posteriorgrams [32]. Those posteriorgrams were generated with fMLLRs estimated in an unsupervised manner under different implementation parameters. The combination of posteriorgrams led to … and … relative error rate reduction under the within-speaker and across-speaker conditions, compared to the use of a single posteriorgram representation. In our work, concatenating the three aforementioned BNF representations results in … and … relative error rate reduction, as compared with the best system with a single BNF. It must be noted that no out-of-domain transcribed speech was involved in the system of [32].

In a very recent work [37], the vector quantized VAE (VQ-VAE) was applied to
develop a system for unsupervised subword modeling. The reported average ABX error rate was 6.7 for the within-speaker condition, which is the best among all reported systems so far. For the across-speaker condition, our proposed systems with combined BNF features have slightly better performance than VQ-VAE (9.7 vs. 9.8). Our systems are found to be more effective on long utterances than VQ-VAE. In Tables 5.2 and 5.3, it is noted that the performance of VQ-VAE does not depend on utterance duration; for English and Mandarin, the ABX error rates are almost exactly the same between the cases of 1s and 120s. One possible reason is that the VQ-VAE system does not perform explicit utterance-level speaker normalization on the input features. On the contrary, the BNF representations investigated in this study perform significantly better on longer utterances (10s & 120s) than on 1s ones. It is also noted that our systems are more effective for Mandarin in the across-speaker condition. This may be due to the use of Cantonese speech in feature learning; VQ-VAE may be over-fitting to Mandarin due to the small data size [37].

Effectiveness of label filtering
The effectiveness of the proposed label filtering algorithm is evaluated with the HMM(P)-MUBNF representation, which is trained exclusively on DPGMM-HMM labels, without involving out-of-domain speech data. Algorithm 5.1 requires one tunable parameter P, i.e., the percentage of frame labels to be retained. The average ABX error rates attained with different values of P are plotted in Figure 5.5. P = 1 means that all labels are kept, which is the setting used to obtain the results in Tables 5.2 and 5.3.

Under both the within-speaker and across-speaker conditions, the optimal values of P are in the range of … to …. That is, when on average about …-… of the frame labels are removed, the ABX error rates could be slightly reduced. This indicates that a certain portion of the labels are indeed not reliable. However, if too many labels are removed, e.g., more than …, the system performance degrades significantly, because some good labels are lost.

The proposed label filtering method is very simple in that only the occurrence counts of the labels are considered. Figure 5.5 shows that this criterion is appropriate to a certain extent. However, there may exist infrequent subword units that are meaningful and crucial in conveying linguistic content.
Figure 5.5:
ABX error rates (%) on the HMM(P)-MUBNF representation w.r.t. the percentage of labels retained. The performance is computed by averaging over all languages.

In [111, 112], it was suggested to reduce the number of DPGMM clusters without ignoring any frame labels. Since these studies were carried out on a different database, a direct comparison of system performance cannot be made.

Comparison of DNN structures
In this part, layer-wise structures of the MTL-DNN other than MLP are used. Tables 5.4 and 5.5 compare the system performances obtained using MLP, LSTM and BLSTM as the DNN structure. The feature representations investigated include MUBNF, HMM(P)-MUBNF and HMM(P)-LI-BNF1; label filtering is not applied.

It is noted that LSTM and BLSTM do not perform as well as MLP on all three types of BNF representations. Experiments were carried out with different parameter settings for LSTM and BLSTM, and the system performance remained
Table 5.4:
Across-speaker ABX error rate (%) comparison of DNN structures.

                          English             French              Mandarin            Avg.
                          1s    10s   120s    1s    10s   120s    1s    10s   120s
MUBNF           MLP       10.9  9.5   8.9     15.2  13.0  12.0    10.5  8.9   8.2     10.8
                LSTM      10.4  9.6   9.0     14.6  13.3  12.3    10.9  9.3   8.6     10.9
                BLSTM     10.4  9.6   9.0     14.7  13.3  12.1    10.7  9.3   8.6     10.9
HMM(P)-MUBNF    MLP       10.4  9.2   8.7     14.5  12.7  11.7    10.4  8.9   8.2     10.5
                LSTM      10.0  9.3   8.6     14.3  13.1  11.8    10.7  9.3   8.6     10.6
                BLSTM     10.1  9.4   8.9     14.2  13.0  11.9    10.8  9.4   8.7     10.7
HMM(P)-LI-BNF1  MLP       9.7   8.7   8.0     13.7  12.3  11.1    9.7   8.4   7.6     9.9
                LSTM      9.6   9.1   8.1     14.1  13.3  11.6    10.2  9.1   8.0     10.3
                BLSTM     9.5   9.0   8.2     13.7  13.0  11.6    9.7   8.7   7.8     10.1
Table 5.5:
Within-speaker ABX error rate (%) comparison of DNN structures.

                          English             French              Mandarin            Avg.
                          1s    10s   120s    1s    10s   120s    1s    10s   120s
MUBNF           MLP       7.4   6.9   6.3     9.6   9.0   8.1     9.8   8.8   8.1     8.2
                LSTM      7.4   7.1   6.8     10.0  9.5   8.7     10.4  9.5   8.7     8.7
                BLSTM     7.4   7.1   6.7     9.9   9.5   8.9     10.4  9.4   8.7     8.7
HMM(P)-MUBNF    MLP       7.1   6.6   6.2     9.4   9.1   7.8     9.9   8.8   8.2     8.1
                LSTM      7.2   6.8   6.4     9.9   9.4   8.7     10.4  9.5   8.8     8.6
                BLSTM     7.3   6.9   6.5     9.6   9.5   8.4     10.5  9.4   9.0     8.6
HMM(P)-LI-BNF1  MLP       6.8   6.3   5.8     9.1   8.7   7.8     9.1   8.5   7.6     7.7
                LSTM      6.7   6.6   5.9     9.5   9.4   8.2     9.6   8.9   7.9     8.1
                BLSTM     7.0   6.6   6.1     9.3   9.2   8.2     9.4   8.7   8.0     8.1
Figure 5.6:
Average ABX error rates (%) of the HMM(P)-MUBNF representation implemented by MLP, LSTM and BLSTM over different utterance lengths.

largely unchanged. Figure 5.6 gives the performance of HMM(P)-MUBNF learned by MLP, LSTM and BLSTM for each target language. For English (EN), the different DNN structures have similar performance. For French (FR) and Mandarin (MA), the advantage of MLP over (B)LSTM is more prominent. This may be related to
the fact that the amount of training data for English is significantly greater than those for French and Mandarin. The advantage of LSTM and BLSTM over MLP in conventional supervised acoustic modeling has been widely recognized and attributed to their capability of capturing temporal characteristics of speech. With limited training data, the benefits of recurrent structures cannot be fully exploited. In our systems, contextual information is incorporated via the use of DPGMM-HMM labels, and its effectiveness has been demonstrated by the experimental results.
This chapter addressed the problem of frame labeling in unsupervised subword modeling. Various approaches were proposed and evaluated on the feature representation learning task of ZeroSpeech 2017. DPGMM-HMM acoustic modeling is applied to capture contextual information in the speech signal and to generate time alignments as the desired frame labels. A label filtering algorithm is proposed to discard unreliable initial labels from DPGMM clustering, so as to benefit DPGMM-HMM frame labeling. Multiple out-of-domain ASRs are utilized to produce language-mismatched, phonetically-informed labels as complementary information to the in-domain DPGMM-HMM labels.

The proposed approaches are evaluated in thorough experimental studies. The results demonstrate the advantage of DPGMM-HMM frame labels over DPGMM labels. The label filtering algorithm is effective in further improving the DPGMM-HMM labels. The system trained with both out-of-domain ASR based labels and DPGMM-HMM labels achieves better performance than that trained with either type of labels alone. Combining different types of BNFs by vector concatenation leads to further performance improvement. The best performance achieved by our proposed approaches is 9.7 in terms of across-speaker ABX error rate; it is equal to the performance of the best submitted system in ZeroSpeech 2017 and better than other recently reported systems.

Our proposed approaches are expected to be effective for combinations of languages other than those in ZeroSpeech 2017. Nevertheless, our investigation suggests that the closeness between the target languages and the out-of-domain languages
and the amount of available training data for the individual target languages might have a significant impact on the goodness of the learned features.

Chapter 6
Unsupervised unit discovery
This chapter is focused on unsupervised unit discovery, i.e., automatically discovering the basic subword units of a language without any transcribed data. In this study, the problem is tackled with the acoustic segment modeling (ASM) approach [53]. As explained in Section 2.2.1, ASM comprises the sequential steps of initial segmentation, segment labeling and iterative subword modeling. We focus on initial segmentation and segment labeling, and propose to exploit out-of-domain, language-mismatched speech resources in these two steps.

In initial segmentation, one or multiple language-mismatched phone recognizers are used to decode the target speech into phone sequences with time alignment. The phone boundaries are regarded as the segmentation results. In segment labeling, phone posteriorgrams derived from the language-mismatched phone recognizers are used as input features to spectral clustering [25] to generate segment-level cluster labels. Language-mismatched phone recognizers provide fine-grained speech representations for the out-of-domain languages, which potentially benefit unit discovery for the in-domain zero-resource languages.

Linguistic relevance of the discovered units is one of the major concerns in unsupervised unit discovery. In the literature, the performance of unsupervised unit discovery is usually evaluated by clustering-based metrics, e.g., purity. These metrics are appropriate for a straightforward comparison of the overall efficacy, but are not able to provide detailed insights into the fitness of individual clusters and their relation. In our concerned problem, these metrics do not reflect the linguistic relevance of the discovered acoustic units.
Figure 6.1:
Exploiting language-mismatched phone recognizers in unsupervised unit discovery.

To address this problem, a Kullback-Leibler (KL) divergence based distance metric is defined to analyze the linguistic relevance of acoustic units discovered from an unknown language.

This chapter is organized as follows. Section 6.1 discusses our approaches to exploiting language-mismatched phone recognizers in unsupervised unit discovery. Section 6.2 introduces the analysis of the linguistic relevance of discovered subword units by the proposed KL divergence based method. Experiments are presented in Section 6.3. Section 6.4 gives the summary of this chapter.
Figure 6.1 illustrates the proposed approach to unsupervised unit discovery. Given untranscribed speech utterances of a target language, one or more phone recognizers pre-trained for other languages are utilized to generate phone boundaries.
Figure 6.2:
Example of segmentation of an English utterance using Czech, Hungarian and Russian phone recognizers.
Single phone recognizer
Phone recognition is the process of decoding a sequence of speech observations. The decoding result comprises a sequence of phone symbols and their time boundaries. With a single phone recognizer, the phone boundaries can be used directly as an initial segmentation of the input utterance. Only the time boundaries are used, while the phone identities are ignored.
Multiple phone recognizers
A single phone recognizer may not be able to provide a good coverage of the phonetic space to support the modeling of a new language, especially when there is a significant mismatch between the two languages concerned. It would be advantageous to make use of multiple phone recognizers from different languages to derive a better informed initial segmentation. Figure 6.2 compares the segmentation results obtained by applying three language-mismatched phone recognizers to an English utterance. The upper three panes show the phone-level time alignments given by the Czech, Hungarian and Russian phone recognizers, respectively; the fourth pane shows the manual segmentation of the speech waveform with English phone labels.

The following method is developed to infer a unified speech segmentation from multiple decoding results. Given N phone recognizers, let S_j denote the segment boundaries produced by the j-th recognizer,

S_j = \{s_{j1}, s_{j2}, \ldots, s_{jK_j}\}, \quad j = 1, 2, \ldots, N,   (6.1)
where s_{ji} is the location of the i-th segment boundary (i = 1, 2, \ldots, K_j). Algorithm 6.1 processes \{S_1, \ldots, S_N\} by merging segment boundaries and eliminating the boundaries of segments shorter than a pre-defined threshold of 30 ms. The output of Algorithm 6.1 is denoted as S*.

Algorithm 6.1:
Fusion of segmentation results
Input:
Segment boundaries resulting from N phone recognizers: S_1, S_2, \ldots, S_N.
Output:
Fusion of segment boundaries S*.
1. Concatenate all boundaries into
   S_{\text{long}} = \{s_{11}, \ldots, s_{1K_1}, s_{21}, \ldots, s_{2K_2}, \ldots, s_{N1}, \ldots, s_{NK_N}\}.   (6.2)
2. Sort S_{\text{long}} in ascending order, denoted as
   S_{\text{sort}} = \{s_1, s_2, \ldots, s_K\}, \quad K = \sum_{j=1}^{N} K_j.   (6.3)
3. Drop coinciding elements in S_{\text{sort}}, and denote the output by
   S_{\text{sort\_uni}} = \{su_1, su_2, \ldots, su_{K_u}\}.   (6.4)
4. Eliminate segments represented by S_{\text{sort\_uni}} with duration less than 30 ms:
   if su_m - su_{m-1} = 10 ms, then drop su_m;
   else if su_m - su_{m-1} = 20 ms, then su_{m-1} \leftarrow (su_{m-1} + su_m)/2 and drop su_m;
   else do nothing.
5. Return S*.

After initial segmentation, each training utterance is divided into a number of variable-length segments. The speech segments from all training utterances are organized into a limited number of clusters based on their acoustic similarities. Segment labels are generated according to the clustering results.
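A Python sketch of Algorithm 6.1 follows, assuming boundaries are expressed in milliseconds on a 10 ms frame grid, so that the two merge cases correspond to one- and two-frame segments; the midpoint merge in the 20 ms case is an assumption of this sketch.

def fuse_boundaries(boundary_sets):
    """boundary_sets: list of boundary lists, one per phone recognizer."""
    # Steps 1-3: concatenate, sort and drop coinciding boundaries.
    merged = sorted({b for bounds in boundary_sets for b in bounds})
    fused = [merged[0]]
    for b in merged[1:]:
        gap = b - fused[-1]
        if gap == 10:                  # 10 ms segment: drop this boundary
            continue
        elif gap == 20:                # 20 ms segment: replace the previous
            fused[-1] = (fused[-1] + b) // 2  # boundary by the midpoint
        else:
            fused.append(b)
    return fused

print(fuse_boundaries([[0, 30, 120], [0, 40, 60, 120]]))
# boundaries 30 and 40 are 10 ms apart, so 40 is dropped -> [0, 30, 60, 120]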
Feature representation
Previous studies showed that posterior features are more robust than conventional spectral features like MFCCs [120]. In this study, segment clustering is performed with a phone posteriorgram representation. Let us first consider the case where one phone recognizer is available. Let C = \{c_1, c_2, \ldots, c_M\} denote the M phones covered by the recognizer. The posterior feature vector representing frame t is given as,

q_t = [p(c_1|o_t), p(c_2|o_t), \ldots, p(c_M|o_t)]^{\mathsf T},   (6.5)

where p(c_m|o_t), m = 1, 2, \ldots, M, denotes the posterior probability of phone c_m given the observation o_t. If N language-mismatched phone recognizers are used for decoding, there are N phone posterior feature vectors q_t^1, q_t^2, \ldots, q_t^N for each time frame; they are concatenated to form a single feature vector,

\hat{q}_t = [(q_t^1)^{\mathsf T}, (q_t^2)^{\mathsf T}, \ldots, (q_t^N)^{\mathsf T}]^{\mathsf T}.   (6.6)

The dimension of \hat{q}_t is

\hat{M} = \sum_{i=1}^{N} M_i,   (6.7)

where M_i is the number of phones of the i-th recognizer. For a given initial segmentation, the segment-level posterior probabilities are obtained as,

\hat{x}_k = \frac{1}{e_k - b_k + 1} \sum_{t=b_k}^{e_k} \hat{q}_t, \quad k = 1, 2, \ldots, K,   (6.8)

where K is the number of segments, and b_k and e_k are the starting and ending time frames of the k-th segment. The segment-level phone posteriorgram can be expressed in matrix form as,

X = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_K].   (6.9)
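Equations (6.6)-(6.9) amount to concatenating the per-recognizer frame posteriors and averaging them within each segment. A NumPy sketch with toy data, assuming the posteriors are given as (num_frames, M_i) arrays on a common frame grid:

import numpy as np

def segment_posteriorgram(frame_posteriors, segments):
    """frame_posteriors: list of (T, M_i) arrays, one per recognizer.
    segments: list of (b_k, e_k) inclusive frame index pairs."""
    q_hat = np.concatenate(frame_posteriors, axis=1)   # Eq. (6.6): (T, M_hat)
    # Eq. (6.8): average the concatenated posteriors over each segment.
    x_hat = np.stack([q_hat[b:e + 1].mean(axis=0) for b, e in segments])
    return x_hat.T                                     # X, Eq. (6.9): M_hat x K

# Toy usage: two recognizers with 3 and 2 phones, 6 frames, 2 segments.
posts = [np.random.dirichlet(np.ones(3), size=6),
         np.random.dirichlet(np.ones(2), size=6)]
X = segment_posteriorgram(posts, [(0, 2), (3, 5)])
print(X.shape)  # (5, 2)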
Algorithm 6.2:
Segment clustering
Input: \hat{M} \times K phone posteriorgram X, cluster number R.
Output: R clusters and the cluster label of each segment.
1. Compute A = XX^{\mathsf T}.
2. Compute L = I - D^{-1/2} A D^{-1/2}, where D = \text{diag}\{A \cdot [1, 1, \cdots, 1]^{\mathsf T}\}.
3. Construct the matrix Y = [y_1, \cdots, y_R], where y_r is the eigenvector of L with the r-th smallest eigenvalue.
4. Normalize each row vector of Y to have unit l_2-norm.
5. Apply k-means to the \hat{M} rows of Y to find R clusters.
6. Assign the i-th phone to cluster r if the i-th row vector of Y was assigned to cluster r (i = 1, 2, \ldots, \hat{M}; r = 1, 2, \ldots, R).
7. Score each segment with the R clusters, and label it with the cluster that scores highest.
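A NumPy sketch of Algorithm 6.2 is given below; scikit-learn's KMeans stands in for the k-means step, and the segment-scoring rule in the final step, which the algorithm leaves unspecified, is here an assumed sum of per-cluster posterior mass.

import numpy as np
from sklearn.cluster import KMeans

def cluster_segments(X, R):
    """X: (M_hat, K) segment-level posteriorgram; R: number of clusters."""
    A = X @ X.T                                        # affinity between phones
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)               # ascending eigenvalues
    Y = eigvecs[:, :R]                                 # R smallest eigenvectors
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)      # unit l2-norm rows
    phone_to_cluster = KMeans(n_clusters=R, n_init=10).fit_predict(Y)
    # Score each segment by the total posterior mass of the phones assigned
    # to each cluster, then label it with the highest-scoring cluster.
    scores = np.stack([X[phone_to_cluster == r].sum(axis=0)
                       for r in range(R)])             # (R, K)
    return scores.argmax(axis=0)                       # cluster label per segment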
Clustering algorithm

The approach of spectral clustering is applied to the segment-level phone posteriorgrams. In [25], spectral clustering was applied to the Gaussian components of a GMM; here the clustering is performed on the language-mismatched phone classes. In the posteriorgram representation X, each row contains the posterior probabilities of a specific phone, so the problem is to cluster the \hat{M} rows of X. Details of the spectral clustering algorithm are provided in Algorithm 6.2.

After clustering, the speech segments are labeled with their respective cluster indices. Each cluster is regarded as a discovered subword unit. The segment labels give a kind of time-aligned pseudo transcription that can be used to facilitate supervised acoustic modeling of the target language. In the ideal case, the pseudo transcriptions are consistent with the ground-truth transcriptions of the target language.

Existing studies usually measure the efficacy of unsupervised unit discovery by clustering-based evaluation metrics, such as purity [25], normalized mutual information (NMI) [25] and average precision (AP) [69]. One drawback is that they do not provide detailed insights into the fitness of the individual clusters and the relation
between the clusters. Let us consider the clustering results produced by two different clustering algorithms (or the same algorithm with different parameter settings). In the first case, the degree of overlap between an automatically learned cluster and its closest ground-truth phone varies greatly from one cluster to another, whereas in the second case, the degrees of overlap are equal across all clusters. Although the two sets of clustering results may give the same purity value, their linguistic implications could be very different.
In this study, a new distance metric is proposed to analyze in detail the linguistic relevance of discovered subword units. This metric is based on KL divergence.
KL divergence
KL divergence, also called information divergence or relative entropy, is a measure of the difference between two probability distributions [121]. The KL divergence between two discrete probability distributions P and Q is defined as,

D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.   (6.10)

Equation (6.10) gives a non-symmetric measure. In this study, the symmetric form of KL divergence is adopted,

D_{KL}(P, Q) = D_{KL}(P \| Q) + D_{KL}(Q \| P)   (6.11)
            = \sum_i (P(i) - Q(i)) \cdot \log \frac{P(i)}{Q(i)}.   (6.12)

KL divergence can be used to model implicit speech variation related to phonetic context, pronunciation variation, speaker characteristics, etc. It has been applied to ASR acoustic modeling [122, 123], data selection [124], and cross-lingual TTS and voice conversion [125, 126].
Distance between discovered unit and ground-truth phone
The symmetric KL divergence is used to measure the distance between the posterior probability distributions of each pair of discovered subword unit and ground-truth phone of the target language. Let $\{g_1, g_2, \ldots, g_K\}$ denote the $K$ ground-truth phones, and $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{L_k}\}$ denote the posterior probability vectors of the $L_k$ frames that are labeled as phone $g_k$ according to the ground-truth transcription. $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{L_k}\}$ is a subset of the in total $T$ speech frames $\{\hat{\mathbf{q}}_1, \hat{\mathbf{q}}_2, \ldots, \hat{\mathbf{q}}_T\}$ (refer to Eqt. (6.6)). The centroid of $g_k$ is computed as,
$$\bar{\mathbf{v}}_k = \frac{1}{L_k}\sum_{i=1}^{L_k} \mathbf{v}_i. \tag{6.13}$$
$\bar{\mathbf{v}}_k$ is treated as the representative of $g_k$ in the phonetic space. Let $\{u_1, u_2, \ldots, u_R\}$ denote the $R$ discovered subword units, and let $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_{N_r}\} \subset \{\hat{\mathbf{q}}_1, \hat{\mathbf{q}}_2, \ldots, \hat{\mathbf{q}}_T\}$ be the posterior probability vectors of the frames assigned to $u_r$ ($r = 1, 2, \ldots, R$). The distance between $u_r$ and $g_k$ is defined as,
$$D(u_r, g_k) = \frac{1}{N_r}\sum_{j=1}^{N_r} D_{KL}(\boldsymbol{\mu}_j, \bar{\mathbf{v}}_k) \tag{6.14}$$
$$= \frac{1}{N_r}\sum_{j=1}^{N_r} \sum_{m=1}^{\hat{M}} \left( \mu_j(m) - \bar{v}_k(m) \right) \log \frac{\mu_j(m)}{\bar{v}_k(m)}. \tag{6.15}$$
Here, the distortion between the two acoustic models $u_r$ and $g_k$ is measured by the KL divergence-based distance between posterior features in the phonetic space. An illustration of the computation process of $D(u_r, g_k)$ is shown in Figure 6.3.

Closest ground-truth phones
Let $g_{k^*}(u_r)$ denote the closest ground-truth phone of subword unit $u_r$, where
$$k^* = \arg\min_{k} D(u_r, g_k). \tag{6.16}$$
For simplicity, $g_{k^*}(u_r)$ is abbreviated as $g^*(u_r)$. The distance between $u_r$ and $g^*(u_r)$ is denoted as $D^*(u_r)$ and is computed as,
$$D^*(u_r) = D(u_r, g^*(u_r)). \tag{6.17}$$

Figure 6.3: Illustration of the computation process of $D(u_r, g_k)$: the symmetric KL divergences between the $N_r$ posterior vectors of a discovered unit $u_r$ and the centroid $\bar{\mathbf{v}}_k$ of a ground-truth phone $g_k$ are averaged to give $D(u_r, g_k)$.

Let $g_{k^{**}}(u_r)$ denote the second closest ground-truth phone of subword unit $u_r$, where
$$k^{**} = \arg\min_{k \neq k^*} D(u_r, g_k). \tag{6.18}$$
The distance between $u_r$ and $g^{**}(u_r)$ is denoted as $D^{**}(u_r)$ and is computed as,
$$D^{**}(u_r) = D(u_r, g^{**}(u_r)). \tag{6.19}$$
The discriminability of the cluster $u_r$ can be measured by
$$\Delta D^*(u_r) = \left| D^{**}(u_r) - D^*(u_r) \right|. \tag{6.20}$$
A small value of $D^*(u_r)$ means that the discovered subword unit matches well with one of the ground-truth phones. Meanwhile, a large value of $\Delta D^*(u_r)$ indicates that $g^*(u_r)$ is discriminatively mapped to $u_r$. The averages of $D^*(u_r)$, $D^{**}(u_r)$ and $\Delta D^*(u_r)$ over all discovered subword units $\{u_r\}$ in a target language are defined as,
$$\bar{D}^*(u_r) = \frac{\sum_{r=1}^{R} D^*(u_r)}{R}, \tag{6.21}$$
$$\bar{D}^{**}(u_r) = \frac{\sum_{r=1}^{R} D^{**}(u_r)}{R}, \tag{6.22}$$
$$\overline{\Delta D^*}(u_r) = \left| \bar{D}^{**}(u_r) - \bar{D}^*(u_r) \right|. \tag{6.23}$$
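To make the unit-to-phone mapping concrete, the following is a minimal NumPy sketch of Equations (6.13)-(6.20). The frame-level posterior vectors are assumed to have been grouped by discovered unit and by ground-truth phone beforehand; all names are illustrative, and `sym_kl()` from the earlier sketch is repeated so the snippet is self-contained.

```python
# A sketch of Eqs. (6.13)-(6.20): closest / second-closest phones per unit.
import numpy as np

def sym_kl(p, q, eps=1e-10):
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

def unit_phone_analysis(unit_frames, phone_frames):
    """unit_frames[r]: (N_r, M_hat) posteriors of frames assigned to u_r;
    phone_frames[k]: (L_k, M_hat) posteriors of frames labeled as g_k.
    Returns (k_star, D_star, D_star2, delta) for each discovered unit."""
    centroids = [f.mean(axis=0) for f in phone_frames]          # Eq. (6.13)
    results = []
    for frames in unit_frames:
        # Eq. (6.14): mean symmetric KL from the unit's frames to each centroid
        d = np.array([np.mean([sym_kl(mu, v) for mu in frames])
                      for v in centroids])
        k1, k2 = np.argsort(d)[:2]                              # Eqs. (6.16), (6.18)
        # Eqs. (6.17), (6.19), (6.20): distances and the discriminability margin
        results.append((k1, d[k1], d[k2], abs(d[k2] - d[k1])))
    return results
```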
The distance measure can be further extended to evaluate the inherent variability of each ground-truth phone. For the phone $g_k$, the inherent variability $\tilde{D}(g_k)$ and the average inherent variability over all $\{g_k\}$ are defined as,
$$\tilde{D}(g_k) = \frac{1}{L_k}\sum_{j=1}^{L_k} D_{KL}(\mathbf{v}_j, \bar{\mathbf{v}}_k), \tag{6.24}$$
$$\bar{\tilde{D}}(g_k) = \frac{\sum_{k=1}^{K} \tilde{D}(g_k)}{K}. \tag{6.25}$$
A small value of $\tilde{D}(g_k)$ indicates that the acoustic-phonetic properties of $g_k$ are highly consistent in the training speech. It must be noted that $D^*(u_r)$ computed for a discovered subword unit and $\tilde{D}(g_{k^*})$ computed for the corresponding ground-truth phone are comparable, as both of them measure the deviation of a class of posterior feature vectors from the centroid of the same phone class $g_{k^*}$, in the same phonetic space. $\tilde{D}(g_{k^*})$ is calculated from the ground-truth transcription and is independent of the clustering results. Therefore $\tilde{D}(g_{k^*})$ could be a good reference for $D^*(u_r)$.

The KL divergence metric in this thesis is not only applicable to posterior features extracted from phone recognizers, but also to conventional spectral features like MFCCs, or to DNN-based representations such as BNFs.

Experiments on unsupervised unit discovery are carried out with the OGI Multi-language Telephone Speech Corpus (OGI-MTS) [127]. The spontaneous story-telling part of this corpus is used. Five languages are involved: German (GE), Hindi (HI), Japanese (JA), Mandarin (MA) and Spanish (SP). In addition to audio signals, the database provides manual time alignment at the phone level for each utterance. Table 6.1 summarizes the amount of audio data and the number of phone units (including a silence unit) in each language.

The OGI-MTS database is chosen for several reasons. The present study is focused on the methodology design for unsupervised unit discovery of an unknown
Table 6.1: Multilingual speech data from the OGI-MTS corpus.

Language            GE     HI     JA     MA     SP
Duration (hours)    0.31   0.95   0.86   0.57   1.
No. phone units     43     46     29     44     38

language. The exact identities of the target languages do not matter. The key assumption is that no labeled data is available. On the other hand, ground-truth linguistic knowledge and reliably transcribed and aligned test data are necessary for performance evaluation purposes. Such resources are generally not available for real zero-resource languages. In terms of speaking style, the story-telling speech in OGI-MTS is considered a good match to the real-world data that one might be able to collect for a zero-resource language. Furthermore, the use of multiple languages leads to a wider coverage of phonetic variation and makes the experimental study representative and convincing.

The study in [25] reports experiments on unsupervised unit discovery with the same database. Our results can thus be compared with [25] so as to better understand the effectiveness of using language-mismatched phone recognizers.
Evaluation metric
Purity is a commonly used evaluation metric that measures the degree to which the results of a clustering process are in accordance with ground-truth classification labels. Let $G = \{G_1, G_2, \ldots, G_R\}$ denote a set of $R$ clusters, and $G' = \{G'_1, G'_2, \ldots, G'_{R'}\}$ a set of $R'$ ground-truth phones. Let $n_{r,r'}$ be the number of frames assigned to the $r$-th cluster and labeled as the $r'$-th phone in the reference transcription. The purity value associated with the $r$-th cluster is defined as,
$$\mathrm{purity}(r) = \frac{\max_{r' \in \{1, 2, \ldots, R'\}} n_{r,r'}}{\sum_{r'=1}^{R'} n_{r,r'}}. \tag{6.26}$$
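As a concrete reference, here is a minimal NumPy sketch of the per-cluster purity of Equation (6.26) and the overall purity of Equation (6.27) given below. The count matrix `n` is an assumed input whose entry `n[r, r']` is defined in the text, with silence frames already removed as described later in this section.

```python
# Per-cluster purity, Eq. (6.26), and overall purity, Eq. (6.27),
# from a count matrix n of shape (R, R').
import numpy as np

def purity_scores(n):
    per_cluster = n.max(axis=1) / n.sum(axis=1)   # Eq. (6.26)
    overall = n.max(axis=1).sum() / n.sum()       # Eq. (6.27)
    return per_cluster, overall

# e.g. two clusters, three ground-truth phones:
n = np.array([[90, 5, 5], [10, 70, 20]])
pc, ov = purity_scores(n)    # pc = [0.9, 0.7], ov = 0.8
```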
Table 6.2: Purity values obtained by exploiting a single recognizer. (Cells marked "–" were illegible in the source.)

         Phone recognizer adopted
         CZ      HU      RU      Average
GE       0.409   0.378   0.408   –
HI       0.443   0.408   0.408   –
JA       0.502   0.500   –       –
MA       0.381   0.412   0.345   –
SP       0.500   0.467   0.488   –
Average  0.432   0.416   0.416   –

High purity values are desirable, since they indicate that a large proportion of the frames in a cluster come from the same phone. By averaging the purity values of the $R$ clusters, the overall purity is computed as,
$$\mathrm{purity} = \frac{\sum_{r=1}^{R} \max_{r' \in \{1, 2, \ldots, R'\}} n_{r,r'}}{\sum_{r=1}^{R} \sum_{r'=1}^{R'} n_{r,r'}}. \tag{6.27}$$
In the phone recognizers, the silence part of speech is modeled as a phone. As silence segments occur frequently and can be accurately detected, the computed purity values tend to be biased. In this study, the silence labels were removed according to the reference transcriptions at the beginning of the experiments.

Language-mismatched phone recognizers
Four phone recognizers, for Czech (CZ), Hungarian (HU), Russian (RU) and Cantonese (CA), are used as language-mismatched phone recognizers. The CZ, HU and RU recognizers [117], each modeling a different number of phones, were described in Section 5.3.1. The CA recognizer is trained with settings similar to those in Section 4.5.1.

Results with a single recognizer
The purity values of the clustering results obtained with one of the CZ, HU and RU recognizers are given in Table 6.2. In this part of the experiments, the cluster number $R$ is set equal to the number of ground-truth phones modeled by the respective recognizer. The following observations are made:

1. For the same target language, the purity values achieved do not differ much
Table 6.3: Purity values obtained by jointly exploiting the CZ, HU and RU recognizers w.r.t. cluster number R. Colored values in the last two columns denote better (red) or worse (blue) performance compared to the proposed approach. (Cells marked "–" were illegible in the source.)

         R = 50  R = 60  R = 70  R = 80  Wang et al. [25]  Single recognizer
GE       0.418   –       –       0.427   0.403             –
HI       0.439   0.406   –       –       0.457             –
JA       0.508   –       –       0.510   0.520             –
MA       0.384   0.386   –       –       0.367             –
SP       0.499   0.499   –       –       0.549             –
Average  0.435   0.430   –       –       0.443             –
across different phone recognizers. The purity values depend mainly on the target language: the relative magnitudes of purity across the five target languages are consistent from one recognizer to another. For example, the purity values for German lie between 0.378 and 0.409, while those for Japanese are all around 0.50;

2. The purity values attained for Japanese are consistently the highest. Theoretically, the number of ground-truth phones in a language is negatively correlated with purity. Japanese has only 29 phones, significantly fewer than the other four languages. Spanish achieves the second highest purity with the second smallest phone inventory of 38 units.

Results with multiple recognizers
The purity values obtained by the joint use of the CZ, HU and RU recognizers are given in Table 6.3. In this part, the initial segmentation results are generated by applying Algorithm 6.1 to the three recognizers' decoding results. Different cluster numbers $R$ (from 50 to 80) are attempted. Experimental results on the same task in [25] are provided for reference. The last column of the table contains the average purity value achieved by using a single recognizer (i.e., the last column of Table 6.2). From Table 6.3, the following observations can be made:

1. The best purity value attained with multiple recognizers is higher than that with any single recognizer alone. Among the five target languages, HI shows the most significant improvement and MA benefits the least;

2. The cluster number $R$ has little influence on the result. For best performance,
Table 6.4: Purity values obtained by jointly exploiting the CZ, HU, RU and CA recognizers w.r.t. cluster number R. This system is used for the analysis of linguistic relevance.

      R = 50  R = 60  R = 70  R = 80  R = 90  Average
GE    0.428   0.433   0.439   0.437   0.437   0.435
HI    0.494   0.499   0.492   0.489   0.485   0.492
JA    0.545   0.556   0.554   0.543   0.540   0.548
MA    0.414   0.425   0.434   0.426   0.426   0.425
SP    0.556   0.586   0.576   0.568   0.573   0.572

$R$ should be set in the range of 70 to 80. In practical applications, the cluster number can either be pre-determined or empirically tuned on development data;

3. With the same database and the same evaluation metric, the proposed use of multiple phone recognizers gives comparable performance to that reported in [25]. For GE and MA, our method achieves better results. A multi-view segment clustering (MSC) algorithm was used in [25] to exploit multiple feature representations simultaneously. Our method is considered to have a simpler implementation.

Unit discovery system establishment
In this section, the relation between automatically discovered subword units and ground-truth phones is analyzed. A unit discovery system is established beforehand to provide discovered subword units for analysis. This system is developed based mainly on the settings described in Section 6.3.2, with only a few exceptions: (1) the CA phone recognizer is used in addition to CZ, HU and RU; (2) the cluster number $R$ ranges from 50 to 90.

Table 6.4 summarizes the purity values of the unit discovery system. Compared with Table 6.3, it is found that the purity values obtained by exploiting four language-mismatched phone recognizers are higher than those obtained by exploiting three recognizers, for all of the test languages.
Table 6.5: $\bar{D}^*(u_r)/\bar{D}^{**}(u_r)$ and $\bar{\tilde{D}}(g_k)$ computed from the discovered subword units.

      $\bar{D}^*(u_r)/\bar{D}^{**}(u_r)$                                $\bar{\tilde{D}}(g_k)$
      R = 50     R = 60     R = 70     R = 80     R = 90
GE    27.4/30.1  27.3/30.0  27.3/30.1  27.5/30.3  27.4/30.3   27.8
HI    27.6/31.5  27.6/31.6  27.8/31.7  28.0/31.7  28.1/31.8   28.5
JA    28.5/34.1  28.0/33.2  28.1/33.2  28.3/33.4  28.6/34.1   27.9
MA    29.3/31.2  29.3/31.5  29.0/31.0  29.5/31.6  29.4/31.3   29.1
SP    28.5/31.9  27.8/32.2  27.9/32.3  28.1/32.5  28.0/32.4   28.0
Figure 6.4: $\overline{\Delta D^*}(u_r)$ (left) and purity values (right) computed from the discovered subword units, shown per target language (GE, HI, JA, MA, SP).

Results and analysis
For each discovered subword unit $u_r$, Equations (6.14), (6.16) and (6.17) are used to determine its closest ground-truth phone $g^*(u_r)$ and the distance $D^*(u_r)$ between them. Equation (6.18) is used to determine the second closest ground-truth phone $g^{**}(u_r)$ and the distance $D^{**}(u_r)$. For each ground-truth phone $g_k$, Equation (6.24) is used to calculate the inherent variability $\tilde{D}(g_k)$. The averages $\bar{D}^*(u_r)$, $\bar{D}^{**}(u_r)$ and $\bar{\tilde{D}}(g_k)$, computed by Equations (6.21), (6.22) and (6.25), are summarized in Table 6.5. It is observed that the average KL divergence is not sensitive to the cluster number $R$.

Figure 6.4 compares $\overline{\Delta D^*}(u_r)$ (computed by Equation (6.23)) and the average purity values for the five target languages. From Table 6.5 and Figure 6.4, the following observations are made:

1. $\bar{D}^*(u_r)$ is smaller than or approximately equal to $\bar{\tilde{D}}(g_k)$ for all the target
Table 6.6: Uncovered/total numbers of vowels and consonants (R = 90). (Cells marked "–" were illegible in the source.)

             GE     HI     JA     MA     SP
Vowels       –/18   0/13   0/7    2/17   1/–
Consonants   –/24   3/32   0/21   4/26   1/–
languages. In other words, the deviation between a discovered subword unit and its closest ground-truth phone is comparable to, if not smaller than, the inherent variability of the phone itself. In fact, the ground-truth phones are labeled based on the auditory perception of linguistic experts, whereas the automatically discovered units, as well as the proposed KL divergence metric, are totally data-driven.

2. $\overline{\Delta D^*}(u_r)$ for the different languages has the same trend as the purity values. This observation is consistent with our expectation, as a larger $\overline{\Delta D^*}(u_r)$ implies that a discovered subword unit is mapped to its closest ground-truth phone with higher confidence, which naturally leads to higher purity.

We are interested to understand more about the phonetic coverage of the automatically discovered subword units. Each subword unit corresponds to one best-matching ground-truth phone based on Equation (6.16). If a ground-truth phone fails to be selected as the best-matching phone for any of the discovered subword units, it is considered as not being covered. On the contrary, if a ground-truth phone is selected as the best-matching phone for at least one discovered unit, it is considered as being covered.

Table 6.6 shows the counts of uncovered vowels and consonants for each target language with R = 90. It can be seen that most of the linguistically-defined phones are covered in the process of unsupervised unit discovery. Particularly, in the case of Japanese, all phones are covered. However, there are quite a few phones of Mandarin that are not covered by the discovered units. The missing vowels and consonants are listed in Table 6.7, where each phone label is followed by its corresponding IPA transcription (for example, /aa/ is the label used in the OGI-MTS database [127], and its IPA transcription is /a/). It is interesting to see that the majority of the missing consonants are unvoiced plosives. These consonants have strong transitory characteristics, i.e., rapidly changing spectral properties. In the segmentation process
Table 6.7: Mandarin phones that are not covered by automatically discovered subword units (R = 90).

Vowels:      /aa/ (/a/), /er/ (/ɚ/)
Consonants:  /kh/ (/kʰ/), /ph/ (/pʰ/), /r/ (/ɻ/), /tH/ (/tʰ/)

Table 6.8: Discovered subword units mapped to /uw/ with R = 50 and 90.

                                  R = 50              R = 90
Cluster                           33    41    25      49    29    35
Closest phone (by $D^*$)          /uw/  /uw/  /uw/    /uw/  /uw/  /uw/
Second closest (by $D^{**}$)      /iy/  /ey/  /iy/    /iy/  /ey/  /ey/

of unsupervised acoustic modeling, it is assumed that individual frames in the same segment have similar spectral properties. This assumption is not valid for transitory phones. Similarly, the missing vowel /er/, known as Erhuayin [128] in Mandarin, also has transitory properties. This inspires us to investigate alternative features and segment representations which could capture the trajectory characteristics of phones.

It is not expected that the vowel /aa/ is missed. Although /aa/ is not selected as the closest phone to any of the subword units, it is actually identified as the second closest phone to two different discovered subword units. These two units correspond to /ae/ (/a/) and /aw/ (/au/) according to the KL divergence. In the transcription of the Mandarin speech in the OGI-MTS database, /aa/ is used to label the vowel nucleus in the Pinyin finals /a/ and /ang/, while /ae/ is used to label the vowel nucleus in the Pinyin final /an/ [129]. The two vowel nuclei are actually very similar in articulation. From this perspective, /aa/ is not really a missing phone.

It must be noted that the identities of the uncovered ground-truth phones also depend on experimental configurations, such as the initialization of clustering, the cluster number, etc.

For some of the discovered subword units, the value of $D^{**}(u_r)$ is nearly the same as $D^*(u_r)$. In other words, such a discovered subword unit matches equally well with two different phones. This kind of confusion can be alleviated by increasing $R$. Table 6.8 gives an example of confusion among a few Japanese vowels. With R = 50, there are clusters for which /uw/ (/ɯ/) and /iy/ (/i/) are the closest and second closest phones, respectively.
A similar observation can be made for /uw/ and /ey/ (/e/). When R is increased to 90, the confusion is significantly alleviated. A larger R leads to smaller and finer clusters; the learned clusters that contain segments of multiple ground-truth phones therefore tend to split and form linguistically more explicit subword units.

This chapter presents our work on the task of unsupervised unit discovery. A new approach is proposed to assist unit discovery by exploiting out-of-domain language-mismatched phone recognizers for initial segmentation and segment labeling. While existing segmentation approaches rely on spectral discontinuities of the acoustic signal, our approach uses multiple phone recognizers to decode and segment speech. The recognizers provide the posteriorgram representation for segment clustering and labeling.

Investigation of the linguistic relevance of automatically discovered subword units is another research focus of this chapter. A symmetric KL divergence metric is defined and used to measure the distance between each pair of subword unit and ground-truth phone.

Experiments are carried out with the OGI Multi-language Telephone Speech (OGI-MTS) corpus. Experimental results demonstrate that out-of-domain phone recognizers are effective in segmentation and segment labeling. The results are insensitive to the cluster number, i.e., the number of discovered subword units. Increasing the number of phone recognizers is beneficial to unit discovery performance. Our best performance in terms of purity is comparable to that reported in [25], while our approach is relatively simple in implementation.

Experimental results also show that our KL divergence-based evaluation metric is consistent with purity. The deviation between a discovered unit and its closest ground-truth phone is comparable to the inherent variability of the phone. While in general the unit discovery results give a good coverage of the linguistically-defined phones, there are a few exceptions, e.g. /er/ in Mandarin, probably due to the limited feature representation capability of the adopted unit discovery system. The confusion
of a discovered unit between ground-truth phones can be alleviated with a large cluster number. Further investigation is needed to apply alternative features and segment representations to better capture the trajectory characteristics of phones.

Chapter 7

Conclusion and Future Work
This research investigates unsupervised acoustic modeling for zero-resource languages. It is assumed that only untranscribed speech data is available, and that linguistic knowledge about the target language is absent. This problem is essential in spoken language technology applications, particularly for the many languages that have very limited or no linguistic resources.

There are two research problems tackled in this study. The first problem is the automatic discovery of fundamental speech units (e.g., subword units) of a language. The second problem concerns the learning of frame-level feature representations that are robust to linguistically-irrelevant variations. Various approaches are proposed to exploit out-of-domain language resources to improve the modeling of in-domain zero-resource languages.

Towards the goal of unsupervised subword modeling, this thesis has made contributions in the following aspects.

• Speaker adaptation approaches are applied and evaluated extensively for learning speaker-invariant features in the unsupervised scenario. The proposed methods include fMLLR estimation with out-of-domain ASR, disentangled speech representation learning, and speaker adversarial training. Combinations of these approaches are also investigated.

• New approaches to frame labeling are proposed to improve the efficacy of supervised model training. The approaches include DPGMM-HMM frame labeling and out-of-domain ASR decoding. A label filtering algorithm is proposed to improve DPGMM-HMM frame labeling by removing possibly erroneous labels. Multiple types of frame labels are jointly applied under a multi-task learning (MTL) framework to achieve informative learning of feature representations.

Towards unsupervised unit discovery, this thesis has made contributions in the following aspects.

• Language-mismatched phone recognizers are exploited in the acoustic segment modeling (ASM) framework for unsupervised unit discovery. The recognizers are utilized to generate the initial segmentation and to perform segment labeling.

• A symmetric KL divergence metric is proposed for the analysis of the linguistic relevance of discovered subword units.

In Chapter 4, speaker adaptation approaches to learning speaker-invariant features are presented. The fMLLR approach exploits an out-of-domain ASR system to estimate speaker-specific fMLLR transforms. Disentangled speech representation learning trains an FHVAE to disentangle phonetic information from speaker variation. Speaker adversarial training adds an adversarial task into the MTL-DNN model, forcing the hidden representation to carry less speaker information. Experimental results on ZeroSpeech 2017 demonstrate the effectiveness of all the approaches. The fMLLR approach achieves the most significant performance improvement compared to the baseline, demonstrating the efficacy of out-of-domain ASR systems in speaker-adapted feature learning. MFCC features reconstructed by disentangled representation learning are shown to capture less speaker-dependent information than the original ones, and are beneficial to improving DNN-BNF based subword modeling. The hidden representation learned with speaker adversarial training contains less speaker-dependent information compared to that without adversarial training. Combining out-of-domain ASR based adaptation and adversarial training contributes further improvement, with which our best performance is achieved.
Adversarial training is not complementary to disentangled representation learning.

In Chapter 5, frame labeling approaches for improving unsupervised subword modeling are studied. The advantage of DPGMM-HMM frame labeling over conventional DPGMM clustering is that the DPGMM-HMM could model contextual information in speech frames. The proposed label filtering algorithm discards unreliable labels from DPGMM clustering. Frame labeling is also achieved by out-of-domain ASR decoding. Experimental results on ZeroSpeech 2017 demonstrate the advantage of DPGMM-HMM labels over DPGMM labels, and that label filtering could further improve the DPGMM-HMM approach. The system trained with both out-of-domain ASR based labels and DPGMM-HMM labels achieves better performance than that trained with either type of labels alone. Combining different types of BNFs by vector concatenation leads to further performance improvement. The best performance achieved by our proposed approaches, in terms of across-speaker ABX error rate, is equal to that of the best submitted system in ZeroSpeech 2017 and better than other recently reported systems.

In Chapter 6, the use of out-of-domain language-mismatched phone recognizers for unsupervised unit discovery is presented. Multiple language-mismatched phone recognizers are used to decode and segment target speech utterances, and to generate phone posteriorgrams for segment clustering and labeling. Our proposed KL divergence based metric for analyzing the linguistic relevance of discovered subword units is studied. Experimental results on the OGI-MTS corpus demonstrate the efficacy of out-of-domain language-mismatched phone recognizers in unit discovery. Increasing the number of recognizers is beneficial to performance improvement. Our best system in terms of purity is comparable to the study in [25], while our approach is relatively simpler in implementation. Experiments also show that the KL divergence-based evaluation metric is consistent with purity. The deviation between a discovered unit and its closest ground-truth phone is comparable to the inherent variability of the phone. While in general the unit discovery results give a good coverage of the linguistically-defined phones, there are a few exceptions, e.g. /er/ in Mandarin, probably due to the limited feature representation capacity of the current discovery system. The confusion between ground-truth phones can be alleviated with a large cluster number.

We suggest the following directions for future work:
Pair-wise learning in unsupervised subword modeling
It has been shown by previous studies that pair-wise information is beneficial to learning speaker-invariant and subword-discriminative feature representations. There has been no study investigating and comparing the pair-wise learning approach with other approaches to speaker-invariant feature learning. It is believed that our best system could be further improved by incorporating pair-wise learning, either by using a cascade system or by adding a triplet loss to the cross-entropy loss function in our MTL-DNN framework.
Unsupervised lexical modeling and ASR
The present study is focused on discovering and modeling the basic subword units of zero-resource languages. Future work may move a step further to lexical modeling and language modeling, and to developing an integrated ASR system for zero-resource languages. Recently, there have been studies on unsupervised lexical modeling [22], unsupervised ASR [12, 13, 22], and unsupervised large-vocabulary ASR [130]. The Bayesian model [130] and the generative adversarial network (GAN) [12] are two possible frameworks. It is believed that our findings on unsupervised subword modeling and unit discovery serve as a basis for improving unsupervised lexical modeling and ASR.

Bibliography

[1] H. Shibata, T. Kato, T. Shinozaki, and S. Watanabe, “Composite embedding systems for ZeroSpeech2017 Track 1,” in
Proc. ASRU , pp. 747–753, 2017.[2] L. R. Rabiner, “A tutorial on hidden Markov models and selected applicationsin speech recognition,”
Proceedings of the IEEE , vol. 77, no. 2, pp. 257–286,1989.[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neuralnetworks for acoustic modeling in speech recognition: the shared views of fourresearch groups,”
IEEE Signal Processing Magazine , vol. 29, no. 6, pp. 82–97,2012.[4] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in
Proc. ICASSP ,pp. 4945–4949, 2016.[5] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recur-rent neural networks,” in
Proc. ICML , pp. 1764–1772, 2014.[6] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “HybridCTC/attention architecture for end-to-end speech recognition,”
J. Sel. TopicsSignal Processing , vol. 11, no. 8, pp. 1240–1253, 2017.[7] P. K. Austin and J. Sallabank,
The Cambridge handbook of endangered lan-guages . Cambridge University Press, 2011. [8] F. Wessel and H. Ney, “Unsupervised training of acoustic models for largevocabulary continuous speech recognition,”
IEEE Trans. SAP , vol. 13, no. 1,pp. 23–31, 2004.[9] K. Veselỳ, M. Hannemann, and L. Burget, “Semi-supervised training of deepneural networks,” in
Proc. ASRU , pp. 267–272, 2013.[10] F. Grezl and M. Karafiát, “Semi-supervised bootstrapping approach for neuralnetwork feature extractor training,” in
Proc. ASRU , pp. 470–475, 2013.[11] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen,and E. Dupoux, “The zero resource speech challenge 2015.,” in
Proc. INTER-SPEECH , pp. 3169–3173, 2015.[12] D.-R. Liu, K.-Y. Chen, H.-Y. Lee, and L.-S. Lee, “Completely unsupervisedphoneme recognition by adversarially learning mapping relationships from au-dio embeddings,”
Proc. INTERSPEECH , pp. 3748–3752, 2018.[13] K.-Y. Chen, C.-P. Tsai, D.-R. Liu, H.-Y. Lee, and L.-S. Lee, “Completely unsu-pervised phoneme recognition by a generative adversarial network harmonizedwith iteratively refined hidden Markov models,” in
Proc. INTERSPEECH ,pp. 1856–1860, 2019.[14] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier,X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in
Proc. ASRU , pp. 323–330, 2017.[15] L.-S. Lee, J. Glass, H.-Y. Lee, and C.-A. Chan, “Spoken content retrieval — beyond cascading speech recognition with text retrieval,” IEEE/ACM Trans.ASLP , vol. 23, no. 9, pp. 1389–1420, 2015.[16] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “A graph-based Gaussiancomponent clustering approach to unsupervised acoustic modeling,” in
Proc.INTERSPEECH , pp. 875–879, 2014.[17] M.-H. Siu, H. Gish, A. Chan, and W. Belfield, “Improved topic classificationand keyword discovery using an HMM-based speech recognizer trained withoutsupervision,” in
Proc. INTERSPEECH , pp. 2838–2841, 2010. [18] M.-h. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, “Unsupervised trainingof an HMM-based self-organizing unit recognizer with applications to topicclassification and keyword discovery,”
Computer Speech & Language , vol. 28,no. 1, pp. 210–223, 2014.[19] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesianhidden Markov model variational autoencoder for acoustic unit discovery,” in
Proc. INTERSPEECH , pp. 2688–2692, 2018.[20] L. Ondel, L. Burget, and J. Černockỳ, “Variational inference for acoustic unitdiscovery,”
Proc. SLTU , vol. 81, pp. 80–86, 2016.[21] C.-Y. Lee and J. Glass, “A nonparametric Bayesian approach to acoustic modeldiscovery,” in
Proc. ACL , pp. 40–49, 2012.[22] H. Kamper, A. Jansen, and S. Goldwater, “Unsupervised word segmentationand lexicon discovery using acoustic word embeddings,”
IEEE/ACM Trans.ASLP , vol. 24, no. 4, pp. 669–679, 2016.[23] C.-Y. Lee, T. J. O’Donnell, and J. R. Glass, “Unsupervised lexicon discoveryfrom acoustic input,”
TACL , vol. 3, pp. 389–403, 2015.[24] S. Feng and T. Lee, “On the linguistic relevance of speech units learned byunsupervised acoustic modeling,” in
Proc. INTERSPEECH , pp. 2068–2072,2017.[25] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Acoustic segment modelingwith spectral clustering methods,”
IEEE/ACM Trans. ASLP , vol. 23, no. 2,pp. 264–277, 2015.[26] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised bottleneckfeatures for low-resource query-by-example spoken term detection,” in
Proc.INTERSPEECH , pp. 923–927, 2016.[27] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of Dirichletprocess Gaussian mixture models for unsupervised acoustic modeling: a feasi-bility study,” in
Proc. INTERSPEECH , pp. 3189–3193, 2015. [28] J. Mamou, B. Ramabhadran, and O. Siohan, “Vocabulary independent spokenterm detection,” in
Proc. SIGIR , pp. 615–622, 2007.[29] D. Ram, L. Miculicich, and H. Bourlard, “Multilingual bottleneck features forquery by example spoken term detection,” in
Proc. ASRU , pp. 621–628, 2019.[30] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle-neckfeature learning from untranscribed speech,” in
Proc. ASRU , pp. 727–733,2017.[31] T. K. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning meth-ods for unsupervised acoustic modeling - LEAP submission to ZeroSpeechchallenge 2017,” in
Proc. ASRU , pp. 754–761, 2017.[32] M. Heck, S. Sakti, and S. Nakamura, “Feature optimized DPGMM clusteringfor unsupervised subword modeling: A contribution to ZeroSpeech 2017,” in
Proc. ASRU , pp. 740–746, 2017.[33] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Pairwise learn-ing using multi-lingual bottleneck features for low-resource query-by-examplespoken term detection,” in
Proc. ICASSP , pp. 5645–5649, 2017.[34] T. Tsuchiya, N. Tawara, T. Ogawa, and T. Kobayashi, “Speaker invariantfeature extraction for zero-resource languages with adversarial learning,” in
Proc. ICASSP , pp. 2381–2385, 2018.[35] L. Badino, A. Mereta, and L. Rosasco, “Discovering discrete subword unitswith binarized autoencoders and hidden-Markov-model encoders,” in
Proc.INTERSPEECH , pp. 3174–3178, 2015.[36] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A comparison ofneural network methods for unsupervised representation learning on the zeroresource speech challenge,” in
Proc. INTERSPEECH , pp. 3199–3203, 2015.[37] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, “Unsupervisedspeech representation learning using Wavenet autoencoders,” arXiv preprintarXiv:1901.08810 , 2019. [38] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting bottle-neck features and word-like pairs from untranscribed speech for feature repre-sentations,” in
Proc. ASRU , pp. 734–739, 2017.[39] G. Synnaeve and E. Dupoux, “Weakly supervised multi-embeddings learningof acoustic models,” arXiv preprint arXiv:1412.6645 , 2014.[40] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neuralnetwork based feature extraction using weak top-down constraints,” in
Proc.ICASSP , pp. 5818–5822, 2015.[41] E. Hermann, H. Kamper, and S. Goldwater, “Multilingual and unsu-pervised subword modeling for zero-resource languages,” arXiv preprintarXiv:1811.04791 , 2018.[42] A. Jansen and B. Van Durme, “Efficient spoken term discovery using random-ized algorithms,” in
Proc. ASRU , pp. 401–406, 2011.[43] R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, “A hy-brid dynamic time warping-deep neural network architecture for unsupervisedacoustic modeling,” in
Proc. INTERSPEECH , pp. 3179–3183, 2015.[44] M. Heck, S. Sakti, and S. Nakamura, “Supervised learning of acoustic modelsin a zero-resource setting to improve DPGMM clustering,” in
Proc. INTER-SPEECH , pp. 1310–1314, 2016.[45] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant anal-ysis for supporting DPGMM clustering in the zero-resource scenario,” in
Proc.SLTU , pp. 73–79, 2016.[46] M. Heck, S. Sakti, and S. Nakamura, “Iterative training of a DPGMM-HMMacoustic unit recognizer in a zero-resource scenario,” in
Proc. SLT , pp. 57–63,2016.[47] E. Eide and H. Gish, “A parametric approach to vocal tract length normaliza-tion,” in
Proc. ICASSP , vol. 1, pp. 346–348, 1996. [48] N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, “A deep scatteringspectrum-deep siamese network pipeline for unsupervised acoustic modeling,”in
Proc. ICASSP , pp. 4965–4969, 2016.[49] T. K. Ansari, R. Kumar, S. Singh, S. Ganapathy, and S. Devi, “Unsuper-vised HMM posteriograms for language independent acoustic modeling in zeroresource conditions,” in
Proc. ASRU , pp. 762–768, 2017.[50] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models usingsub-cluster splits,” in
Advances in NIPS , pp. 620–628, 2013.[51] T. Pellegrini, C. Manenti, and J. Pinquier, “The IRIT-UPS system @ Ze-roSpeech 2017 Track 1 : unsupervised subword modeling,”
Technical report ,2017.[52] S. Feng, T. Lee, and H. Wang, “Exploiting language-mismatched phonemerecognizers for unsupervised acoustic modeling,” in
Proc. ISCSLP , pp. 1–5,2016.[53] C.-H. Lee, F. K. Soong, and B.-H. Juang, “A segment model based approachto speech recognition,” in
Proc. ICASSP , pp. 501–504, 1988.[54] L. Ondel, H. K. Vydana, L. Burget, and J. Černocký, “Bayesian SubspaceHidden Markov Model for Acoustic Unit Discovery,” in
Proc. INTERSPEECH ,pp. 261–265, 2019.[55] T. Svendsen and F. Soong, “On the automatic segmentation of speech signals,”in
Proc. ICASSP , vol. 12, pp. 77–80, 1987.[56] Y. Qiao, N. Shimomura, and N. Minematsu, “Unsupervised optimal phonemesegmentation: objectives, algorithm and comparisons,” in
Proc. ICASSP ,pp. 3989–3992, 2008.[57] Y. Pereiro Estevan, V. Wan, and O. Scharenborg, “Finding maximum marginsegments in speech,” in
Proc. ICASSP , vol. 4, pp. 937–940, 2007.[58] O. Scharenborg, V. Wan, and M. Ernestus, “Unsupervised speech segmenta-tion: An analysis of the hypothesized phone boundaries,”
The Journal of theAcoustical Society of America , vol. 127, no. 2, pp. 1084–1095, 2010. [59] A. H. H. N. Torbati, J. Picone, and M. Sobel, “Speech acoustic unit seg-mentation using hierarchical Dirichlet processes.,” in
Proc. INTERSPEECH ,pp. 637–641, 2013.[60] S. Dusan and L. R. Rabiner, “On the relation between maximum spectraltransition positions and phone boundaries,” in
Proc. INTERSPEECH , 2006.[61] M. Vetter, M. Müller, F. Hamlaoui, G. Neubig, S. Nakamura, S. Stüker, andA. Waibel, “Unsupervised phoneme segmentation of previously unseen lan-guages,” in
Proc. INTERSPEECH , pp. 3544–3548, 2016.[62] P. Michel, O. Räsänen, R. Thiollière, and E. Dupoux, “Blind phoneme seg-mentation with temporal prediction errors,” in
Proc. ACL , pp. 62–68, 2017.[63] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li, “An acoustic segment model-ing approach to query-by-example spoken term detection,” in
Proc. ICASSP ,pp. 5157–5160, 2012.[64] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Unsupervised mining ofacoustic subword units with segment-level Gaussian posteriorgrams,” in
Proc.INTERSPEECH , pp. 2297–2301, 2013.[65] M.-L. Sung, S. Feng, and T. Lee, “Unsupervised pattern discovery from the-matic speech archives based on multilingual bottleneck features,” in
Proc. AP-SIPA , pp. 1448–1455, 2018.[66] L. Ondel, L. Burget, J. Cernocký, and S. Kesiraju, “Bayesian phonotacticlanguage model for acoustic unit discovery,” in
Proc. ICASSP , pp. 5750–5754,2017.[67] L. Ondel, P. Godard, L. Besacier, E. Larsen, M. Hasegawa-Johnson,O. Scharenborg, E. Dupoux, L. Burget, F. Yvon, and S. Khudanpur, “Bayesianmodels for unit discovery on a very low resource language,” in
Proc. ICASSP ,pp. 5939–5943, 2018.[68] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj,“Hidden Markov model variational autoencoder for acoustic unit discovery,”in
Proc. INTERSPEECH , pp. 488–492, 2017. [69] A. Jansen, S. Thomas, and H. Hermansky, “Weak top-down constraints forunsupervised acoustic model training.,” in
Proc. ICASSP , pp. 8091–8095, 2013.[70] T. J. Hazen, M.-H. Siu, H. Gish, S. Lowe, and A. Chan, “Topic modeling forspoken documents using only phonetic information,” in
Proc. ASRU , pp. 395–400, 2011.[71] C.-T. Chung, C.-A. Chan, and L.-S. Lee, “Unsupervised discovery of linguisticstructure including two-level acoustic patterns using three cascaded stages ofiterative optimization,” in
Proc. ICASSP , pp. 8081–8085, 2013.[72] C.-T. Chung, C.-Y. Tsai, H.-H. Lu, C.-H. Liu, H.-Y. Lee, and L.-S. Lee, “Aniterative deep learning framework for unsupervised discovery of speech fea-tures and linguistic units with applications on spoken term detection,” in
Proc.ASRU , pp. 245–251, 2015.[73] A. Jansen and K. Church, “Towards unsupervised training of speaker inde-pendent acoustic models,” in
Proc. INTERSPEECH , pp. 1693–1692, 2011.[74] J. Sethuraman, “A constructive definition of Dirichlet priors,”
Statistica Sinica ,pp. 639–650, 1994.[75] M. West and J. Harrison,
Multivariate Modelling and Forecasting , pp. 597–651.New York, NY: Springer New York, 1989.[76] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixturemodels,”
Journal of Computational and Graphical Statistics , vol. 9, no. 2,pp. 249–265, 2000.[77] S. Jain and R. M. Neal, “A split-merge Markov chain Monte Carlo proce-dure for the Dirichlet process mixture model,”
Journal of Computational andGraphical Statistics , vol. 13, no. 1, pp. 158–182, 2004.[78] D. M. Blei, M. I. Jordan, et al. , “Variational inference for Dirichlet processmixtures,”
Bayesian analysis , vol. 1, no. 1, pp. 121–143, 2006.[79] K. Kurihara, M. Welling, and Y. W. Teh, “Collapsed variational Dirichletprocess mixture models.,” in
IJCAI , vol. 7, pp. 2796–2801, 2007. [80] K. Kurihara, M. Welling, and N. Vlassis, “Accelerated variational Dirichletprocess mixtures,” in
Advances in NIPS , pp. 761–768, 2007.[81] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multi-variate Gaussian mixture observations of Markov chains,”
IEEE Trans. SAP ,vol. 2, no. 2, pp. 291–298, 1994.[82] F. Grézl, M. Karafiát, and L. Burget, “Investigation into bottle-neck featuresfor meeting speech recognition,” in
Proc. INTERSPEECH , pp. 2947–2950,2009.[83] K. Veselý, M. Karafiát, and F. Grézl, “Convolutive bottleneck network featuresfor LVCSR,” in
Proc. ASRU , pp. 42–47, 2011.[84] H. Xu, H. Su, E. S. Chng, and H. Li, “Semi-supervised training for bottle-neck feature based DNN-HMM hybrid systems,” in
Proc. INTERSPEECH ,pp. 2078–2082, 2014.[85] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, “The language-independent bottleneck features,” in
Proc. SLT , pp. 336–341, 2012.[86] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledgetransfer using multilingual deep neural network with shared hidden layers,” in
Proc. ICASSP , pp. 7304–7308, 2013.[87] R. Caruana, “Multitask learning,” in
Learning to learn , pp. 95–133, Springer,1998.[88] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Han-nemann, P. Motlicek, Y. Qian, P. Schwarz, et al. , “The Kaldi speech recogni-tion toolkit,” in
Proc. ASRU , 2011.[89] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-basedCSR corpus,” in
Proc. Workshop on Speech and Natural Language , pp. 357–362,1992.[90] T. Lee, W. K. Lo, P. C. Ching, and H. Meng, “Spoken language resources forCantonese speech processing,”
Speech Communication , vol. 36, no. 3, pp. 327–342, 2002. [91] M. J. Gales, “Maximum likelihood linear transformations for HMM-basedspeech recognition,”
Computer Speech & Language , vol. 12, no. 2, pp. 75–98,1998.[92] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compactmodel for speaker-adaptive training,” in
Proc. ICSLP , vol. 2, pp. 1137–1140,1996.[93] T. Anastasakos, J. McDonough, and J. Makhoul, “Speaker adaptive training:A maximum likelihood approach to speaker normalization,” in
Proc. ICASSP ,vol. 2, pp. 1043–1046, 1997.[94] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep neuralnetwork acoustic models using i-vectors,”
IEEE/ACM Trans. ASLP , vol. 23,no. 11, pp. 1938–1949, 2015.[95] X. Cui, V. Goel, and G. Saon, “Embedding-based speaker adaptive trainingof deep neural networks,” in
Proc. INTERSPEECH , pp. 122–126, 2017.[96] R. Haeb-Umbach and H. Ney, “Linear discriminant analysis for improved largevocabulary continuous speech recognition,” in
Proc. ICASSP , vol. 1, pp. 13–16,1992.[97] M. J. Gales, “Semi-tied covariance matrices for hidden Markov models,”
IEEETrans. SAP , vol. 7, no. 3, pp. 272–281, 1999.[98] W.-N. Hsu and J. R. Glass, “Extracting domain invariant features by unsu-pervised learning for robust automatic speech recognition,” in
Proc. ICASSP ,pp. 5614–5618, 2018.[99] W.-N. Hsu, Y. Zhang, and J. R. Glass, “Unsupervised learning of disentangledand interpretable representations from sequential data,” in
Advances in NIPS ,pp. 1876–1887, 2017.[100] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in
Proc.ICLR , 2014. [101] W.-N. Hsu, H. Tang, and J. R. Glass, “Unsupervised adaptation with inter-pretable disentangled representations for distant conversational speech recog-nition,” in
Proc. INTERSPEECH , pp. 1576–1580, 2018.[102] S. Shon, W.-N. Hsu, and J. Glass, “Unsupervised representation learning ofspeech for dialect identification,” in
Proc. SLT , pp. 105–111, 2018.[103] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. F.Juang, “Speaker-invariant training via adversarial learning,” in
Proc. ICASSP ,pp. 5969–5973, 2018.[104] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backprop-agation,” in
Proc. ICML , pp. 1180–1189, 2015.[105] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversar-ial training for accented speech recognition,” in
Proc. ICASSP , pp. 4854–4858,2018.[106] Z. Peng, S. Feng, and T. Lee, “Adversarial multi-task deep features and un-supervised back-end adaptation for language recognition,” in
Proc. ICASSP ,pp. 5961–5965, 2019.[107] J. Wang, Y. Qin, Z. Peng, and T. Lee, “Child speech disorder detection withsiamese recurrent network using speech attribute features,” in
Proc. INTER-SPEECH , pp. 3885–3889, 2019.[108] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in
Proc. IC-SLP , pp. 901–904, 2002.[109] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv ,vol. abs/1412.6980, 2014.[110] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,”
Journal ofmachine learning research , vol. 9, no. Nov, pp. 2579–2605, 2008.[111] B. Wu, S. Sakti, J. Zhang, and S. Nakamura, “Optimizing DPGMM clusteringin zero-resource setting based on functional load,” in
Proc. SLTU , pp. 1–5,2018. [112] M. Heck, S. Sakti, and S. Nakamura, “Dirichlet process mixture of mixturesmodel for unsupervised subword modeling,”
IEEE/ACM Trans. ASLP , vol. 26,no. 11, pp. 2027–2042, 2018.[113] S. Feng and T. Lee, “Exploiting speaker and phonetic diversity of mismatchedlanguage resources for unsupervised subword modeling,” in
Proc. INTER-SPEECH , pp. 2673–2677, 2018.[114] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrentneural network architectures for large-scale acoustic modeling.,” in
Proc. IN-TERSPEECH , pp. 338–342, 2014.[115] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition withdeep bidirectional LSTM,” in
Proc. ASRU , pp. 273–278, 2013.[116] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowl-edge transfer in DNN-based LVCSR,” in
Proc. SLT , pp. 246–251, 2012.[117] P. Schwarz, “Phoneme recognition based on long temporal context,”
PhD thesis, Brno University of Technology, 2009.[118] H. v. d. Heuvel, J. Boudy, Z. Bakcsi, J. Cernocky, V. Galunov, J. Kochanina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski, et al., “SpeechDat-E: Five eastern European speech databases for voice-operated teleservices completed,” in
Proc. INTERSPEECH , 2001.[119] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for on-linetraining of recurrent network trajectories,”
Neural computation , vol. 2, no. 4,pp. 490–501, 1990.[120] G. Aradilla, J. Vepa, and H. Bourlard, “Using posterior-based features in tem-plate matching for speech recognition,” in
Proc. INTERSPEECH , 2006.[121] S. Kullback and R. A. Leibler, “On information and sufficiency,”
The Annalsof Mathematical Statistics , vol. 22, no. 1, pp. 79–86, 1951.[122] G. Aradilla, J. Vepa, and H. Bourlard, “An acoustic model based on Kullback-Leibler divergence for posterior features,” in
Proc. ICASSP , vol. 4, pp. 657–660,2007. [123] G. Aradilla, H. Bourlard, and M. M. Doss, “Using KL-based acoustic models ina large vocabulary recognition task,” in
Proc. INTERSPEECH , pp. 928–931,2008.[124] T. Asami, R. Masumura, H. Masataki, M. Okamoto, and S. Sakauchi, “Train-ing data selection for acoustic modeling via submodular optimization of jointkullback-leibler divergence,” in
Proc. INTERSPEECH , pp. 3645–3649, 2015.[125] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN approach tocross-lingual TTS,” in
Proc. ICASSP , pp. 5515–5519, 2016.[126] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based ap-proach to voice conversion without parallel training sentences,” in
Proc. IN-TERSPEECH , pp. 287–291, 2016.[127] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, “The OGI multi-languagetelephone speech corpus.,” in
Proc. ICSLP , vol. 92, pp. 895–898, Citeseer,1992.[128] Wikipedia contributors, “Erhua — Wikipedia, the free encyclopedia.” https://en.wikipedia.org/w/index.php?title=Erhua&oldid=934025348 , 2020. [Online;accessed 20-March-2020].[129] Wikipedia, “Pinyin table — wikipedia, the free encyclopedia.” https://en.wikipedia.org/w/index.php?title=Pinyin_table&oldid=758841913 , 2017. [On-line; accessed 21-March-2017].[130] H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,”
Computer Speech & Language, vol. 46, pp. 154–174, 2017.

Appendix A
Published work
Journal article

1. Siyuan Feng and Tan Lee. Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling. In IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 12, pp. 2000–2011, 2019.

Peer-reviewed conference proceedings

1. Siyuan Feng and Tan Lee. 2019. Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation. In Proc. INTERSPEECH, 2019, pp. 1093–1097. [Oral]

2. Siyuan Feng, Tan Lee and Zhiyuan Peng. 2019. Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling. In Proc. INTERSPEECH, 2019, pp. 281–285. [Oral]

3. Siyuan Feng and Tan Lee. 2018. Exploiting speaker and phonetic diversity of mismatched language resources for unsupervised subword modeling. In Proc. INTERSPEECH, 2018, pp. 2673–2677. [Oral]

4. Siyuan Feng and Tan Lee. 2018. Improving cross-lingual knowledge transferability using multilingual TDNN-BLSTM with language-dependent pre-final layer. In Proc. INTERSPEECH, 2018, pp. 2439–2443. [Poster]

5. Siyuan Feng and Tan Lee. 2017. On the linguistic relevance of speech units learned by unsupervised acoustic modeling. In Proc. INTERSPEECH, 2017, pp. 2068–2072. [Oral]

6. Siyuan Feng, Tan Lee and Haipeng Wang. 2016. Exploiting language-mismatched phoneme recognizers for unsupervised acoustic modeling. In Proc. ISCSLP, 2016, pp. 1–5. [Oral]

7. Zhiyuan Peng*, Siyuan Feng* and Tan Lee. 2018. Adversarial Multi-Task Deep Features and Unsupervised Back-End Adaptation for Language Recognition. In Proc. ICASSP, 2019, pp. 5961–5965. [Poster]

8. Ying Qin, Tan Lee, Siyuan Feng and Anthony Pak Hin Kong. 2018. Automatic speech assessment for people with aphasia using TDNN-BLSTM with multi-task learning. In Proc. INTERSPEECH, 2018, pp. 3418–3422. [Poster]

9. Man-Ling Sung, Siyuan Feng and Tan Lee. 2018. Unsupervised pattern discovery from thematic speech archives based on multilingual bottleneck features. In Proc. APSIPA ASC, 2018, pp. 1448–1455. [Oral]

10. Yuanyuan Liu, Ying Qin, Siyuan Feng, Tan Lee and P.C. Ching. 2018. Disordered speech assessment using Kullback-Leibler divergence features with multi-task acoustic modeling. In Proc. ISCSLP, 2018, pp. 61–65. [Oral]