Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery
Lucas Ondel, Hari Krishna Vydana, Lukáš Burget, Jan Černocký
Brno University of Technology
{iondel,vydana,burget,cernocky}@fit.vutbr.cz

Abstract
This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given a set of labeled recordings from other languages. Our approach may be described as a two-step procedure: first, the model learns the notion of acoustic units from the labeled data; then, the model uses this knowledge to find new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) where each low-dimensional embedding represents an acoustic unit rather than just an HMM state. The subspace is trained on 3 languages from the GlobalPhone corpus (German, Polish and Spanish) and the acoustic units are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational Auto-Encoder-HMM.
Index Terms: Bayesian Inference, Hidden Markov Model, Subspace Model, Variational Bayes, Low-resource languages, Acoustic Unit Discovery
1. Introduction
State-of-the-art Automatic Speech Recognition (ASR) systems rely upon very large amounts of speech recordings paired with textual transcriptions. While this approach has proven very successful, it is limited to the very few languages having enough resources to train an ASR system. Due to the cost of data collection and transcription, broadening the range of speech technologies to any language remains an unreachable objective. Parallel to mainstream ASR, there has been a growing interest in the paradigm of unsupervised learning of speech [1]. Unsupervised speech learning attempts to use machine learning techniques to extract various information (phonetic content, speaker identity, ...) from unlabeled recordings. While this is considerably harder than standard ASR, solving this problem would have a considerable impact on the field by reducing the amount of human labour necessary to build a full-fledged ASR pipeline. It is also important to emphasize that linguistic diversity is diminishing worldwide. Many languages are now considered endangered and risk disappearing in the near future. Affordable speech technologies could be a precious tool to help linguists and communities document and preserve these languages.

This work focuses on the specific task of acoustic unit discovery (AUD). Given a collection of unlabeled recordings in a specific language, the task is to learn a set of basic speech units (also called pseudo-phones) to describe the language. AUD algorithms have to solve three problems: segmenting the speech, clustering the segments into units, and inferring how many units are necessary to describe the language. Several approaches have been proposed relying upon Bayesian non-parametric versions of the Hidden Markov Model (HMM) [2, 3, 4]. An important recent extension of this model is the VAE-HMM [5, 6, 7], which combines the traditional HMM with the Variational Auto-Encoder [8].
However, most AUD algorithms are prone to modeling speaker/channel or other non-phonetic variability. To address this issue, we propose the Bayesian Subspace HMM (SHMM). The SHMM is an HMM-based AUD model in which the parameters of each unit are constrained to lie in the phonetic subspace of the total parameter space. This restriction forces the AUD model to focus on the phonetic content of the speech signal and to ignore irrelevant information.
2. Model
Let X = (x_1, ..., x_N) be the sequence of N observed speech frames and U = {u_1, ..., u_P} be the set of P acoustic units. v = (v_1, ..., v_N), v_i ∈ U, is a sequence of variables indicating to which unit each speech frame is associated, and Z = (z_1, ..., z_N) are model-dependent latent variables. We consider generative models for which the complete likelihood of the data factorizes as:

  p(X, Z | v) = ∏_{n=1}^{N} p(x_n, z_n | v_n)    (1)

and the likelihood of a speech frame for a given unit is a member of the exponential family of distributions:

  p(x_n, z_n | v_n = u) = exp{ η_u^T T(x_n, z_n) − A(η_u) }    (2)

where η_u ∈ H is the D-dimensional vector of natural parameters corresponding to one acoustic unit, T(x_n, z_n) are the sufficient statistics and A(η_u) is the (log-)normalization constant of the density. Note that the nature of the model for the units (HMM, GMM, Linear Dynamical Model, ...) will depend on the value of z_n and the sufficient statistics T. In this work we consider that each unit is modeled by an HMM with a GMM for each state's emission, but it can be replaced by any model satisfying Eq. 1 and Eq. 2. Previous works [2, 3, 6] use special cases of this model to perform AUD.¹ More precisely, one can understand AUD as finding a set of vectors η_{u_1}, ..., η_{u_P} such that the likelihood of the observations is maximized. This search is difficult because speech recordings encode many factors other than the phonetic information (speaker identity, emotions, environment, ...) and the AUD algorithm may maximize the likelihood while modeling non-phonetic information.

¹These algorithms also learn the number of acoustic units P needed to fit the data.

Figure 1: (a) Directed Acyclic Graph of a Generalized Subspace Model. Dashed lines represent deterministic relationships between variables. The SHMM, JFA and SGMM are special cases of this model. In this work, each embedding h_u encodes the parameters of one HMM corresponding to an acoustic unit. (b) Illustration of the subspace model for acoustic units. Each point of the plane corresponds to the parameters of an acoustic unit model and the blue line represents the subspace defined by f(W^T h + b). Given an acoustic unit model corresponding to the sound aa, moving its parameters along the subspace will change the model to represent another unit/phone (ow, z in this example). Conversely, moving the parameters away from the phonetic subspace will push the model to capture non-phonetic information (for instance, speaker gender).

To prevent the AUD model from capturing non-phonetic information, we propose the Subspace HMM (SHMM), which constrains the parameters of the acoustic units to live in the phonetic subspace. This model extends the unsupervised HMM by assuming that the phonetic information of a language is contained in a subspace of the total parameter space. Formally, it is defined as:

  η_u = f(W^T h_u + b)    (3)

where f : R^D → H is a differentiable function. We further refine this subspace model by introducing a prior over the subspace's parameters:

  W_{r,c} ∼ N(0, σ²_{W_{r,c}})    (4)
  b ∼ N(0, I)    (5)
  h_u ∼ N(0, I)    (6)

As depicted in Fig. 1b, the bases of W span the subspace containing the phonetic variability. Since the parameters of the acoustic units are constrained to live in a low-dimensional subspace, the AUD algorithm can be seen as finding the set of embeddings h_{u_1}, ..., h_{u_P} which maximizes the likelihood of the observations. By constraining the search to the phonetic subspace, we therefore force the algorithm to ignore non-phonetic sources of variability. Note that the Subspace Gaussian Mixture Model [9], Joint Factor Analysis [10], the Subspace Multinomial Model [11], etc. are special cases of Eq. 3. In fact, Eq. 3 is the general form of any subspace model for which the complete likelihood is a member of the exponential family of distributions.
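As a concrete illustration of Eqs. 3-6, the sketch below draws the subspace parameters and unit embeddings from their priors and computes ψ_u = W^T h_u + b for every unit. The dimensions, the number of units and the prior scale σ_W are illustrative assumptions for this example, not values prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: D-dim natural-parameter space, low-dim embeddings,
# P acoustic units (chosen for the example only).
D, latent_dim, P = 3861, 100, 50
sigma_W = 1.0                                        # assumed scale for Eq. 4

W = rng.normal(0.0, sigma_W, size=(latent_dim, D))   # W_{r,c} ~ N(0, sigma_W^2)
b = rng.standard_normal(D)                           # b ~ N(0, I)   (Eq. 5)
H = rng.standard_normal((P, latent_dim))             # h_u ~ N(0, I) (Eq. 6)

# psi_u = W^T h_u + b for all P units at once; applying the mapping f to each
# row then yields the natural parameters eta_u of Eq. 3.
psi = H @ W + b
assert psi.shape == (P, D)
```

Each row of `psi` lives on the (latent_dim)-dimensional affine subspace of R^D spanned by the rows of W, which is exactly the constraint the SHMM imposes on its units.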
We denote Eq. 3 as the Generalized Subspace Model (GSM), of which the Subspace HMM, like the other aforementioned models, is just a special instance. The graphical representation of the GSM is depicted in Fig. 1a. To complete our definition of the SHMM, we need to specify the mapping f from R^D to the natural parameter space H. In our setting, each unit is modeled by an HMM with a 3-state left-to-right topology, and each state has a GMM emission with K Gaussian components with diagonal covariance matrices. For convenience, we introduce the vector ψ = W^T h + b, which can be decomposed into three parts ψ = (ψ_1, ψ_2, ψ_3)^T. ψ_i is the vector of parameters (before the mapping f) associated with the i-th HMM state. ψ_i further decomposes into ψ_i = (ψ^π_i, ψ^μ_{i,1}, ..., ψ^μ_{i,K}, ψ^Σ_{i,1}, ..., ψ^Σ_{i,K})^T, where ψ^π_i is the vector encoding the parameters of the mixture's weights, and ψ^μ_{i,j} and ψ^Σ_{i,j} are the vectors encoding the parameters of the mean and covariance matrix of the j-th Gaussian component, respectively. We set f such that:

  π_{i,j} = exp{ψ^{(π)}_{i,j}} / Σ_{k=1}^{K−1} exp{ψ^{(π)}_{i,k}}    (7)
  μ_{i,j} = ψ^{(μ)}_{i,j}    (8)
  Σ_{i,j} = diag(exp{ψ^{(Σ)}_{i,j}})    (9)

where exp is the elementwise exponential function. One could also include the transition probabilities of the HMM, but we kept them as fixed parameters in this work.

Unlike previous AUD algorithms, our model requires specifying the phonetic subspace (parameterized by W and b) before searching for the acoustic units. This is a "chicken or egg" problem, since we need the phonetic subspace to find the pseudo-phones of the language, and we need to know the phones of a language to estimate the subspace. However, this problem can be alleviated by observing that many languages in the world have common phones. It is reasonable to believe that the phonetic subspace of a language is well approximated by a phonetic subspace estimated from one or several other languages for which we have labeled data.
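The mapping f of Eqs. 7-9 can be sketched for a single state as follows. The layout of the slices (weights first, then means, then log-variances) and the realization of the K−1 free weight parameters via an appended zero logit are our illustrative assumptions; the paper fixes only the decomposition of ψ_i, not a storage order.

```python
import numpy as np

def f_state(psi_state, K, d):
    """Map one state's slice of psi = W^T h + b to GMM parameters (Eqs. 7-9):
    softmax over the weight parameters, identity for the means, elementwise
    exp for the covariance diagonal."""
    n_w = K - 1                                 # K-1 free weight parameters
    logits = np.append(psi_state[:n_w], 0.0)    # one common way to realize Eq. 7
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    mu = psi_state[n_w:n_w + K * d].reshape(K, d)           # Eq. 8
    var = np.exp(psi_state[n_w + K * d:].reshape(K, d))     # Eq. 9 (diagonal)
    return pi, mu, var

# Example with K=8 components and d=80 features: 7 + 640 + 640 = 1287 values
# per state, i.e. 3861 for a 3-state unit (the dimension quoted in Sec. 3).
psi_state = np.random.default_rng(0).normal(size=7 + 2 * 8 * 80)
pi, mu, var = f_state(psi_state, K=8, d=80)
assert abs(pi.sum() - 1.0) < 1e-9 and (var > 0).all()
```

The exp in Eq. 9 guarantees positive variances for any real-valued ψ, which is what makes the unconstrained subspace representation possible.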
Interestingly, this rationale naturally fits the Bayesian approach to the problem of AUD. Given an unlabeled set of observations X^(t) in a target language t, previous Bayesian AUD algorithms try to estimate the inventory of (pseudo-)phones U^(t) of the target language by estimating:

  p(U^(t) | X^(t)) = p(X^(t) | U^(t)) p(U^(t)) / p(X^(t))    (10)

If we now assume the phonetic subspace to be estimated from the observations X^(p) of another language p with a known inventory of phones U^(p), the problem can be reformulated as:

  p(U^(t) | X^(t), L^(p), S) = p(X^(t) | U^(t), L^(p), S) p(U^(t) | L^(p), S) / p(X^(t) | L^(p), S)    (11)
  L^(p) = {X^(p), U^(p)}    (12)
  S = {W, b}    (13)

The term p(U^(t) | L^(p), S) may be seen as an educated/informative prior which embeds the notion of phone into the AUD algorithm. This educated prior needs to be estimated as well, which leads to a two-step procedure for the SHMM AUD algorithm. First, given the labeled data of one or several languages, the prior over the acoustic units is estimated; informally speaking, we force the model to learn "what is a phone". Second, the unlabeled data of the target language is clustered into pseudo-phones given the phonetic knowledge acquired by the model during the first step.

The two steps of the training (learning the prior and clustering the units) are carried out by optimizing the same objective function, except that when estimating the prior, the acoustic unit transcription of each utterance is known. The presence or absence of the transcription is reflected in p(v).
When there is no transcription, p(v) can be understood as a "pseudo-phone" loop (see [3] for details), and when the transcription is known, p(v) is just the inference graph used for forced alignment in a traditional HMM-based ASR system.

Since the estimation of the exact posterior of the model's parameters is intractable, we use the Variational Bayes (VB) objective function to find an approximate posterior:

  L[q] = ⟨ln p(X | Ξ, Θ)⟩_q − D_KL(q(Ξ, Θ) || p(Ξ, Θ))    (14)
  Ξ = {Z, v}    (15)
  Θ = {W, b, h_{u_1}, ..., h_{u_P}}    (16)

where ⟨...⟩_q denotes the expectation w.r.t. the distribution q and D_KL is the Kullback-Leibler divergence. Eq. 14 is not tractable for an arbitrary distribution q; we therefore consider the restricted set of distributions with the following mean-field factorization and parameterization:

  q(Ξ, Θ) = q(Ξ; φ) q(Θ; ζ)    (17)
  ζ = {m, λ}    (18)
  q(Θ; ζ) = N(m, diag(exp{λ}))    (19)

The parameters φ of the variational posterior over Ξ depend on the type of model of the acoustic unit. For the case of an HMM, this is the probability of being in a particular state given the sequence of observations. Under these restrictions, the optimization reduces to:

  φ*, ζ* = argmax_{φ,ζ} L(φ, ζ)    (20)

Since we assume each unit to be modeled by an HMM, φ* has an analytical solution which can be efficiently calculated using the forward-backward algorithm [12]. ζ* has no analytical solution but can be found through a stochastic gradient ascent scheme.
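Since the variational posterior of Eq. 19 is a diagonal Gaussian and the priors of Eqs. 5-6 are standard normal, the D_KL term of Eq. 14 has a closed form for those parameters. A minimal sketch (the function name is ours):

```python
import numpy as np

def kl_gauss_std(m, lam):
    """KL( N(m, diag(exp(lam))) || N(0, I) ) in closed form:
    0.5 * sum(exp(lam) + m^2 - 1 - lam), since lam is the log-variance."""
    return 0.5 * np.sum(np.exp(lam) + m**2 - 1.0 - lam)

# When the posterior equals the prior (m = 0, lam = 0), the KL vanishes.
assert kl_gauss_std(np.zeros(5), np.zeros(5)) == 0.0
```

Having this term in closed form means only the expected log-likelihood of Eq. 14 needs Monte-Carlo approximation.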
Noting that ∇_φ L_ζ(φ*) = 0, we have:

  ∇_ζ L(φ*, ζ) = ∇_ζ L_{φ*}(ζ) + ∇_ζ φ* ∇_φ L_ζ(φ*)    (21)
               = ∇_ζ L_{φ*}(ζ)    (22)

Finally, we approximate ∇_ζ L(φ*, ζ) ≈ ∇_ζ L'(φ*, ζ) by using the so-called re-parameterization trick introduced in [8]:

  ε_l ∼ N(0, I)    (23)
  Θ_l = m + diag(exp{λ}) ε_l    (24)
  L(φ, ζ) ≈ (1/L) Σ_{l=1}^{L} ln p(X | Ξ, Θ_l) − D_KL(q(Ξ, Θ) || p(Ξ, Θ)) = L'(φ, ζ)    (25, 26)

In practice, we use the ADAM optimizer [13] to update ζ, and we use L = 10 samples to compute the empirical expectation. The parameters φ are re-estimated after a fixed number of updates of ζ.
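The Monte-Carlo estimate of Eqs. 23-25 can be sketched as below. The `log_lik` argument is a hypothetical stand-in for the model's actual ln p(X | Ξ, Θ), and the scale exp{λ} follows Eq. 24 as written.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_loglik_mc(log_lik, m, lam, L=10):
    """Re-parameterized Monte-Carlo estimate of the expectation in Eq. 25:
    Theta_l = m + diag(exp(lam)) * eps_l, so gradients w.r.t. zeta = (m, lam)
    can flow through the samples (the KL term of Eq. 14 is omitted here)."""
    eps = rng.standard_normal((L,) + m.shape)       # Eq. 23: eps_l ~ N(0, I)
    thetas = m + np.exp(lam) * eps                  # Eq. 24
    return np.mean([log_lik(t) for t in thetas])    # Eq. 25

# Toy check with a quadratic stand-in for the log-likelihood.
val = expected_loglik_mc(lambda t: -np.sum(t**2), np.zeros(3), np.zeros(3), L=10)
assert np.isfinite(val) and val <= 0.0
```

An autodiff framework applied to this estimate would give the gradient of Eq. 22 that ADAM then consumes.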
3. Experiments
We conducted our experiments with the TIMIT [14] database and 3 languages from the GlobalPhone corpus [15]: German (GE), Polish (PO) and Spanish (SP). For each of the three GlobalPhone languages, we kept only 3000 randomly selected utterances. We used two sets of features: (i) MFCC features concatenated with their first and second derivatives, and (ii) Multi-Lingual BottleNeck (MBN) features trained on 17 Babel languages [16]. The set of languages used to train the MBN features does not include English, German, Polish or Spanish. Both sets of features were extracted at a rate of 100 Hz. For the MBN features, the audio signal was down-sampled to 8 kHz.

We evaluated the different AUD algorithms in terms of phonetic segmentation and equivalent Phone Error Rate (eq. PER) [17, 5]. For the phonetic segmentation, we used the standard recall, precision and F-score measured against the timings provided in the TIMIT database with the 61 original phones. We tolerated boundaries shifted by ±2 frames (20 milliseconds). To compute the eq. PER, we mapped each acoustic unit to the one of the 61 phones it overlaps with the most. Then, we reduced the reference and proposed transcriptions to the 39-phone set [18] and computed the PER.
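The unit-to-phone mapping used for the eq. PER can be sketched as a frame-level majority vote. The function name is ours, and the handling of frame alignment is simplified to equal-length per-frame label sequences.

```python
from collections import Counter

def map_units_to_phones(unit_frames, phone_frames):
    """Map each discovered acoustic unit to the reference phone it overlaps
    the most, counting overlap at the frame level."""
    overlap = {}
    for u, p in zip(unit_frames, phone_frames):
        overlap.setdefault(u, Counter())[p] += 1
    return {u: c.most_common(1)[0][0] for u, c in overlap.items()}

# Toy example: unit a1 overlaps "aa" on 3 frames, a2 overlaps "ow" on 2 of 3.
units  = ["a1", "a1", "a2", "a2", "a2", "a1"]
phones = ["aa", "aa", "ow", "ow", "ih", "aa"]
assert map_units_to_phones(units, phones) == {"a1": "aa", "a2": "ow"}
```

After this mapping, the relabeled unit transcription can be scored against the reference with a standard PER computation.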
First, we ran a controlled experiment to assess whether the SHMM is able to properly learn the phonetic subspace of a language. In this experiment, we used the MBN features and each HMM state had 8 Gaussian components. First, we trained a Bayesian HMM phone recognizer on the 48-phone set with a flat phonotactic language model on the traditional TIMIT training set (no SA* utterances) and decoded on the test set, mapping the phones to the 39-phone set. This phone recognizer achieved 36.4% Phone Error Rate (PER). This number is very high since we have removed crucial elements of the traditional ASR pipeline (language model, context-dependent phones, ...) in order to evaluate the quality of the acoustic model. For comparison, we trained a monophone system with a flat phonotactic language model using the Kaldi toolkit [9], which yielded 37.3% PER. We then trained an SHMM-based phone recognizer with varying subspace dimension, using the same training and testing setup as the baseline HMM. We used the baseline model to provide the first estimate of φ, which we modified so that all the Gaussian components within a state have equal responsibility. We pre-trained the subspace for 15000 updates before updating φ; then we re-estimated φ after every 1000 updates of ζ for 30 iterations. Results, shown in Fig. 2, indicate that the SHMM is perfectly able to learn the phonetic subspace of a language by compressing the 3861-dimensional parameter space² into a subspace as small as 30 dimensions while still achieving the same PER as the HMM baseline.

²3861 = 3 states × (8 Gaussians × (80 + 80) + 7), where 80 is the feature dimension, accounting for the mean and the diagonal of the covariance matrix, and 7 is the dimension of the per-state mixture weights.

Figure 2: PER of the SHMM for varying subspace dimension.

We now consider the case of unsupervised learning of speech where English is assumed to be a low-resourced language. In this setup, we use the complete TIMIT set (training, development and test sets, including the SA* utterances) as the corpus from which to extract acoustic units. In this experiment, all the HMMs/SHMMs have 4 Gaussian components per state. Our baselines are the HMM-based AUD system described in [3] and the VAE-(B)HMM-based AUD systems proposed in [5, 6]. We compare these baselines with 3 SHMM-based AUD models for which the posterior of the phonetic subspace q(W, b) was estimated using: (i) German, (ii) German and Polish, (iii) German, Polish and Spanish. For each case, the phonetic subspace had 35, 70 and 100 dimensions, respectively. Note that the choice of the languages and the order of combination was arbitrary, and it is likely that choosing languages closely related to the target language would be beneficial. We considered all the phones of all the languages to be unique and did not merge them while estimating the subspace. The posteriors of the embeddings q(h_u) corresponding to the German, Polish and Spanish phones were discarded before the AUD clustering.

Table 1: Comparison of the SHMM against other AUD models in terms of phonetic segmentation (Recall, Precision, F-score) and equivalent Phone Error Rate (%).

Model         Features             Prior Lang.   Recall   Precision   F-score   eq. PER
HMM [5]       MFCC+Δ+ΔΔ            None          -        -           -         65.4
VAE-HMM [5]   MFCC+Δ+ΔΔ            None          -        -           -         58.9
VAE-BHMM [6]  log-mel FBANK+Δ+ΔΔ   None          -        -           -         56.57
HMM           MFCC+Δ+ΔΔ            None          66.47    57.81       61.84     64.92
HMM           MBN                  None          63.98    54.21       58.69     68.25
SHMM          MFCC+Δ+ΔΔ            GE            -        -           -         -
SHMM          MFCC+Δ+ΔΔ            GE+PO         73.94    74.47       74.20     58.23
SHMM          MFCC+Δ+ΔΔ            GE+PO+SP      75.03    74.00       74.51     56.91
SHMM          MBN                  GE            56.57    69.34       62.31     55.14
SHMM          MBN                  GE+PO         59.18    69.12       63.76     54.1
SHMM          MBN                  GE+PO+SP      60.89    68.41       64.43     49.2

The results are presented in Table 1 and differ significantly depending on the input features. The SHMM always benefits from learning the phonetic subspace in terms of eq. PER. Interestingly, the baseline HMM fails to benefit from the MBN features, as it underperforms compared to the HMM trained on MFCC features.
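The 3861-dimensional figure follows directly from the per-state parameter layout (K−1 mixture weights plus a mean and a covariance diagonal for each of the K Gaussians); a quick arithmetic check:

```python
def hmm_param_dim(n_states=3, n_comp=8, feat_dim=80):
    """Dimension of the natural-parameter space for one unit: per state,
    (K-1) mixture-weight parameters plus K means and K covariance diagonals
    of feat_dim values each."""
    per_state = (n_comp - 1) + 2 * n_comp * feat_dim
    return n_states * per_state

assert hmm_param_dim() == 3861  # 3 x (7 + 2 x 8 x 80)
```

Compressing this space to roughly 30-100 dimensions is what makes the embedding search of Sec. 2 tractable.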
The SHMM, thanks to its subspace, learns from the other languages to fully exploit the discriminatively trained features. Regarding the segmentation evaluation, the SHMM segments the speech better than the simple HMM. However, we observe that using more than one language does not necessarily improve the segmentation. Also, contrary to the eq. PER, the MBN features do not seem to be ideal for obtaining an accurate segmentation.

Finally, we tried to label the TIMIT corpus with an HMM phone recognizer (MBN features) trained on German; on German and Polish; and on German, Polish and Spanish, and we interpreted the output phones as acoustic units. For these 3 models, the eq. PER was 61.22% (GE), 66.47% (GE+PO) and 71.96% (GE+PO+SP). Contrary to the SHMM, this naive approach does not benefit from having more languages.
4. Conclusions
We proposed a new model for AUD: the Subspace HMM. Unlike other AUD models, the SHMM is first trained in a supervised fashion on one or several languages to learn the notion of "phone". This phonetic knowledge is encoded in a non-linear subspace of the total parameter space. Then, the SHMM searches for a set of acoustic units in this subspace which maximizes the likelihood of the observations of the target language. The SHMM outperforms HMM-based AUD and is competitive with the VAE-HMM. When using discriminatively trained features, the SHMM achieves 49.2% equivalent PER on TIMIT without any supervision in the target language.
5. Acknowledgements
The work was supported by Czech National Science Foundation (GACR) project "NEUREM3" No. 19-26934X, Czech Ministry of Interior project No. VI20152020025 "DRAPAK", and Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602". This work was also supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) MATERIAL program, via Air Force Research Laboratory (AFRL) contract.

6. References

[1] J. R. Glass, "Towards unsupervised speech processing," in ISSPA. IEEE, 2012, pp. 1-4.
[2] C. Lee and J. R. Glass, "A nonparametric Bayesian approach to acoustic model discovery," in ACL (1). The Association for Computer Linguistics, 2012, pp. 40-49.
[3] L. Ondel, L. Burget, and J. Černocký, "Variational inference for acoustic unit discovery," in Procedia Computer Science, vol. 81. Elsevier Science, 2016, pp. 80-86.
[4] L. Ondel, L. Burget, J. Černocký, and S. Kesiraju, "Bayesian phonotactic language model for acoustic unit discovery," in Proceedings of ICASSP 2017. IEEE Signal Processing Society, 2017, pp. 5750-5754.
[5] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj, "Hidden Markov model variational autoencoder for acoustic unit discovery," in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017.
[6] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, "Full Bayesian hidden Markov model variational autoencoder for acoustic unit discovery," in Interspeech. ISCA, 2018, pp. 2688-2692.
[7] L. Ondel, P. Godard, L. Besacier, E. Larsen, M. Hasegawa-Johnson, O. Scharenborg, E. Dupoux, L. Burget, F. Yvon, and S. Khudanpur, "Bayesian models for unit discovery on a very low resource language," in IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP, Calgary, Canada, 2018. [Online]. Available: sources/Ondel18bayesian.pdf
[8] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.
[9] D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann, Y. Qian, P. Schwarz, and G. Stemmer, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[10] P. Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," Tech. Rep., 2005.
[11] S. Kesiraju, L. Burget, I. Szőke, and J. Černocký, "Learning document representations using subspace multinomial model," in Proceedings of Interspeech 2016. International Speech Communication Association, 2016, pp. 700-704.
[12] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, 1989, pp. 257-286.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic phonetic continuous speech corpus CDROM," 1993.
[15] T. Schultz, N. T. Vu, and T. Schlippe, "GlobalPhone: A multilingual text & speech database in 20 languages," in ICASSP. IEEE, 2013, pp. 8126-8130.
[16] R. Fér, P. Matějka, F. Grézl, O. Plchot, K. Veselý, and J. H. Černocký, "Multilingually trained bottleneck features in spoken language recognition," Computer Speech & Language, vol. 46, pp. 252-267, 2017.
[17] H. Kamper, A. Jansen, and S. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition,"